Text Classification

Purpose:
Natural language processing (NLP) has become widely popular. With the large amount of data available (in emails, web pages, SMS), it has become important to extract valuable information from textual data. An assortment of machine learning techniques has been designed to accomplish this task. With current advances in deep learning, we felt it would be interesting to compare traditional and deep learning techniques. We picked a playground Kaggle data set for text classification and implemented both types of algorithms for comparison.

Problem

In today’s world, websites have to deal with toxic and divisive content. This is especially true of major websites like Quora, which handle large amounts of traffic and whose purpose is to give people a platform for asking and answering questions. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.

A question is classified as insincere if it:

  • Has a non-neutral tone directed at someone
  • Is discriminatory or contains abusive language
  • Contains false information

For more information regarding the challenge you can use the following link.

Code

The full code is available here.

Methodology

In this article we will tackle text classification by using machine learning and NLP techniques. For any data science problem with textual data the common steps include:

  • Data exploration
  • Text pre-processing
  • Feature engineering
    • Text sentiment
    • Topic modelling
    • TFIDF and Count Vectorizer
    • Text Descriptive Features
  • Model selection and Evaluation

Let’s explore them step by step in more detail.

Data Exploration

Data exploration is one of the most important steps of any project: you need to familiarize yourself with the data before implementing any modeling technique.

import os
print(os.listdir("../input"))

Our dataset includes:

  • train.csv – the training set
  • test.csv – the test set
  • sample_submission.csv – A sample submission in the correct format
  • embeddings – Folder containing word embeddings.

We are not allowed to use any external data sources. The following embeddings are given to us which can be used for building our models.

A look at the size of our train and test data:

  • Shape of train: 1,306,122 rows and 3 columns
  • Shape of test: 56,370 rows and 2 columns

What does the data look like?

In the target variable 1 represents the class Insincere and 0 the Sincere class of questions.

Let’s explore the distribution of the target variable:

import numpy as np
import pandas as pd
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

# Assumption: train.csv sits in ../input, as listed above
train_df = pd.read_csv("../input/train.csv")

## target distribution ##
cnt_srs = train_df['target'].value_counts()
labels = np.array(cnt_srs.index)
sizes = np.array((cnt_srs / cnt_srs.sum()) * 100)  # share of each class in percent

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(
    title='Target distribution',
    font=dict(size=18),
    width=600,
    height=600,
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="usertype")

Target distribution (1: Insincere, 0: Sincere)

Box Plots:

The box plots shared below help show whether there are any patterns in the dataset regarding the word count or the number of characters.

Insincere questions have more words per question

Insincere questions have more characters than sincere questions

Sincere questions have less punctuation

Sincere questions have more upper-case words

Word Clouds:

For questions classified as sincere we see general words like “will”, “one” and so on. The word “will” is also prevalent in insincere questions, so common words will have to be handled during the data processing steps. The word cloud also brings out how words like “Trump” and “liberal” are very specific to insincere questions, possibly because the person is making a statement about these topics rather than genuinely seeking an answer.

Sincere

Insincere

Text pre-processing

Unstructured text data is usually dirty: it contains misspelled words, inconsistent casing and various other issues. We need to clean the text and bring it to a standardized form before extracting information from it; without this step, the noise results in a poor model.

Broadly, consider the following steps:

Tokenization:

Tokenization refers to the splitting of strings of text into smaller chunks or tokens. Paragraphs or large bodies of text are tokenized into sentences and then sentences are broken down into individual words.
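The two-level split described above can be sketched with the standard library alone. The regular expressions here are simplifying assumptions and are much cruder than NLTK's sent_tokenize and word_tokenize, which handle abbreviations and other edge cases; the sample text is made up.

```python
import re

def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Return the word tokens (letters, digits, apostrophes) of a sentence."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Why is the sky blue? It scatters sunlight. Isn't that neat!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
print(sentences)
print(tokens)
```

Punctuation is dropped at the word level here, which is convenient for the cleaning steps that follow but would lose information if punctuation counts were needed as features.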

Normalization:

This refers to a series of steps that transforms the corpus of text into a single standard and consistent form. The following steps are a part of this process:

  • Converting all letters to lowercase
  • Removing punctuation marks, numbers, stop words (a, is, will etc.)

Stemming involves chopping off the end of a word or its inflectional endings (-ing, -ed etc.) to get its root form or stem, using crude heuristic rules.

burning -> burn.

Stemming works well most of the time, but it can return stems that are not real words.

difficulties -> difficulti

Lemmatization has the same goal as stemming. However, it uses a vocabulary and the morphological analysis of words, to remove inflectional endings and return the dictionary form of a word, known as the lemma. Unlike stemming, lemmatization aims to reduce the word properly so that it makes sense according to the language.

ran -> run, difficulties -> difficulty

The idea behind stemming or lemmatization is to reduce related words to a common form. For example, “difficulties” and “difficulty” convey the same intent and context.
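To make the contrast concrete, here is a toy sketch. The suffix rules are invented for illustration and are not the Porter algorithm, and the lookup table is a hypothetical stand-in for a real vocabulary such as WordNet.

```python
# Toy suffix rules, tried in order; NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-dictionary standing in for a real vocabulary such as WordNet.
LEMMA_LOOKUP = {"ran": "run", "difficulties": "difficulty", "burning": "burn"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters before it."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMA_LOOKUP.get(word, word)

for w in ["burning", "difficulties", "ran"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based stemmer cannot map “ran” to “run” because no suffix rule applies, whereas the dictionary lookup can. That is the practical difference between the two approaches.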

For our use case we have performed the following operations to clean the data (using the library NLTK):

  • Convert to lower case.
  • Remove punctuation and numbers.
  • Remove stop words: the NLTK corpus contains 179 stop words such as “for”, “having”, “yours” and so on.
  • Lemmatize words.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')

# all_data is assumed to hold the questions in a 'question_text' column

# Lower case
all_data['question_text'] = all_data['question_text'].apply(lambda text: " ".join(w.lower() for w in text.split()))
# Remove punctuation (the pattern is a regex, so regex=True is passed explicitly)
all_data['question_text'] = all_data['question_text'].str.replace(r'[^\w\s]', '', regex=True)
# Remove numbers
all_data['question_text'] = all_data['question_text'].str.replace(r'[0-9]', '', regex=True)
# Remove stop words and words of length <= 2
stop = set(stopwords.words('english'))
all_data['question_text'] = all_data['question_text'].apply(lambda text: " ".join(w for w in text.split() if w not in stop and len(w) > 2))
# Lemmatize, treating every word as a verb (pos='v')
wl = WordNetLemmatizer()
all_data['question_text'] = all_data['question_text'].apply(lambda text: " ".join(wl.lemmatize(w, 'v') for w in text.split()))

Feature Engineering

This part is what makes the difference between a good and a bad solution in any ML project. So what features can we create in our use case? We can start with understanding the sentiment.

Text sentiment:

Sentiment analysis is a part of opinion mining and involves building a system to extract the opinion expressed in a text. That is, we wish to get a score that tells us how positive or negative the text is.
Our assumption for this data set was that questions flagged as insincere may contain toxic content and would therefore exhibit a negative sentiment. As a modeling feature, however, sentiment turned out to be weak. On deeper evaluation, we noticed several questions with high polarity scores under both the insincere and sincere tags.
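The article does not say which tool produced the polarity scores, so the following is only a toy illustration of what a polarity score is. The mini-lexicon and the example questions are invented; real sentiment libraries use far larger word lists and more elaborate rules.

```python
import re

# Hypothetical mini-lexicon; real tools ship far larger word lists.
POLARITY = {"love": 1.0, "best": 1.0, "good": 0.5,
            "hate": -1.0, "worst": -1.0, "bad": -0.5}

def polarity_score(text):
    """Mean polarity of the lexicon words found in text, in [-1, 1]; 0.0 if none."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [POLARITY[w] for w in words if w in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

positive = polarity_score("What is the best way to learn good habits?")
negative = polarity_score("Why do people hate the worst team?")
neutral = polarity_score("How do airplanes fly?")
print(positive, negative, neutral)
```

A perfectly sincere question can contain strongly negative words, which is one plausible reason why polarity on its own separates the two classes poorly.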

Topic modelling:

Topic modelling is an approach to identify topics present across a corpus of text. A topic is defined as a repeating pattern of co-occurring terms in a corpus. A document contains multiple topics in varying proportions. So, for example, a document based on healthcare is more likely to contain a higher ratio of words like “doctor” and “surgery” than words such as “brakes” and “gear”, which indicate a theme of automobiles.

Using a technique like Latent Dirichlet Allocation to get the distribution of topics across the corpus would potentially help to get a sense of the themes discussed in the set of questions. Further, we hypothesize that there would be some difference between the topics of sincere and insincere questions.
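As a rough sketch of how such topic distributions can be obtained, the snippet below fits scikit-learn's LatentDirichletAllocation on a tiny made-up corpus. The four questions and the choice of two topics are assumptions for illustration only; the topic numbers in this article imply that the real model used many more topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already cleaned questions; not taken from the competition data.
questions = [
    "doctor surgery hospital patient",
    "surgery doctor medicine patient",
    "brakes gear engine car",
    "car engine gear repair",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(questions)

# n_components is the number of topics; 2 is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per question, each row sums to 1

# Map column indices back to words to list the top terms of each topic.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
for topic_id, weights in enumerate(lda.components_):
    top = [index_to_word[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {topic_id}: {', '.join(top)}")
print(doc_topics.round(2))
```

Averaging the rows of doc_topics separately for each class would give a per-class topic distribution of the kind compared in this section.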

The image below shows the distribution of the topics with respect to the different classes by taking an average.

Sincere Topic Distribution
Insincere Topic Distribution

Top words from the leading topics of the insincere class:

  • Topic 45: trump, part, president, donald, drink, similar, sport, websites, suffer, insurance, abroad, court, respect, would, wall.
  • Topic 59: quora, question, ask, answer, wear, control, actually, treat, people, hear, worst, western, racist, many, opportunities.
  • Topic 62: sex, hate, act, culture, pakistan, add, society, doctor, bring, present, people, search, pressure, characteristics, enjoy.
  • Topic 77: want, don’t, tell, guy, try, like, know, doesn’t, kill, people, say, let, brain, get, would.
  • Topic 79: women, men, white, black, water, watch, share, video, others, character, youtube, save, problem, prevent, people.

Top words from the leading topics of the sincere class:

  • Topic 0: use, like, best, possible, make, come, cause, good, become, would, singer, get, know, happen, etc.
  • Topic 1: make, use, like, best, cause, good, happen, many, would, find, better, nutritional, jar, work, venus.
  • Topic 34: job, engineer, company, chinese, get, work, project, interview, graduate, best, include, india, example, good, accord.
  • Topic 56: someone, feel, love, man, process, like, post, view, would, care, else, give, advice, step, night.
  • Topic 77: want, dont, tell, guy, try, like, know, doesnt, kill, people, say, let, brain, get, would.

Count Vectorizer/tf-idf:

CountVectorizer returns a matrix which shows the frequency of each term in the vocabulary per document. On the other hand, tf-idf (term frequency-inverse document frequency) evaluates how important a word is to a document in the corpus.

tf(x) = (Number of times term x appears in a document) / (Total number of terms in the document)

idf(x) = log(Total number of documents / Number of documents with term x in it)

tf-idf(x) = tf(x) * idf(x)

Clearly, the importance of a word in a document increases in proportion to the number of times it appears there. This is offset, however, by the number of documents in the corpus that contain it.
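The three formulas above can be checked by hand with a few lines of plain Python. The tokenized documents are made up, and the functions follow the textbook definitions given here. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their numbers will differ.

```python
import math

# Made-up tokenized documents for illustration.
docs = [
    ["wall", "trump", "wall"],
    ["best", "job"],
    ["best", "wall"],
]

def tf(term, doc):
    """Occurrences of term in doc divided by the number of terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

wall_score = tf_idf("wall", docs[0], docs)    # frequent here, but common in the corpus
trump_score = tf_idf("trump", docs[0], docs)  # rarer in the corpus, so weighted higher
print(wall_score, trump_score)
```

Even though “wall” appears twice in the first document and “trump” only once, “trump” receives the higher weight because it occurs in fewer documents.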

Both tf-idf and countvectorizer, as features, may indicate the relevance of a certain set of words to questions labelled as “Sincere” as well as “Insincere”.

The image below is obtained by using a TF-IDF vectorizer to create features and a k-fold CV logistic regression model and it shows the words (of insincere questions) with most weight.

Text Descriptive features

The idea behind building features such as the number of unique words, characters or exclamation points is to check for uniformity in the data set. We wish to observe if there are some similarities between the train and test set. Some questions that these meta features help answer include:

  • Does our test set consist of much shorter questions than the train set?
  • Is an insincere question more likely to be haphazardly framed, with disregard for correct punctuation and possibly an abnormally high punctuation count?
  • Does a user writing a toxic or insincere question use uppercase letters more liberally?

The examples mentioned above suggest that there might be patterns specific to each class that can be leveraged in our model. As an ad hoc example from music of how useful meta features can be, Eminem's number of words per minute differs with the content and emotion of the song.

Some of the meta features are listed below:

  • Number of Words, Unique Words, Characters

We have added some box plots in the data exploration section to provide you with an idea regarding the prevalent distribution with respect to the different classes.
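A minimal sketch of how such meta features might be computed for a single question is shown below. The function name and the exact feature set are assumptions for illustration; they cover the counts discussed in this section and in the box plots.

```python
import string

def meta_features(question):
    """Return simple descriptive counts for one raw (uncleaned) question."""
    words = question.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(w.lower() for w in words)),
        "num_chars": len(question),
        "num_punctuation": sum(ch in string.punctuation for ch in question),
        "num_upper_words": sum(w.isupper() for w in words),
    }

features = meta_features("WHY do people do THIS?!")
print(features)
```

These counts have to be taken from the raw text, before lowercasing and punctuation removal, since the cleaning steps erase exactly the signals being measured.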

Model

So far we have cleaned up our text and carried out feature engineering. There are several ways to select the relevant features. For the purpose of this article, however, we decided to build a separate model for each set of features, as this helps develop a general understanding and makes it easier to apply these tactics to other text classification datasets.

A few things to note:

  • We are using the F1 score as our performance metric, as required by the competition rules. Given the imbalance in the data, it also gives a better picture than accuracy.
  • For each model we use 5-fold cross-validation.
  • To find a suitable threshold for converting the probabilities to binary labels, we wrote a loop that tries multiple candidate thresholds and chooses the one that maximizes the F1 score. The F1 score is calculated on the validation data.
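To illustrate why accuracy can mislead on imbalanced data, consider the small pure-Python sketch below. The 94/6 class split is a hypothetical example, not the actual ratio of the competition data.

```python
def f1(y_true, y_pred):
    """F1 score for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 94 sincere (0) and 6 insincere (1) questions.
y_true = [0] * 94 + [1] * 6
always_sincere = [0] * 100  # a useless model that never flags anything

acc = accuracy(y_true, always_sincere)
score = f1(y_true, always_sincere)
print(acc, score)
```

The model that never flags a question reaches 94% accuracy but an F1 score of zero, which is why F1 is the more informative metric here.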

There are two pieces of code that will be reused in most of the models:

import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from tqdm import tqdm

# train, test and selected_features are assumed to be defined earlier
start_time = time.time()
kf = KFold(n_splits=5, shuffle=True, random_state=43)
## Initialize 0's
test_pred_ots = 0
oof_pred_ots = np.zeros([train.shape[0],])

train_target = train['target'].values

x_test = test[selected_features].values

## Loop to split the data set
for i, (train_index, val_index) in tqdm(enumerate(kf.split(train))):
    # kf.split yields positional indices, so use iloc rather than loc
    x_train = train.iloc[train_index][selected_features].values
    x_val = train.iloc[val_index][selected_features].values
    y_train, y_val = train_target[train_index], train_target[val_index]

    # Model
    classifier = LogisticRegression(C=0.1)
    classifier.fit(x_train, y_train)

    ## Validation set predictions
    val_preds = classifier.predict_proba(x_val)[:, 1]

    ## Test set predictions, averaged over the folds
    preds = classifier.predict_proba(x_test)[:, 1]
    test_pred_ots += preds / kf.get_n_splits()
    oof_pred_ots[val_index] = val_preds
print("--- %s seconds for Model Selected Features ---" % (time.time() - start_time))

The code above runs 5-fold cross-validation; with each split we train the model and make predictions on the validation and test data. After all splits, oof_pred_ots holds the validation-fold predictions combined into a single array with one prediction per training row, and test_pred_ots holds the test-set probabilities averaged over the splits.

from sklearn.metrics import f1_score

thresh_opt_ots = 0.5
f1_opt = 0
# Try thresholds from 0.10 to 0.90 and keep the one with the best F1 score
for thresh in np.arange(0.1, 0.91, 0.01):
    thresh = np.round(thresh, 2)
    f1 = f1_score(train_target, (oof_pred_ots > thresh).astype(int))
    #print("F1 score at threshold {0} is {1}".format(thresh, f1))
    if f1_opt < f1:
        f1_opt = f1
        thresh_opt_ots = thresh
print(thresh_opt_ots)
pred_train_ots = (oof_pred_ots > thresh_opt_ots).astype(int)
print(f1_score(train_target, pred_train_ots))

The code above will help find the best threshold.

First Model:
We used the text descriptive features and ran a logistic regression model with 5-fold cross-validation; however, the F1 score was low (0.27).

Second Model:
We used the sentiment and topic modelling features and ran the same model as before. This time we got a better score (0.34).

Third Model:
We used TF-IDF features and tried logistic regression (F1 – 0.587) and LightGBM (F1 – 0.591). This is much better.

Fourth Model:
We used CountVectorizer features and tried logistic regression (F1 – 0.592), multinomial naive Bayes (F1 – 0.55) and Bernoulli naive Bayes (F1 – 0.53).

Ensemble:

The idea here is that one model might pick up patterns that another misses. An ensemble can also give better results while reducing the chance of overfitting. We used stacking, which requires a prediction from each base model for every row of the train set. This is accomplished by splitting the data at each fold into a train set and a holdout set, fitting on the train set and predicting on the holdout set. The splits are chosen so that every row of the train data set receives a prediction.
We then use these predictions from the respective models as input variables and run another (logistic regression) model on top of them, giving us the final probabilities.
Our final F1 score was 0.604, and 0.589 on the leaderboard.
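The stacking procedure can be sketched as follows. This is an illustration on synthetic data generated with make_classification, with two arbitrarily chosen base models; it is not the exact set of models or features used for the scores reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data; the real inputs would be the text features built earlier.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

base_models = [LogisticRegression(C=0.1, max_iter=1000), BernoulliNB()]
kf = KFold(n_splits=5, shuffle=True, random_state=43)

# One column of out-of-fold probabilities per base model.
oof = np.zeros((X.shape[0], len(base_models)))
for train_idx, hold_idx in kf.split(X):
    for col, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[hold_idx, col] = model.predict_proba(X[hold_idx])[:, 1]

# Second-level model trained on the base models' holdout predictions.
meta_model = LogisticRegression()
meta_model.fit(oof, y)
final_proba = meta_model.predict_proba(oof)[:, 1]
print(final_proba[:5])
```

Because every row's base prediction comes from a model that never saw that row, the second-level model learns from honest estimates. Scoring the second-level model on the same rows it was trained on is still optimistic, so a further validation split would be needed for a fair estimate.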

What’s Next

In the next article we will implement a deep learning approach to the same use case and draw comparisons between the two methodologies.

About Us

Data science discovery is a step on the path of your data science journey. Please follow us on LinkedIn to stay updated.

About the writers:

  • Ujjayant Sinha: Data science enthusiast with interest in natural language problems.
  • Ankit Gadi: A knack and passion for data science, coupled with a strong foundation in Operations Research and Statistics, has helped me embark on my data science journey.