Sentiment Analysis with Naive Bayes Classifier
Classification is a fundamental aspect of both human and machine intelligence. In this blog, we'll explore a specific text classification problem: sentiment analysis. The most common approach to text classification in natural language processing involves supervised machine learning. Formally, the task of supervised classification is to take an input $x$ and a fixed set of output classes $C = \{c_1, c_2, \ldots, c_M\}$, and to return a predicted class $c \in C$.
Many kinds of machine learning algorithms are used to build classifiers. In this blog we are going to use Naive Bayes, which belongs to the family of generative classifiers.
It is called Naive Bayes because it is a Bayesian classifier that makes the naive assumption that the features are independent of each other given the class.
For this blog we are going to use the Twitter sentiment analysis dataset from Kaggle.
Before implementing the Naive Bayes classifier, let's first understand the theory behind it.
Naive Bayes is a probabilistic classifier, meaning that for a document $d$, out of all classes $c \in C$, it returns the class $\hat{c}$ with the maximum posterior probability given the document:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}} \; P(c \mid d)$$

By Bayes' theorem, the posterior can be written as:

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$

where:

- $P(d \mid c)$ : the likelihood of the document given the class
- $P(c)$ : the prior probability of the class
- $P(d)$ : the probability of the document (the evidence)
The denominator $P(d)$ is the probability of the document itself and does not depend on the class.
Substituting Bayes' theorem into the argmax above, we get:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}} \; \frac{P(d \mid c)\, P(c)}{P(d)}$$

Since the denominator $P(d)$ is the same for every class, it can be dropped:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}} \; P(d \mid c)\, P(c)$$
The likelihood $P(d \mid c)$ is where the naive independence assumption comes in. Representing the document as a set of features $f_1, f_2, \ldots, f_n$ and assuming the features are independent of each other given the class:

$$P(d \mid c) = P(f_1, f_2, \ldots, f_n \mid c) = \prod_{i=1}^{n} P(f_i \mid c)$$

where:

- $f_i$ : the $i$-th feature of the document
- $n$ : the number of features
The prior $P(c)$ is estimated from the training set as the fraction of documents that belong to class $c$:

$$P(c) = \frac{N_c}{N_{doc}}$$

where:

- $N_c$ : the number of training documents with class $c$
- $N_{doc}$ : the total number of training documents
Now that we have the likelihood and the prior, we can calculate the posterior probability of each class for a given document. The class with the maximum posterior probability is the predicted class for the document.
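To make the decision rule concrete, here is a minimal sketch in Python. The `priors` and `likelihoods` dictionaries are hypothetical placeholders for the estimates described above, and the sketch sums log probabilities instead of multiplying raw probabilities, a standard trick to avoid numerical underflow:

```python
import math

def predict(doc_features, priors, likelihoods):
    # priors[c] is P(c); likelihoods[c][f] is P(f | c)
    best_class, best_score = None, float('-inf')
    for c in priors:
        score = math.log(priors[c]) + sum(math.log(likelihoods[c][f]) for f in doc_features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```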
To apply the Naive Bayes classifier to text, we use each word in the document as a feature, as suggested above, and we consider each of the words by walking an index through every word position in the document:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}} \; P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$$

where:

- $i \in \text{positions}$ : $i$ ranges over every word position in the document
- $w_i$ : the word at position $i$
To calculate the probability of a word given a class, $P(w_i \mid c)$, we use the maximum likelihood estimate: the fraction of times the word $w_i$ appears among all word tokens in the training documents of class $c$:

$$P(w_i \mid c) = \frac{\text{count}(w_i, c)}{\sum_{w \in V} \text{count}(w, c)}$$

where:

- $\text{count}(w_i, c)$ : the number of times $w_i$ occurs in training documents of class $c$
- $V$ : the vocabulary, i.e. the set of all unique words in the training set
But the above estimate becomes problematic when we encounter a word that never occurs with a particular class in the training set.
For example, imagine we are trying to classify a document that contains the word "apple", but "apple" never appears in any training document of the negative class. In this case the probability $P(\text{apple} \mid \text{negative})$ is zero, and because we multiply all the word likelihoods together, the posterior for the negative class collapses to zero no matter how strong the other evidence is. To avoid this, we use Laplace (add-one) smoothing:

$$P(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \left(\text{count}(w, c) + 1\right)} = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + \mid V \mid}$$
where:
- $\mid V \mid$ : the size of the vocabulary (total number of unique words in the training set)
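As a minimal sketch, the smoothed estimate can be written as a small helper (here `word_counts_c` is a hypothetical dictionary mapping each word to its count in the training documents of class $c$):

```python
def word_likelihood(word, word_counts_c, vocab_size):
    # add-one (Laplace) smoothed estimate of P(word | c); unseen words get a small non-zero probability
    return (word_counts_c.get(word, 0) + 1) / (sum(word_counts_c.values()) + vocab_size)
```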
Let's walk through a simple example to understand how the Naive Bayes classifier works. Consider a training set with two classes, positive and negative, containing the following documents:
| Document | Class |
|---|---|
| I love this movie | positive |
| Just plain boring | negative |
| most fun movie ever | positive |
| no fun at all | negative |
Vocabulary for the positive class: {I, love, this, movie, most, fun, ever}
Vocabulary for the negative class: {just, plain, boring, no, fun, at, all}
Complete vocabulary: {I, love, this, movie, most, fun, ever, just, plain, boring, no, at, all}, so $\mid V \mid = 13$.
Given the above training set, we want to classify the document "I had fun". To do this, we first calculate the prior probabilities of each class:

$$P(\text{positive}) = \frac{2}{4} = 0.5, \qquad P(\text{negative}) = \frac{2}{4} = 0.5$$
Next, we calculate the likelihood of observing the document "I had fun" given each class, using the add-one smoothed estimates. The word "had" does not appear anywhere in the training vocabulary, so we drop it and multiply the likelihoods of the remaining words:
For the positive class (8 word tokens, $\mid V \mid = 13$):

$$P(\text{I} \mid \text{positive}) = \frac{1+1}{8+13} = \frac{2}{21}, \qquad P(\text{fun} \mid \text{positive}) = \frac{1+1}{8+13} = \frac{2}{21}$$

$$P(\text{"I had fun"} \mid \text{positive}) = \frac{2}{21} \times \frac{2}{21} \approx 0.0091$$
For the negative class (7 word tokens, $\mid V \mid = 13$):

$$P(\text{I} \mid \text{negative}) = \frac{0+1}{7+13} = \frac{1}{20}, \qquad P(\text{fun} \mid \text{negative}) = \frac{1+1}{7+13} = \frac{2}{20}$$

$$P(\text{"I had fun"} \mid \text{negative}) = \frac{1}{20} \times \frac{2}{20} = 0.005$$
Finally, we calculate the (unnormalized) posterior of each class for the document "I had fun" and predict the class with the maximum posterior:

$$P(\text{positive}) \times P(\text{"I had fun"} \mid \text{positive}) = 0.5 \times 0.0091 \approx 0.0045$$

$$P(\text{negative}) \times P(\text{"I had fun"} \mid \text{negative}) = 0.5 \times 0.005 = 0.0025$$
Since $0.0045 > 0.0025$, the classifier predicts the positive class for "I had fun".
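If you want to double-check the arithmetic, here is a short, self-contained sketch that reproduces the example above (add-one smoothing, with unknown words such as "had" dropped):

```python
from collections import Counter

train = [("I love this movie", "positive"), ("most fun movie ever", "positive"),
         ("just plain boring", "negative"), ("no fun at all", "negative")]

counts = {c: Counter() for c in ("positive", "negative")}
for text, label in train:
    counts[label].update(text.lower().split())

vocab = {w for c in counts for w in counts[c]}      # 13 unique words
priors = {c: 2 / 4 for c in counts}                 # two documents per class

def posterior(doc, c):
    score = priors[c]
    for w in doc.lower().split():
        if w in vocab:                               # unknown words ("had") are dropped
            score *= (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
    return score

for c in ("positive", "negative"):
    print(c, round(posterior("I had fun", c), 4))    # positive 0.0045, negative 0.0025
```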
Now that we have a good understanding of the theory behind the Naive Bayes classifier, let's implement it in Python using the Twitter sentiment analysis dataset.
First, we need to load the dataset and preprocess it. We will also split the dataset into training and testing sets.
```python
import pandas as pd

# load only the sentiment and text columns from the Kaggle Twitter dataset
data = pd.read_csv('twitter_training.csv', header=None, usecols=[2, 3], names=['sentiment', 'text'])

# sample 40% of the rows to keep training fast, then drop rows with missing text
data = data.sample(frac=0.4).reset_index(drop=True)
data.dropna(inplace=True)
```
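To sanity-check the load (the exact shape depends on the random 40% sample), it helps to peek at a few rows:

```python
print(data.shape)
print(data.head())
```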
Next, we need to preprocess the text data by tokenizing it and removing stopwords. Then we will convert the text into a matrix of token counts using the CountVectorizer class from scikit-learn, which is an implementation of the bag-of-words model.
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words('english'))  # needs nltk.download('stopwords') and nltk.download('punkt')

def tokenize(text):
    # keep only alphabetic tokens that are not stopwords
    return [word for word in word_tokenize(text) if word.isalpha() and word not in stop_words]

vectorizer = CountVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(data.text).toarray()
y = (data.sentiment == 'Positive').astype(int)  # 1 for Positive tweets, 0 otherwise
```
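As a quick sanity check of the bag-of-words representation (the exact numbers depend on the sample drawn above, and `get_feature_names_out` assumes scikit-learn ≥ 1.0):

```python
print(X.shape)                                   # (number of tweets, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])   # a few words from the learned vocabulary
```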
Next, we will split the data into training and testing sets using the train_test_split function from scikit-learn.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Now we can train the Naive Bayes classifier using the training data. We will use the MultinomialNB class from scikit-learn to train the classifier.
```python
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)
```
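As a quick usage example (the tweets below are made up for illustration), the fitted vectorizer and the trained model can be chained to score new text:

```python
new_tweets = ["I absolutely love this game", "this update is so boring"]
new_X = vectorizer.transform(new_tweets).toarray()
print(nb.predict(new_X))   # 1 = Positive, 0 = everything else
```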
Finally, we can evaluate the performance of the classifier on the testing data using the score method.
```python
nb.score(X_test, y_test)
```
The score method returns the mean accuracy of the classifier on the testing data. In this case, the accuracy represents the proportion of correctly classified documents in the testing set.
Although accuracy is a useful metric for evaluating the classifier, it is not always the best one, especially when the classes are imbalanced. Other metrics such as precision, recall, and F1 score can provide more insight into the performance of the classifier.
Precision is the proportion of true positive predictions among all positive predictions:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is the proportion of true positive predictions among all actual positive instances:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1 score is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
We can calculate these metrics using the accuracy_score, precision_score, recall_score, and f1_score functions from scikit-learn.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = nb.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):0.2f}')
print(f'F1: {f1_score(y_test, y_pred):0.2f}')
print(f'Precision: {precision_score(y_test, y_pred):0.2f}')
print(f'Recall: {recall_score(y_test, y_pred):0.2f}')
```
We get the following output:

```
Accuracy: 0.85
F1: 0.84
Precision: 0.84
Recall: 0.83
```
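How impressive these numbers are depends partly on how balanced the labels actually are, which is easy to check (the exact proportions depend on the sample drawn earlier):

```python
print(y_train.value_counts(normalize=True))   # fraction of positive (1) vs non-positive (0) tweets
```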
Before concluding, I want to highlight one more important aspect of Naive Bayes: it is a generative classifier. This means that it models the joint probability distribution of the features and the class, allowing it to generate new samples from the learned distribution. This capability can be particularly useful in scenarios where we need to create new samples that resemble the training data.
Let's generate a new sample using the learned Naive Bayes model.
```python
import numpy as np

# sample one word from P(w | class) for the positive class (label 1)
generated_class = 1
generated_features = np.random.multinomial(n=1, pvals=np.exp(nb.feature_log_prob_[generated_class]))

# map the sampled count vector back to the corresponding word in the vocabulary
generated_data_point = vectorizer.inverse_transform(generated_features.reshape(1, -1))
print("Generated Data Point:", generated_data_point[0][0])
```
Output:

```
Generated Data Point: interesting
```
In this blog, we explored the theory behind the Naive Bayes classifier and implemented it in Python using the Twitter sentiment analysis dataset. We also evaluated the performance of the classifier using accuracy, precision, recall, and F1 score. Finally, we generated a new sample using the learned Naive Bayes model.