Simple Sentiment Analysis using Naive Bayes and Logistic Regression

Sentiment Analysis refers to the use of Machine Learning and Natural Language Processing (NLP) to systematically detect emotions in text. In recent years, sentiment analysis has found broad adoption across industries. One reason for its popularity is that it is increasingly crucial for businesses to understand their customers, who have become used to expressing their opinions via the Internet and especially social media. AI-driven sentiment analysis enables businesses to systematically structure and use this information in their key business functions, such as marketing, product development and customer service.

The increasing relevance of sentiment analysis in social media and in the business context has motivated me to kick off a separate series on sentiment analysis as a subdomain of machine learning. This blog post starts with a short introduction to the concept of sentiment analysis and then demonstrates how to implement a sentiment classifier in Python using Naive Bayes and Logistic Regression.

Functionality of a sentiment classifier with three classes

Basics of Sentiment Classification

A basic task of sentiment analysis is to analyse sequences or paragraphs of text and measure the emotions expressed on a scale. In this way, it is possible to measure the emotions towards a certain topic, e.g. towards products, brands, political parties, services, or trends.

How sentiment analysis works can be shown through the following example. Consider the following text sequences:

  • This product is great!
  • I don’t like this ice cream at all.
  • Yesterday I’ve seen a dolphin.

While the first sentence clearly denotes a positive sentiment, the second sentence is negative, and in the third sentence, the sentiment is neutral. A sentiment classifier can automatically label these sentences:

Text Sequence | Sentiment Label
This product is great! | POSITIVE
I don’t like this ice cream at all. | NEGATIVE
Yesterday I’ve seen a dolphin. | NEUTRAL
Sentiment Labels of Text Sequences

Predicting sentiment classes opens the door to more advanced statistical analysis and automated text processing.

Use Cases for Sentiment Analysis

Sentiment analysis is used in various application domains:

  • Sentiment analysis can lead to a more efficient and better customer service by prioritizing customer requests. For example, when customers complain about services or products, an algorithm can identify and prioritize these messages, so that sales agents answer them first. This can increase customer satisfaction and reduce the churn rate.
  • Twitter and Amazon reviews have become the first port of call for many customers today when it comes to exchanging information about products, brands and trends, or expressing their own opinions. A sentiment classifier enables businesses to systematically evaluate social media posts and product reviews in real-time. In this way, for example, marketing managers can quickly obtain feedback on how well customers perceive campaigns and ads.
  • In stock market prediction, analysts evaluate the sentiment of social media posts or news feeds towards stocks or brands. The sentiment is then used as an additional feature alongside price data to create better forecasting models. Some forecasting approaches also rely exclusively on sentiment.

Sentiment Analysis will certainly find further adoption in the coming years. Especially in marketing and customer service, companies will increasingly use sentiment analysis to automate business processes and offer their customers a better customer experience.

Feature Modelling

An important step in the development of the sentiment classifier is language modeling. Before we can train a machine learning model, we need to bring the natural text into a structured format that can be statistically assessed in the training process. Various modelling techniques exist for this purpose. The two most common models are bag-of-words and n-grams.

Bag-of-words model

The bag-of-words model represents a text by the counts of its unique words and ignores the order in which they appear. This approach converts individual words into individual features. Filler words with low predictive power (so-called stop words), such as “the” or “a”, are filtered out. Consider the following text sample:

“Bob likes to play basketball. But his friend Daniel prefers to play soccer.”

Through filtering of stop words, this sample will be converted to:

“Bob”, “likes”, “play”, “basketball”, “friend”, “Daniel”, “prefers”, “play”, “soccer”.

In the next step, these words are converted into a normalized form, where each word becomes a column:

Text sample after transformation

The bag-of-words model is easy to implement. However, it does not consider grammar or word order.
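The counting idea can be sketched in a few lines of plain Python. This is only an illustration: the stop-word set below is a hypothetical, deliberately tiny list chosen for this one sample, and the regular-expression tokenizer is an assumption, not the tokenizer scikit-learn uses.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this sample only.
STOP_WORDS = {"to", "but", "his"}

def bag_of_words(text):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)

sample = "Bob likes to play basketball. But his friend Daniel prefers to play soccer."
bow = bag_of_words(sample)

# Each unique word is one feature; the value is how often it occurs.
for word, count in sorted(bow.items()):
    print(word, count)
```

The word “play” ends up with a count of 2, while all information about where it appeared in the sentence is lost.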

n-gram model

The n-gram model considers multiple consecutive words in a text sequence together and thus captures word sequence. The n stands for the number of words considered.

For example, in a 2-gram model, the sentence

“Bob likes to play basketball. But his friend Daniel prefers to play soccer.”

will be converted to the following model:

“Bob likes”, “likes to”, “to play”, “play basketball”, and so on.

The n-gram model is often used to supplement the bag-of-words model. It is also possible to combine different n-gram models. For a 3-gram model, the text would be converted to “Bob likes to”, “likes to play”, “to play basketball”, and so on. Combining multiple n-gram models, however, can quickly increase model complexity.
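The sliding-window idea behind n-grams can be sketched as follows. The helper name ngrams is made up for this illustration, and the simple whitespace tokenization is an assumption; only the first sentence of the sample is used so that no n-gram spans a sentence boundary.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in the text."""
    tokens = text.replace(".", "").split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Bob likes to play basketball."

print(ngrams(sentence, 2))
# ['Bob likes', 'likes to', 'to play', 'play basketball']
print(ngrams(sentence, 3))
# ['Bob likes to', 'likes to play', 'to play basketball']
```

In scikit-learn, the CountVectorizer offers the ngram_range parameter for the same purpose, e.g. ngram_range=(1, 2) to combine single words and 2-grams.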

Sentiment Classes and Model Training

The training of sentiment classifiers traditionally takes place in a supervised learning process. For this purpose, a training data set is used, which contains text sections with associated sentiment tendencies as prediction labels. Depending on which labels we provide together with the training data, the classifier will learn to predict sentiment on a more or less fine-grained scale. To be able to capture neutral sentiment as well, it is recommended to choose an odd number of classes.

More advanced classifiers can detect different sorts of emotions and, for example, detect whether someone expresses anger, happiness, sadness, and so on. It basically comes down to which prediction labels you provide with the training data.

When the classifier is trained on a 1-gram (single-word) model, the classifier will learn that certain words such as “good” or “great” increase the probability that a text is associated with a positive sentiment. Consequently, when the classifier encounters these words in a new text sample, it will predict a higher probability of a positive sentiment. On the other hand, the classifier will learn that words such as “hate” or “dislike” are often used to express negative opinions and thus increase the probability of a negative sentiment.

Language Complications

Is sentiment analysis really that simple? Well, not quite. The cases described so far were deliberately chosen to be very simple. However, human language is very complex, and many peculiarities make it more difficult in practice to identify the sentiment in a sentence or paragraph. Here are some examples:

  • Inversions: “this product is not so great”
  • Typos: “I live this product!”
  • Comparisons: “Product a is better than product z”.
  • Expression of pros and cons in a text passage: “An advantage is that. But on the other hand…”
  • Unknown vocabulary: “This product is just whuopii!”
  • Missing words: “How can you not this product?”

Fortunately, there are methods to deal with the above-mentioned complications. I will explain a bit more about them in one of my upcoming articles. But for now, let’s stay with the basics and implement a simple classifier.

Implementing a Sentiment Classifier in Python

Now that we know the basics, we can turn to the hands-on part of this tutorial! In the following, you will be guided through the process of building a classification model for detecting the sentiment in Twitter comments. We will make use of two different estimators: Logistic Regression and Naive Bayes. Finally, we will compare the prediction performance of the two models and make some test predictions.

The example covers the following steps:

  1. Defining the Problem
  2. Loading the Data
  3. Cleanups
  4. Exploring the Data
  5. Training two Sentiment Classifiers with Logistic Regression and Naive Bayes
  6. Measuring Multi-class Performance
  7. Comparing Classifier Performance
  8. Making Test Predictions

Python Environment

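This tutorial assumes that you have set up your Python environment. I recommend using the Anaconda environment. It is also assumed that you have the following packages installed: keras (2.0 or higher) with TensorFlow backend, numpy, pandas, matplotlib, and sklearn. The code in this tutorial additionally imports scikitplot and seaborn (the pip package names for sklearn and scikitplot are scikit-learn and scikit-plot). You can install these packages with the console command: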

pip install <packagename>

For a tutorial on setting up the Anaconda environment with the required packages, follow this link.

1) Defining the Problem

As always, when we start a new project, it is a good idea to first take some time and develop an initial idea of what we want to achieve. In this tutorial, we will work with a data set provided by Kaggle, which contains tens of thousands of tweets from Twitter. In addition, the sentiment for each tweet is given as positive, neutral, or negative. Consequently, our goal is to create a sentiment classifier that classifies new text sequences into one of the three sentiment classes.

2) Loading the Data

Let’s begin with the technical part. First, we will download the data from the Twitter sentiment example on the Kaggle website. If you are working with the Kaggle Python environment, you can also directly save the data into your Python project.

We will only use the following two csv files:

  • train.csv: contains the training data
  • test.csv: contains the test data for validation purposes

We will copy these two files into a folder that you can access from your Python environment. For simplicity, I recommend putting these files directly into the folder of your Python notebook. In case you put the csv files somewhere else, don’t forget to adjust the file paths in the code below.

After you have copied the files into your Python environment, the next step is to load the data into your project and convert it into a pandas DataFrame. The following code performs these steps and then prints an overview of the data.

import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import matplotlib

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, multilabel_confusion_matrix
import scikitplot as skplt

import seaborn as sns

# Load the train data
train_path = "train.csv"
train_df = pd.read_csv(train_path) 

# Load the test data
sub_test_path = "test.csv"
test_df = pd.read_csv(sub_test_path) 

# Taking a quick look at the data
print(train_df.shape, test_df.shape)
Train data

The train.csv file contains four columns:

  • textID: An identifier
  • text: The raw text
  • selected_text: Contains a selected part of the original text
  • sentiment: Contains the prediction label

3) Cleanups

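Before we start to build the classification model, we will transform the sentiment labels into numeric values. Scikit-learn estimators also accept text labels, but integer codes give the three classes a fixed order (negative, neutral, positive), which is convenient for sorting, plotting, and reporting.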

Three-class sentiment scale

The following code performs this job for us:

train_dfa = train_df.copy()
cleanup_nums = {"sentiment":     {"negative": 1, "neutral": 2, "positive": 3}}
train_dfa.replace(cleanup_nums, inplace=True)
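As an alternative sketch (not the code used in this tutorial), the same conversion can be written with Series.map on the sentiment column only. The tiny DataFrame and the names label_to_id and sentiment_id are made up for illustration. One practical difference: map turns any label that is missing from the dictionary into NaN, which makes unexpected labels easy to spot.

```python
import pandas as pd

# Hypothetical mini data set that mimics the two relevant columns of train.csv
df = pd.DataFrame({
    "text": ["I love this product", "Terrible service", "That is a tree"],
    "sentiment": ["positive", "negative", "neutral"],
})

label_to_id = {"negative": 1, "neutral": 2, "positive": 3}

# Map only the label column; unknown labels would become NaN
df["sentiment_id"] = df["sentiment"].map(label_to_id)

# Fail early if a label was not covered by the mapping
assert df["sentiment_id"].notna().all()

print(df["sentiment_id"].tolist())  # [3, 1, 2]
```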

4) Exploring the Data

It’s always good to check the label distribution for a potential imbalance. We do this by plotting the distribution of labels in the text samples.

ax = train_dfa['sentiment'].value_counts(sort=False).plot(kind='barh', color='#EE4747')
Labels are a little bit imbalanced

As we can see, our data is a bit imbalanced, but the differences between the classes are still within a range where we can neglect them in order to keep this example simple.
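To put a number on the imbalance instead of reading it off the chart, the class shares can be computed with value_counts(normalize=True). The label counts below are invented for illustration and are not the counts of the Kaggle data set.

```python
import pandas as pd

# Hypothetical label column: 4 neutral, 3 positive, 3 negative samples
labels = pd.Series(["neutral"] * 4 + ["positive"] * 3 + ["negative"] * 3, name="sentiment")

# Share of each class in the data
shares = labels.value_counts(normalize=True)

# Ratio between the largest and the smallest class as a simple imbalance indicator
imbalance_ratio = shares.max() / shares.min()

print(shares.round(2).to_dict())
print(round(imbalance_ratio, 2))
```

A ratio close to 1 indicates balanced classes; the further it moves away from 1, the more it is worth thinking about countermeasures such as resampling or class weights.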

Next, we will take a look at the training data. We will add an additional column in which we store the length of the text samples.

train_dfa['len'] = train_dfa['text'].str.len() # Store string length of each sample
train_dfa = train_dfa.sort_values(['len'], ascending=True)
train_dfa = train_dfa.dropna()
Our training data

The training dataset comprises 27480 text samples. We can also see that the text labels in the sentiment column have been converted to integer values in the previous step. Finally, we also take a look at the test data.

test_dfa = test_df.copy()
test_dfa.replace(cleanup_nums, inplace=True)
Our test data

The test data has 3533 records. Everything looks as expected. So we will continue.

5) Train a Sentiment Classifier

Next, it’s time to prepare the data and train a classification model. For the purpose of simplicity, we will use the Pipeline class of the scikit-learn framework and a bag-of-words model. A pipeline contains transformation activities and a final estimator. In this tutorial, we will use two different estimators:

  • Logistic Regression
  • Naive Bayes

5a) Sentiment Classifier with Logistic Regression

We will implement our first pipeline with a logistic regression estimator. We will add two transformers to our pipeline and the logistic regression estimator. So the following steps will be performed:

  • CountVectorizer: The vectorizer counts how often each word occurs in each text sequence and thereby creates the bag-of-words representation.
  • TfidfTransformer: The TF-IDF (term frequency-inverse document frequency) transformer scales down the impact of words that occur very often in the training data and are thus less informative for the estimator than words that occur in a smaller fraction of the text samples. Examples are words such as “to” or “a”.
  • Logistic Regression: With the liblinear solver and multi_class set to ‘auto’, scikit-learn applies logistic regression in a one-vs-rest (one-vs-all) approach. This approach splits our three-class prediction problem into three separate two-class problems, one per class. Each binary model differentiates between its own class and all other classes, and the final prediction is the class whose model returns the highest score.
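A one-vs-rest scheme can be inspected directly. The following is a minimal sketch under stated assumptions: it uses scikit-learn's OneVsRestClassifier as an explicit stand-in for the scheme (the tutorial's pipeline does not use this wrapper), and the six training phrases are invented toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Invented toy data: 1 = negative, 2 = neutral, 3 = positive
texts = ["great product", "love it", "terrible service",
         "hate it", "that is a tree", "it is monday"]
labels = [3, 3, 1, 1, 2, 2]

X = CountVectorizer().fit_transform(texts)

# One binary logistic regression model is fitted per class
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear")).fit(X, labels)
print(len(ovr.estimators_))  # 3

# Each binary model scores the sample; the class with the highest score wins
scores = ovr.decision_function(X[:1])
prediction = ovr.predict(X[:1])
print(scores.shape, prediction)
```

The point of the sketch is the structure, not the toy predictions: three classes lead to three binary models, and no sample is passed from one model to the next.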

The following code creates the pipeline and executes it with the training data. In this process, the data is transformed as described and the logistic regression model is fitted to the data. After executing the pipeline, we use the test data to validate the prediction model. Finally, we generate a classification report on the predictions and draw a confusion matrix to illustrate the results.

# Create a transformation pipeline
# The pipeline sequentially applies a list of transforms and as a final estimator logistic regression 
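pipeline_log = Pipeline([
                ('count', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                # Note: newer scikit-learn releases deprecate multi_class; remove the argument if your version complains
                ('clf', LogisticRegression(solver='liblinear', multi_class='auto')),
                ])

# Train model using the created sklearn pipeline
learner_log = pipeline_log.fit(train_dfa['text'], train_dfa['sentiment'])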

# Predict class labels using the learner function
test_dfa['pred'] = learner_log.predict(test_dfa['text'])
y_true = test_dfa['sentiment']
y_pred = test_dfa['pred']
target_names = ['negative', 'neutral', 'positive']

# Confusion Matrix
results_log = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
results_df_log = pd.DataFrame(results_log).transpose()
skplt.metrics.plot_confusion_matrix(y_true,  y_pred, figsize=(12,12))
Confusion matrix of our logistic regression model

Using our test data set, the Logistic Regression model was able to deliver the following results:

# | Sentiment Class | Total | Correctly classified

5b) Sentiment Classifier with Naive Bayes

We will reuse the code from the last step to create another pipeline. The only difference is that we will replace the logistic regression estimator with Naive Bayes (“MultinomialNB”). Naive Bayes calculates the probability of each class for our text sequences and then outputs the class with the highest probability.

For example, the words “likes” and “good” appear with a higher probability in texts of the category “positive sentiment” than in texts of the “negative” or “neutral” categories. Knowing these word probabilities helps the Naive Bayes classifier predict how likely it is that an unknown text containing those words belongs to each category. Naive Bayes is commonly used in natural language processing.
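The calculation can be illustrated by hand. The sketch below is a simplified multinomial Naive Bayes with add-one (Laplace) smoothing on four invented training phrases and only two classes; it is meant to show the principle, not to reproduce scikit-learn's MultinomialNB in every detail.

```python
import math
from collections import Counter

# Invented toy training data
train = [
    ("good product likes", "positive"),
    ("good service", "positive"),
    ("bad product", "negative"),
    ("hate bad service", "negative"),
]

word_counts = {}        # word frequencies per class
doc_counts = Counter()  # number of training texts per class
for text, label in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def log_score(text, label):
    """Log prior of the class plus the log likelihood of each known word."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def predict(text):
    """Return the class with the highest score."""
    return max(doc_counts, key=lambda label: log_score(text, label))

print(predict("good product"))  # positive
print(predict("hate bad"))      # negative
```

Because “good” occurs twice in the positive texts and never in the negative ones, it pulls the score of “good product” towards the positive class.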

Finally, again we create the classification report and plot the results in a confusion matrix.

# Create a pipeline which transforms phrases into normalized feature vectors and uses a bayes estimator
pipeline_bayes = Pipeline([
                ('count', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB()),
                ])

# Train model using the created sklearn pipeline
learner_bayes = pipeline_bayes.fit(train_dfa['text'], train_dfa['sentiment'])

# Predict class labels using the learner function
test_dfa['pred'] = learner_bayes.predict(test_dfa['text'])
y_true = test_dfa['sentiment']
y_pred_bayes = test_dfa['pred']
target_names = ['negative', 'neutral', 'positive']

# Classification report and confusion matrix (based on the Naive Bayes predictions)
results_bayes = classification_report(y_true, y_pred_bayes, target_names=target_names, output_dict=True)
results_df_bayes = pd.DataFrame(results_bayes).transpose()
skplt.metrics.plot_confusion_matrix(y_true, y_pred_bayes, figsize=(12,12))
Confusion matrix of our naive bayes model

Using our test data set, the Naive Bayes model was able to deliver the following results:

# | Sentiment Class | Total | Correctly classified

6) Measuring Multi-class Performance

So which classifier achieved the better performance? It’s not so easy to say. We will therefore compare the classification performance of our two classifiers using the following metrics:

Accuracy is calculated as the ratio between correctly predicted observations and total observations.

Precision is calculated as the ratio between the observations correctly labeled as positive (true positives) and all observations labeled as positive, i.e. the sum of true positives and false positives.

Recall is calculated as the ratio between the observations correctly labeled as positive (true positives) and all observations that actually belong to the positive class, i.e. the sum of true positives and false negatives.

F1-Score is the harmonic mean of precision and recall. It thereby takes both false positives and false negatives into account and is therefore useful when you have an unequal class distribution.
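These metrics can be computed by hand from true and predicted labels. The following sketch uses eight invented observations and treats one class at a time as the positive class; the function name per_class_metrics is made up for this illustration.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels: 1 = negative, 2 = neutral, 3 = positive
y_true = [1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 3, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_metrics(y_true, y_pred, positive=3)

print(accuracy)               # 0.625
print(precision, recall, f1)  # each is 2/3 for class 3
```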


You may wonder which of our three classes is the positive class. The answer is that we have to determine the positive class ourselves. The other classes will then be counted as negative. You can see this in the classification reports created in sections 5a and 5b, which contain separate metrics for each label. By defining the positive class, we can take into account that some classes may be more important than others.

Another option is to use a weighted average (the “weighted avg” row of the classification report) that weights the metric of each label by the number of observations with that label in the dataset. For example, the neutral label is weighted a bit higher than the negative and positive labels, because there are fewer observations with negative and positive labels present in the data. Because our classes are all equally important, I decided to use the weighted average.
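A short sketch of the weighted average, assuming the weighting used in the “weighted avg” row of scikit-learn's classification report, where each class is weighted by its support (the number of true observations of that class). The labels are invented. The example also shows a useful property: the support-weighted recall is identical to the accuracy.

```python
from collections import Counter

# Invented labels: 1 = negative, 2 = neutral, 3 = positive
y_true = [1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 3, 1]

support = Counter(y_true)  # number of true observations per class

def recall(label):
    """Share of the observations of `label` that were predicted correctly."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / support[label]

# Weight the recall of each class by its share of all observations
weighted_recall = sum(support[label] / len(y_true) * recall(label) for label in support)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(weighted_recall, accuracy)
```

Both values are 5/8 here, because the class sizes cancel out and only the total number of correct predictions remains.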

7) Comparing Classifier Performance

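The following code reads the weighted-average performance metrics of the two classifiers from the classification reports and then creates a bar plot to illustrate the results. The weighted-average recall always equals the accuracy, which is why the code takes the accuracy from the recall column.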

# Plotting the data

# Preparing bayes classifier metrics
bayes_precision = results_df_bayes['precision'].at['weighted avg']
bayes_f1_score = results_df_bayes['f1-score'].at['weighted avg']
bayes_accuracy = results_df_bayes['recall'].at['weighted avg']

# Preparing logistic regression classifier metrics
log_precision = results_df_log['precision'].at['weighted avg']
log_f1_score = results_df_log['f1-score'].at['weighted avg']
log_accuracy = results_df_log['recall'].at['weighted avg']

# Preparing the plot
fig, ax1 = plt.subplots(figsize=(6, 8))

# set width of bar
barWidth = 0.15
# set height of bar
accuracy = [log_accuracy, bayes_accuracy]
f1_score = [log_f1_score, bayes_f1_score]
precision = [log_precision, bayes_precision]

# Set position of bar on X axis
r1 = np.arange(2)
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
# Make the plot
plt.bar(r1, accuracy, color='#EE4747', width=barWidth, edgecolor='white', label='accuracy')
plt.bar(r2, f1_score, color='#3333ff', width=barWidth, edgecolor='white', label='f1-score')
plt.bar(r3, precision, color='#2d7f5e', width=barWidth, edgecolor='white', label='precision')
# Add xticks on the middle of the group bars
plt.xlabel('algorithm', fontweight='bold')
plt.xticks([r + barWidth for r in range(len(accuracy))], ['logistic regression', 'bayes'])
# Create legend & Show graphic
plt.legend()
plt.show()
Performance Comparison between Logistic Regression and Naive Bayes

So we see that our Logistic Regression model performs slightly better than the Naive Bayes model. Of course there are still many possibilities to further improve the models. In addition, there are a number of other methods and algorithms with which the performance could be significantly increased.

8) Making Test Predictions

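Finally, we use one of our classifiers to generate some test predictions. The code below uses the logistic regression model (learner_log); replace it with learner_bayes to try the Naive Bayes model. Feel free to try it out! Simply change the text in the testphrases list and convince yourself that the classifier works.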

testphrases = ['Mondays just suck!', 'I love this product', 'That is a tree', 'Terrible service']
# Map the numeric class ids back to readable labels (defined once, without shadowing the built-in dict)
label_names = {1: 'Negative', 2: 'Neutral', 3: 'Positive'}
for testphrase in testphrases:
    resultx = learner_log.predict([testphrase])
    print(testphrase + '-> ' + label_names[resultx[0]])
Mondays just suck!-> Negative
I love this product-> Positive
That is a tree-> Neutral
Terrible service-> Negative
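Besides the predicted label, it can be informative to look at the class probabilities behind a prediction. The sketch below is self-contained and therefore trains a hypothetical toy pipeline (toy_model) on six invented phrases; with the models from this tutorial, the same predict_proba call should work on learner_log or learner_bayes.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = negative, 2 = neutral, 3 = positive
texts = ["I love this product", "great service", "terrible service",
         "Mondays just suck", "that is a tree", "it is a table"]
labels = [3, 3, 1, 1, 2, 2]

toy_model = Pipeline([
    ("count", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
]).fit(texts, labels)

label_names = {1: "Negative", 2: "Neutral", 3: "Positive"}

# predict_proba returns one probability per class, in the order of classes_
probabilities = toy_model.predict_proba(["I love this service"])[0]
for class_id, probability in zip(toy_model.named_steps["clf"].classes_, probabilities):
    print(f"{label_names[int(class_id)]}: {probability:.2f}")
```

Probabilities that are close to each other indicate that the classifier is unsure, which is common for short phrases or words it has not seen during training.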


In this tutorial, you have learned how to build a simple sentiment classifier that can detect sentiment expressed through text on a three-class scale. We have compared Logistic Regression and Naive Bayes and made some test predictions.

The best way to deepen your knowledge on sentiment analysis is to apply it in practice. I thus want to encourage you to use your knowledge by tackling other NLP challenges. For example, you could build a classification model that assigns text phrases to labels such as sport, fashion, cars, technology, and so on. The prerequisite is that you have sufficient amounts of data available to train the classifier.

Let me know if you liked this tutorial and also if you didn’t. I appreciate your feedback!

Follow Florian Müller:

Data Scientist & Machine Learning Consultant

Hi, my name is Florian! I am a Zurich-based Data Scientist with a passion for Artificial Intelligence and Machine Learning. After completing my PhD in Business Informatics at the University of Bremen, I started working as a Machine Learning Consultant for the Swiss consulting firm ipt. When I'm not working on use cases for our clients, I work on my own analytics projects and report on them in this blog.
