<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Logistic Regression Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/machine-learning-algorithms/logistic-regression/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/machine-learning-algorithms/logistic-regression/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:38:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Logistic Regression Archives - relataly.com</title>
	<link>https://www.relataly.com/category/machine-learning-algorithms/logistic-regression/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python</title>
		<link>https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/</link>
					<comments>https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sat, 20 Jun 2020 21:49:05 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (multi-class)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Naive Bayes]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Sentiment Analysis]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[AI in Business]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<category><![CDATA[Social Media Data]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2007</guid>

					<description><![CDATA[<p>Are you ready to learn about the exciting world of social media sentiment analysis using Python? In this article, we&#8217;ll dive into how companies are leveraging machine learning to extract insights from Twitter comments, and how you can do the same. By comparing two popular classification models &#8211; Naive Bayes and Logistic Regression &#8211; we&#8217;ll ... <a title="Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python" class="read-more" href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" aria-label="Read more about Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/">Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Are you ready to learn about the exciting world of social media sentiment analysis using Python? In this article, we&#8217;ll dive into how companies are leveraging machine learning to extract insights from Twitter comments, and how you can do the same. By comparing two popular classification models &#8211; Naive Bayes and Logistic Regression &#8211; we&#8217;ll help you identify which one best fits your needs.</p>



<p class="wp-block-paragraph">Businesses are using sentiment analysis to make better sense of the vast amounts of data available online and on social media platforms. Understanding customer opinions and feedback can help companies identify trends and make more informed decisions. Whether you&#8217;re a business professional looking to leverage the power of social media data or a machine learning enthusiast, this article has everything you need to get started.</p>



<p class="wp-block-paragraph">We&#8217;ll begin with an introduction to the concept of sentiment analysis and its theoretical foundations. Then, we&#8217;ll guide you through the practical steps of implementing a sentiment classifier in Python. Our model will analyze text snippets and categorize them into one of three sentiment categories: &#8220;positive,&#8221; &#8220;neutral,&#8221; or &#8220;negative.&#8221; Finally, we&#8217;ll compare the performance of Naive Bayes and Logistic Regression classifiers.</p>



<p class="wp-block-paragraph">By the end of this article, you&#8217;ll have the skills and knowledge to perform sentiment analysis on social media data and apply these insights to your business or personal projects. So let&#8217;s jump right in!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="512" height="423" data-attachment-id="13349" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/flo7up_a_person_talking_to_a_virtual_assistant-_colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-copy/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy.png" data-orig-size="920,760" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy.png" src="https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy-512x423.png" alt="Sentiment analysis has various use cases from analyzing social media to reviewing customer feedback in call centers." 
class="wp-image-13349" srcset="https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/Flo7up_a_person_talking_to_a_virtual_assistant._Colorful_popart_2f34967a-ce4e-420d-bc01-75a4e47c1181-Copy.png 920w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Sentiment analysis has various use cases from analyzing social media to reviewing customer feedback in call centers.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/" target="_blank" rel="noreferrer noopener">Classifying Purchase Intention of Online Shoppers with Python</a></p>
</div>
</div>



<h2 class="wp-block-heading">What is Sentiment Analysis?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Sentiment analysis is the process of identifying the sentiment, or emotional tone, of a piece of text. This can be useful for a wide range of applications, such as identifying customer sentiment towards a product or service, or detecting the overall sentiment of a social media post or news article.</p>



<p class="wp-block-paragraph">Sentiment analysis is typically performed using natural language processing (NLP) techniques and machine learning algorithms. These tools allow computers to &#8220;understand&#8221; the meaning of text and identify the sentiment it contains. Sentiment analysis can be performed at various levels of granularity, from identifying the sentiment of an entire document to identifying the sentiment of individual words or phrases within a document.</p>



<h2 class="wp-block-heading">How Sentiment Classification Works</h2>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4767" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-61-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-61.png" data-orig-size="1171,492" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-61" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-61.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-61-1024x430.png" alt="sentiment classification using bayes and logistic regression in python" class="wp-image-4767" width="772" height="326" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-61.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-61.png 768w" sizes="(max-width: 772px) 100vw, 772px" /><figcaption class="wp-element-caption">A sentiment classifier with three classes</figcaption></figure>



<p class="wp-block-paragraph">There are many different approaches to sentiment analysis, and the methods used vary with the application and the type of text being analyzed. Two common techniques are machine learning algorithms that classify text as positive, negative, or neutral, and lexicons, that is, lists of words with predefined sentiment scores that are used to rate individual words or phrases. Either way, it becomes possible to measure the sentiment towards a specific topic, e.g., products, brands, political parties, services, or trends.</p>



<p class="wp-block-paragraph">We can show how sentiment analysis works with a simple example:</p>



<ul class="wp-block-list">
<li>&#8220;This product is excellent!&#8221;</li>



<li>&#8220;I don&#8217;t like this ice cream at all.&#8221;</li>



<li>&#8220;Yesterday, I saw a dolphin.&#8221;</li>
</ul>



<p class="wp-block-paragraph">While the first sentence denotes a positive sentiment, the second sentence is negative, and in the third sentence, the sentiment is neutral. A sentiment classifier can automatically label these sentences:</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Text Sequence</strong></td><td class="has-text-align-left" data-align="left"><strong>Sentiment Label</strong></td></tr><tr><td>This product is excellent!</td><td class="has-text-align-left" data-align="left">POSITIVE</td></tr><tr><td>I don&#8217;t like this ice cream at all.</td><td class="has-text-align-left" data-align="left">NEGATIVE</td></tr><tr><td>Yesterday, I saw a dolphin.</td><td class="has-text-align-left" data-align="left">NEUTRAL</td></tr></tbody></table><figcaption class="wp-element-caption">Sentiment Labels of Text Sequences</figcaption></figure>



<p class="wp-block-paragraph">Predicting sentiment classes opens the door to more advanced statistical analysis and automated text processing. </p>
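
<p class="wp-block-paragraph">To make the lexicon idea concrete, here is a minimal sketch of a rule-based scorer. The word list, the simple negation handling, and the function name <em>lexicon_sentiment</em> are illustrative assumptions and not part of the model trained later in this tutorial; real lexicons contain thousands of scored entries.</p>

```python
import re

# Hypothetical mini-lexicon: word -> polarity (+1 positive, -1 negative)
LEXICON = {"excellent": 1, "great": 1, "good": 1, "like": 1,
           "hate": -1, "dislike": -1, "bad": -1}
# Words that flip the polarity of the next sentiment word
NEGATORS = {"not", "don't", "never", "no"}


def lexicon_sentiment(text):
    """Label a text by summing the polarity of the lexicon words it contains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
        elif token in LEXICON:
            polarity = LEXICON[token]
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    return "negative"


for sentence in ["This product is excellent!",
                 "I don't like this ice cream at all.",
                 "Yesterday, I saw a dolphin."]:
    print(sentence, "->", lexicon_sentiment(sentence))
```

<p class="wp-block-paragraph">A scorer like this is transparent but brittle: any word missing from the lexicon is ignored, which is one reason why the rest of this article relies on trained classifiers instead.</p>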
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-use-cases-for-sentiment-analysis">Use Cases for Sentiment Analysis</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Sentiment analysis is used in various application domains, including the following:</p>



<ul class="wp-block-list">
<li>Sentiment analysis can lead to more efficient customer service by prioritizing customer requests. For example, when customers complain about services or products, an algorithm can identify and prioritize these messages so that sales agents answer them first. This can increase customer satisfaction and reduce the churn rate. </li>



<li>Twitter and Amazon reviews have become the first port of call for many customers today when exchanging information about products, brands, and trends or expressing their own opinions. A sentiment classifier enables businesses to evaluate this information systematically. It can process social media posts and product reviews in real time. For example, marketing managers can quickly obtain feedback on how well customers perceive campaigns and ads.</li>



<li>In stock market prediction, analysts measure the sentiment of social media or news feeds towards stocks or brands. The sentiment is then used as an additional feature alongside price data to create better forecasting models. Some forecasting approaches also rely exclusively on sentiment.</li>
</ul>



<p class="wp-block-paragraph">Sentiment analysis is likely to see further adoption in the coming years. Especially in marketing and customer service, companies will increasingly use it to automate business processes and offer their customers a better experience.</p>



<h3 class="wp-block-heading" id="h-how-sentiment-analysis-works-feature-modelling">How Sentiment Analysis Works: Feature Modelling</h3>



<p class="wp-block-paragraph">An essential step in the development of a sentiment classifier is language modeling. Before we can train a machine learning model, we need to bring the natural text into a structured format that the model can statistically assess in the training process. Various modeling techniques exist for this purpose. Two of the most common are <strong>bag-of-words</strong> and <strong>n-grams</strong>.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/business-use-cases-for-openai-gpt-models-chatgpt-davinci/12200/" target="_blank" rel="noreferrer noopener">9 Powerful Applications of OpenAI&#8217;s ChatGPT and Davinci</a></p>



<h4 class="wp-block-heading" id="h-bag-of-word-model">Bag-of-words Model</h4>



<p class="wp-block-paragraph">The bag-of-words model represents a text by how often each unique word occurs in it. This approach converts individual words into individual features. Stop words with low predictive power, such as &#8220;the&#8221; or &#8220;a,&#8221; are filtered out. Consider the following text sample:</p>



<p class="wp-block-paragraph"><em>&#8220;Bob likes to play basketball. But his friend Daniel prefers to play soccer.&#8221;</em></p>



<p class="wp-block-paragraph">After filtering out the stop words, this sample becomes:</p>



<p class="wp-block-paragraph"><em>&#8220;Bob&#8221;, &#8220;likes&#8221;, &#8220;play&#8221;, &#8220;basketball&#8221;, &#8220;friend&#8221;, &#8220;Daniel&#8221;, &#8220;play&#8221;, &#8220;soccer&#8221;</em>.</p>



<p class="wp-block-paragraph">In the next step, the algorithm converts these words into a normalized form, where each word becomes a column:</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4227" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-15-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-15.png" data-orig-size="1046,134" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-15" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-15.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-15-1024x131.png" alt="" class="wp-image-4227" width="613" height="77" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-15.png 1024w, https://www.relataly.com/wp-content/uploads/2021/05/image-15.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-15.png 768w" sizes="(max-width: 613px) 100vw, 613px" /><figcaption class="wp-element-caption">Text sample after transformation</figcaption></figure>



<p class="wp-block-paragraph">The bag-of-words model is easy to implement. However, it does not consider grammar or word order.</p>
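
<p class="wp-block-paragraph">As a rough sketch of this transformation, the snippet below applies scikit-learn&#8217;s <em>CountVectorizer</em> to the sample sentence. Using the built-in English stop-word list is an assumption made for illustration; the words that are removed depend on the stop-word list used, so the result can differ slightly from the list above.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = "Bob likes to play basketball. But his friend Daniel prefers to play soccer."

# stop_words="english" drops common words such as "to" before counting
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform([sample])

# vocabulary_ maps each remaining (lowercased) word to its column index
for word, column in sorted(vectorizer.vocabulary_.items()):
    print(word, counts[0, column])
```

<p class="wp-block-paragraph">Each remaining word becomes one column, and the value in that column is the number of times the word occurs, so &#8220;play&#8221; receives a count of 2.</p>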



<h4 class="wp-block-heading" id="h-what-is-an-n-gram-model">What is an N-gram Model?</h4>



<p class="wp-block-paragraph">The n-gram model considers multiple consecutive words in a text sequence and thus captures word sequence. The n stands for the number of words considered. </p>



<p class="wp-block-paragraph">For example, in a 2-gram model, the sentence <em>&#8220;Bob likes to play basketball. But his friend Daniel prefers to play soccer.&#8221;</em> is split into the following word pairs:</p>



<p class="wp-block-paragraph">&#8220;Bob likes,&#8221; &#8220;likes to,&#8221; &#8220;to play,&#8221; &#8220;play basketball,&#8221; and so on. The n-gram model is often used to supplement the bag-of-words model. It is also possible to combine different n-gram models. For a 3-gram model, the text would be converted to &#8220;Bob likes to,&#8221; &#8220;likes to play,&#8221; &#8220;to play basketball,&#8221; and so on. Combining multiple n-gram models, however, can quickly increase model complexity.</p>



<h3 class="wp-block-heading" id="h-sentiment-classes-and-model-training">Sentiment Classes and Model Training </h3>



<p class="wp-block-paragraph">The training of sentiment classifiers traditionally takes place in a supervised learning process. For this purpose, we use a training data set that contains text samples together with their sentiment as prediction labels. Depending on the labels and the training data we provide, the classifier will learn to predict sentiment on a more or less fine-grained scale. If the scale should include a neutral midpoint between positive and negative, we typically choose an odd number of classes.</p>



<p class="wp-block-paragraph">More advanced classifiers can detect different sorts of emotions and, for example, detect whether someone expresses anger, happiness, sadness, and so on. It basically comes down to which prediction labels you provide with the training data.</p>



<p class="wp-block-paragraph">When the classifier is trained on a unigram (1-gram) model, it will learn that certain words such as &#8220;good&#8221; or &#8220;great&#8221; increase the probability that a text is associated with a positive sentiment. Consequently, when the classifier encounters these words in a new text sample, it will predict a higher probability of positive sentiment. On the other hand, the classifier will learn that words such as &#8220;hate&#8221; or &#8220;dislike&#8221; are often used to express negative opinions and thus increase the probability of negative sentiment.</p>
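
<p class="wp-block-paragraph">This mechanism can be illustrated with a toy example. The six training texts below are made up for demonstration and are not taken from the Twitter dataset; the sketch assumes a unigram <em>CountVectorizer</em> combined with a Multinomial Naive Bayes classifier.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data, only to illustrate the mechanism
texts = ["good product", "great service", "great and good",
         "I hate it", "I dislike this", "hate and dislike"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

model = Pipeline([
    ("vectorizer", CountVectorizer()),   # unigram word counts
    ("classifier", MultinomialNB()),     # Naive Bayes on those counts
])
model.fit(texts, labels)

for text in ["great product", "I hate this"]:
    probabilities = dict(zip(model.classes_, model.predict_proba([text])[0].round(2)))
    print(text, "->", model.predict([text])[0], probabilities)
```

<p class="wp-block-paragraph">Because &#8220;great&#8221; only occurs in positive training texts and &#8220;hate&#8221; only in negative ones, the predicted class probabilities shift accordingly.</p>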



<h3 class="wp-block-heading" id="h-language-complications">Language Complications </h3>



<p class="wp-block-paragraph">Is sentiment analysis that simple? Well, not quite. The cases described so far were deliberately chosen to be very simple. However, human language is very complex, and many peculiarities make it more difficult in practice to identify the sentiment in a sentence or paragraph. Here are some examples:</p>



<ul class="wp-block-list">
<li>Negations: &#8220;This product is not so great.&#8221;</li>



<li>Typos: &#8220;I live this product!&#8221;</li>



<li>Comparisons: &#8220;Product a is better than product z.&#8221;</li>



<li>Pros and cons expressed in the same passage: &#8220;An advantage is that&#8230; But on the other hand&#8230;&#8221;</li>



<li>Unknown vocabulary: &#8220;This product is just whuopii!&#8221;</li>



<li>Missing words: &#8220;How can you not  this product?&#8221;</li>
</ul>



<p class="wp-block-paragraph">Fortunately, there are methods to solve the complications mentioned above. I will explain more about them in one of my future articles. But for now, let&#8217;s stay with the basics and implement a simple classifier.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-training-a-sentiment-classifier-using-twitter-data-in-python">Training a Sentiment Classifier Using Twitter Data in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Venturing into the practical aspects of sentiment classification, our aim in this tutorial is to create an efficient sentiment classifier. Our focus will be on a dataset provided by Kaggle, comprising tens of thousands of tweets, each categorized as positive, neutral, or negative.</p>



<p class="wp-block-paragraph">Our objective is to design a classifier capable of assigning one of these three sentiment categories to new text sequences. To this end, we will employ two distinct algorithms &#8211; Logistic Regression and Naive Bayes &#8211; as our estimators.</p>



<p class="wp-block-paragraph">The tutorial culminates with a comparative analysis of the prediction performance of both models, followed by a set of test predictions. Through this hands-on approach, you will gain an understanding of the nuances of sentiment classification and its application in understanding public opinion, especially on social media platforms like Twitter.</p>



<p class="wp-block-paragraph">Boost your sentiment analysis skills with our step-by-step guide, and learn to leverage machine learning tools for precise sentiment prediction.</p>



<p class="wp-block-paragraph">The code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_9fa82e-91"><a class="kb-button kt-button button kb-btn_c78218-5c kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/08%20Natural%20Language%20Processing/700%20NLP%20-%20Simple%20Sentiment%20Analysis%20using%20Bayes%20and%20Logistic%20Regression.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_41bac4-d5 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">scikit-learn</a>, <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">seaborn</a> for visualization, and scikit-plot, which the code below imports as <em>scikitplot</em>.</p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
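
<p class="wp-block-paragraph">If you are unsure whether everything is installed, a short check like the following can help. It is a sketch that assumes the import names used in this tutorial; the helper name <em>missing_packages</em> is hypothetical.</p>

```python
import importlib.util

# Import names of the packages used in this tutorial
# (scikit-learn is imported as "sklearn", scikit-plot as "scikitplot")
REQUIRED = ["pandas", "numpy", "matplotlib", "sklearn", "seaborn", "scikitplot"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

<p class="wp-block-paragraph">Any name that is printed as missing can then be installed with one of the commands above.</p>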



<h3 class="wp-block-heading" id="h-about-the-sentiment-dataset">About the Sentiment Dataset</h3>



<p class="wp-block-paragraph">Let&#8217;s begin with the technical part. First, we will download the data from the <a href="https://www.kaggle.com/c/liverpool-ion-switching/data">Twitter sentiment example</a> on Kaggle.com. If you are working with the Kaggle Python environment, you can also directly save the data into your Python project. </p>



<p class="wp-block-paragraph">We will only use the following two CSV files:</p>



<ul class="wp-block-list">
<li><strong>train.csv:</strong> contains 27481 text samples.</li>



<li><strong>test.csv:</strong> contains 3533 text samples for validation purposes.</li>
</ul>



<p class="wp-block-paragraph">The two files contain four columns:</p>



<ul class="wp-block-list">
<li>textID: An identifier</li>



<li>text: The raw text</li>



<li>selected_text: Contains a selected part of the original text</li>



<li>sentiment: Contains the prediction label</li>
</ul>



<p class="wp-block-paragraph">We will copy the two files (train.csv and test.csv) into a folder that you can access from your Python environment. For simplicity, I recommend putting these files directly into the folder of your Python notebook. If you put them somewhere else, don&#8217;t forget to adjust the file path when loading the data.</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">Assuming that you have copied the files into your Python environment, the next step is to load the data into your Python project and convert it into a Pandas DataFrame. The following code performs these steps and then prints a data summary.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score

# Load the train data
train_path = &quot;train.csv&quot;
train_base_df = pd.read_csv(train_path) 

# Load the test data
test_path = &quot;test.csv&quot;
test_base_df = pd.read_csv(test_path) 

# Print a summary of the data
print(train_base_df.shape, test_base_df.shape)
print(train_base_df.head(5))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			textID		text												selected_text									sentiment
0			cb774db0d1	I`d have responded, if I were going					I`d have responded, if I were going				neutral
1			549e992a42	Sooo SAD I will miss you here in San Diego!!!		Sooo SAD										negative
2			088c60f138	my boss is bullying me...							bullying me										negative
3			9642c003ef	what interview! leave me alone						leave me alone									negative
4			358bd9e861	Sons of ****, why couldn`t they put them on t...	Sons of ****,									negative
...	
27481 rows × 4 columns</pre></div>



<h3 class="wp-block-heading" id="h-step-2-clean-and-preprocess-the-data">Step #2 Clean and Preprocess the Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Next, let&#8217;s quickly clean and preprocess the data. First, as a best practice, we will transform the sentiment labels of the train and the test data into numeric values.</p>



<p class="wp-block-paragraph">In addition, we will add a column in which we store the length of the text samples.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="2042" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-12-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/06/image-12.png" data-orig-size="389,108" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/06/image-12.png" src="https://www.relataly.com/wp-content/uploads/2020/06/image-12.png" alt="" class="wp-image-2042" width="248" height="68" srcset="https://www.relataly.com/wp-content/uploads/2020/06/image-12.png 389w, https://www.relataly.com/wp-content/uploads/2020/06/image-12.png 300w" sizes="(max-width: 248px) 100vw, 248px" /><figcaption class="wp-element-caption">Three-class sentiment scale</figcaption></figure>
</div>
</div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Define Class Integer Values
cleanup_nums = {&quot;sentiment&quot;: {&quot;negative&quot;: 1, &quot;neutral&quot;: 2, &quot;positive&quot;: 3}}

# Replace the Classes with Integer Values
train_df = train_base_df.copy()
train_df.replace(cleanup_nums, inplace=True)

# Clean the Test Data
test_df = test_base_df.copy()
test_df.replace(cleanup_nums, inplace=True)

# Create a Feature based on Text Length
train_df['text_length'] = train_df['text'].str.len() # Store string length of each sample
train_df = train_df.sort_values(['text_length'], ascending=True)
train_df = train_df.dropna()
train_df </pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			textID			text			selected_text	sentiment	text_length
14339		5c6abc28a1		ow				ow				2			3.0
26005		0b3fe0ca78		?				?				2			3.0
11524		4105b6a05d		aw				aw				2			3.0
641			5210cc55ae		no				no				2			3.0
25699		ee8ee67cb3		ME				ME				2			3.0
...</pre></div>
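<p class="wp-block-paragraph">To see the label encoding and the length feature in isolation, here is a small sketch on a toy DataFrame. The rows are invented for illustration. The sketch uses <code>map</code> instead of <code>replace</code>, which is an alternative and not the approach of the code above: with <code>map</code>, any label that is missing from the dictionary becomes NaN, so unexpected labels are easy to spot.</p>

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, same column names as the dataset)
toy_df = pd.DataFrame({
    'text': ['I love it', 'so sad', None, 'it is a tree'],
    'sentiment': ['positive', 'negative', 'neutral', 'neutral'],
})

label_map = {'negative': 1, 'neutral': 2, 'positive': 3}

# Drop incomplete rows first, then encode the labels and measure the text length
toy_df = toy_df.dropna(subset=['text']).copy()
toy_df['sentiment'] = toy_df['sentiment'].map(label_map)
toy_df['text_length'] = toy_df['text'].str.len()

print(toy_df.sort_values('text_length'))
```

<p class="wp-block-paragraph">Dropping the rows with missing text before computing the length keeps the new column free of NaN values.</p>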



<h3 class="wp-block-heading" id="h-step-3-explore-the-data">Step #3 Explore the Data</h3>



<p class="wp-block-paragraph">It&#8217;s always good to check the label distribution for a potential imbalance. We do this by plotting the distribution of labels in the text samples. This is important because it helps ensure that the trained model can make accurate predictions on new data. If the class labels are unbalanced, then the model is more likely to be biased toward the more common classes, which can lead to poor performance on less common classes. </p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/" target="_blank" rel="noreferrer noopener">Feature Engineering and Selection for Regression Models</a></p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Print the Distribution of Sentiment Labels
sns.set_theme(style=&quot;whitegrid&quot;)
ax = train_df['sentiment'].value_counts(sort=False).plot(kind='barh', color='b')
ax.set_xlabel('Count')
ax.set_ylabel('Labels')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="378" height="265" data-attachment-id="4636" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-42-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-42.png" data-orig-size="378,265" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-42" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-42.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-42.png" alt="Balance of Class Labels in a Machine Learning Use Case" class="wp-image-4636" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-42.png 378w, https://www.relataly.com/wp-content/uploads/2021/06/image-42.png 300w" sizes="(max-width: 378px) 100vw, 378px" /></figure>



<p class="wp-block-paragraph">As we can see, our data is a bit imbalanced, but the differences are still within an acceptable range. </p>
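<p class="wp-block-paragraph">If you prefer numbers over a plot, the imbalance can also be quantified. The sketch below uses an invented label column. In the tutorial, you would pass <code>train_df['sentiment']</code> instead. The ratio between the largest and the smallest class is just one simple indicator, and what counts as acceptable depends on the use case.</p>

```python
import pandas as pd

# Hypothetical label column; in the tutorial this would be train_df['sentiment']
labels = pd.Series([2, 2, 2, 2, 1, 1, 1, 3, 3, 3])

# Share of each class among all samples
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the most and the least frequent class (1.0 means perfectly balanced)
imbalance_ratio = class_share.max() / class_share.min()
print(round(imbalance_ratio, 2))
```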



<p class="wp-block-paragraph"> Let&#8217;s also quickly take a look at the distribution of text length. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualize a distribution of text_length
sns.histplot(data=train_df, x='text_length', bins='auto', color='darkblue');
plt.title('Text Length Distribution')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4635" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-41-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-41.png" data-orig-size="397,279" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-41" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-41.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-41.png" alt="" class="wp-image-4635" width="468" height="329" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-41.png 397w, https://www.relataly.com/wp-content/uploads/2021/06/image-41.png 300w" sizes="(max-width: 468px) 100vw, 468px" /></figure>



<h3 class="wp-block-heading" id="h-step-4-train-a-sentiment-classifier">Step #4 Train a Sentiment Classifier </h3>



<p class="wp-block-paragraph">Next, we will prepare the data and train a classification model. To keep things simple, we will use the Pipeline class of the scikit-learn framework and a bag-of-words model. In NLP, we typically have to split the text into words (tokens) and transform them into numeric features before a model can use them. The Pipeline class is thus very useful in NLP because it allows us to chain several such steps and apply them to the same data in sequence.</p>



<p class="wp-block-paragraph">The pipeline contains transformation activities and a prediction algorithm, the final estimator. In the following, we create two pipelines that use two different prediction algorithms: </p>



<ul class="wp-block-list">
<li>Logistic Regression </li>



<li>Naive Bayes</li>
</ul>



<h4 class="wp-block-heading" id="h-4a-sentiment-classification-using-logistic-regression">4a) Sentiment Classification using Logistic Regression </h4>



<p class="wp-block-paragraph">The first model that we will train uses the logistic regression algorithm. We create a new pipeline. Then we add two transformers and the logistic regression estimator. The pipeline will perform the following activities. </p>



<ul class="wp-block-list">
<li><strong><a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a></strong>: The vectorizer splits each text sample into words (tokens) and counts how often each word occurs. The resulting count vectors form the bag-of-words representation. </li>



<li><strong><a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html">TfidfTransformer</a></strong>: The TF-IDF (term frequency-inverse document frequency) transformer scales down the impact of words that occur very often in the training data and are thus less informative for the estimator than words that occur in a smaller fraction of the text samples. Examples are words such as &#8220;to&#8221; or &#8220;a.&#8221;</li>



<li><a href="https://www.relataly.com/category/machine-learning-algorithms/logistic-regression/" target="_blank" rel="noreferrer noopener">Logistic Regression</a>: By setting multi_class to &#8216;auto&#8217; in combination with the liblinear solver, we use logistic regression in a one-vs-rest (one-vs-all) approach. This approach splits our three-class prediction problem into three two-class problems: for each class, a binary model learns to separate this class from the two other classes. To make a prediction, all three models score the text sample, and the class whose model returns the highest score wins. </li>
</ul>



<p class="wp-block-paragraph">Our pipeline will transform the data and fit the logistic regression model to the training data. After executing the pipeline, we will directly evaluate the model&#8217;s performance. We will do this by defining a function that generates predictions on the test dataset and then evaluating the performance of our model. The function will print the performance results and store them in a dataframe. Later, when we want to compare the models, we can access the results from the dataframe. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a transformation pipeline
# The pipeline sequentially applies a list of transforms and, as a final estimator, logistic regression 
pipeline_log = Pipeline([
                ('count', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(solver='liblinear', multi_class='auto')),
        ])

# Train model using the created sklearn pipeline
model_name = 'logistic regression classifier'
model_lgr = pipeline_log.fit(train_df['text'], train_df['sentiment'])

# Metrics used by the evaluation function
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def evaluate_results(model, test_df):
    # Predict class labels for the test data
    test_df['pred'] = model.predict(test_df['text'])
    y_true = test_df['sentiment']
    y_pred = test_df['pred']
    target_names = ['negative', 'neutral', 'positive']

    # Print the classification report
    results_log = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    print(results_df_log)

    # Plot the confusion matrix (rows = actual labels, columns = predicted labels)
    matrix = confusion_matrix(y_true, y_pred)
    sns.heatmap(pd.DataFrame(matrix), 
                annot=True, fmt=&quot;d&quot;, linewidths=.5, cmap=&quot;YlGnBu&quot;)
    plt.xlabel('Predictions')
    plt.ylabel('Actual')
    plt.show()
    
    # Macro-averaged scores, returned in the order in which they are read below
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
    return f1, precision, recall

    
# Evaluate model performance
model_score = evaluate_results(model_lgr, test_df)
performance_df = pd.DataFrame([{'model_name': model_name, 
                                'f1_score': model_score[0], 
                                'precision': model_score[1], 
                                'recall': model_score[2]}])</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4643" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-47-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-47.png" data-orig-size="634,555" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-47" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-47.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-47.png" alt="performance of our logistic regression sentiment classifier" class="wp-image-4643" width="562" height="492" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-47.png 634w, https://www.relataly.com/wp-content/uploads/2021/06/image-47.png 300w" sizes="(max-width: 562px) 100vw, 562px" /></figure>



<h4 class="wp-block-heading" id="h-4b-sentiment-classification-using-naive-bayes">4b) Sentiment Classification using Naive Bayes</h4>



<p class="wp-block-paragraph">We will reuse the code from the last step to create another pipeline. However, we will exchange the logistic regression estimator with Naive Bayes (&#8220;MultinomialNB&#8221;). Naive Bayes is commonly used in natural language processing. The algorithm calculates the probability of each class for a text sequence and then outputs the class with the highest probability. For example, the words &#8220;likes&#8221; and &#8220;good&#8221; appear with a higher probability in texts of the category &#8220;positive sentiment&#8221; than in texts of the &#8220;negative&#8221; or &#8220;neutral&#8221; categories. In this way, the model estimates how likely it is that an unknown text containing those words belongs to each category. </p>
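<p class="wp-block-paragraph">To make the idea behind Naive Bayes more concrete, here is a minimal from-scratch sketch in plain Python. It works on raw word counts with add-one (Laplace) smoothing and on an invented two-class toy corpus. This is a simplification for illustration only: scikit-learn&#8217;s MultinomialNB is the implementation used in this tutorial, and our pipeline feeds it TF-IDF weights rather than raw counts.</p>

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (text, label)
train = [('i love this good movie', 'positive'),
         ('she likes good food', 'positive'),
         ('i hate this bad movie', 'negative'),
         ('bad service and bad food', 'negative')]

# Count the words per class and the number of samples per class
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text, label):
    """Log prior plus summed log likelihoods with add-one (Laplace) smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are skipped
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
    return score

def predict(text):
    """Return the class with the highest score."""
    return max(class_counts, key=lambda label: log_score(text, label))

print(predict('good food'))
print(predict('bad movie'))
```

<p class="wp-block-paragraph">The smoothing term prevents a single word that never occurred in a class from pushing the probability of that class to zero.</p>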



<p class="wp-block-paragraph">We will reuse the previously defined function to print a classification report and plot the results in a confusion matrix. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a pipeline which transforms phrases into normalized feature vectors and uses a bayes estimator
model_name = 'bayes classifier'

pipeline_bayes = Pipeline([
                ('count', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('gnb', MultinomialNB()),
                ])

# Train model using the created sklearn pipeline
model_bayes = pipeline_bayes.fit(train_df['text'], train_df['sentiment'])

# Evaluate model performance
model_score = evaluate_results(model_bayes, test_df)
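# DataFrame.append was removed in pandas 2.0, so we concatenate a one-row DataFrame instead
performance_df = pd.concat([performance_df, 
                            pd.DataFrame([{'model_name': model_name, 
                                           'f1_score': model_score[0], 
                                           'precision': model_score[1], 
                                           'recall': model_score[2]}])], ignore_index=True)</pre></div>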



<h3 class="wp-block-heading" id="h-step-5-measuring-multi-class-performance">Step #5 Measuring Multi-class Performance</h3>



<p class="wp-block-paragraph">So which classifier achieved better performance? It&#8217;s not so easy to say because it depends on the metrics. We will compare the classification performance of our two classifiers using the following metrics:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<ul class="wp-block-list">
<li><strong>Accuracy </strong>is calculated as the ratio between correctly predicted observations and total observations.</li>



<li><strong>Precision</strong> is calculated as the ratio between the observations that were correctly assigned to a class (true positives) and all observations that the model assigned to this class (true positives plus false positives).</li>



<li><strong>Recall </strong>is calculated as the ratio between the true positives and all observations that actually belong to the class (true positives plus false negatives). </li>



<li><strong>F1-Score</strong> is the harmonic mean of precision and recall. It takes both false positives and false negatives into account and is, therefore, useful when you have an unequal class distribution.</li>
</ul>
</div>
</div>
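<p class="wp-block-paragraph">The metrics above can be computed by hand from a confusion matrix. The sketch below uses an invented 3x3 matrix (not the results of our models) and treats one class at a time as the positive class. The helper name <code>per_class_metrics</code> is made up for this illustration.</p>

```python
# Hypothetical 3x3 confusion matrix: rows = actual class, columns = predicted class
labels = ['negative', 'neutral', 'positive']
matrix = [[50, 10, 0],
          [5, 80, 15],
          [0, 20, 60]]

def per_class_metrics(matrix, index):
    """Treat the class at `index` as the positive class and all others as negative."""
    tp = matrix[index][index]
    fp = sum(row[index] for row in matrix) - tp   # predicted as this class, but wrong
    fn = sum(matrix[index]) - tp                  # belongs to this class, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / total
print('accuracy:', round(accuracy, 3))

for i, name in enumerate(labels):
    precision, recall, f1 = per_class_metrics(matrix, i)
    print(name, round(precision, 3), round(recall, 3), round(f1, 3))
```

<p class="wp-block-paragraph">The per-class values correspond to the rows of a classification report, while accuracy is a single number for the whole matrix.</p>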



<p class="wp-block-paragraph">You may wonder which of our three classes is the positive class. The answer is that we have to determine the positive class ourselves. The other classes will then be counted as negative. Choosing the positive class ourselves lets us take into account that some classes may be more important than others. You can see this in the classification reports printed in Step #4, which contain separate metrics for each label. </p>



<p class="wp-block-paragraph">Another option is to average the per-class metrics. The weighted average (see the classification report) weights each label by its number of observations in the dataset. For example, the neutral label is weighted a bit higher than the negative and positive labels because fewer observations with negative and positive labels are present in the data. The macro average, in contrast, gives every label the same weight. Because our classes are equally important, the code in this tutorial uses the macro average to compare the models. </p>
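<p class="wp-block-paragraph">The difference between the two ways of averaging is easy to see in a few lines of plain Python. The recall values and class sizes below are invented for illustration and are not taken from our models.</p>

```python
# Hypothetical per-class recall values and class sizes (support)
recall = {'negative': 0.60, 'neutral': 0.80, 'positive': 0.70}
support = {'negative': 200, 'neutral': 500, 'positive': 300}

# Macro average: every class counts the same
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Accuracy: correctly classified samples divided by all samples
correct = sum(recall[c] * support[c] for c in recall)
accuracy = correct / total

print(round(macro_recall, 3), round(weighted_recall, 3), round(accuracy, 3))
```

<p class="wp-block-paragraph">Recall times support is the number of correctly classified samples of a class. This is why the support-weighted recall always coincides with the accuracy, while the macro average can differ from it as soon as the classes have different sizes.</p>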



<h3 class="wp-block-heading" id="h-step-6-comparing-model-performance">Step #6 Comparing Model Performance</h3>



<p class="wp-block-paragraph">The following code prints the performance metrics that we collected for the two classifiers and then creates a bar plot to illustrate the results. Note that the weighted-average recall always equals the accuracy, whereas the macro-averaged recall used here can differ from it. </p>



<p class="wp-block-paragraph">If you want to learn more about measuring classification performance, check out<a href="https://www.relataly.com/measuring-classification-performance-with-python-and-scikit-learn/846/" target="_blank" rel="noreferrer noopener"> this article</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Compare model performance
print(performance_df)

performance_df = performance_df.sort_values('model_name')
fig, ax = plt.subplots(figsize=(12, 4))
tidy = performance_df.melt(id_vars='model_name').rename(columns=str.title)
sns.barplot(y='Model_Name', x='Value', hue='Variable', data=tidy, ax=ax, palette='husl',  linewidth=1, edgecolor=&quot;w&quot;)
plt.title('Sentiment Classification Performance (Macro)')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4639" data-permalink="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/image-44-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-44.png" data-orig-size="1164,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-44" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-44.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-44-1024x449.png" alt="Performance comparison of the bayes classification model and the logistic regression classifier" class="wp-image-4639" width="744" height="326" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-44.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-44.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-44.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-44.png 1164w" sizes="(max-width: 744px) 100vw, 744px" /></figure>



<p class="wp-block-paragraph">So we see that our Logistic Regression model performs slightly better than the Naive Bayes model. Of course, there are still many possibilities to improve the models further. In addition, there are several other methods and algorithms with which the performance could be significantly increased.</p>



<h3 class="wp-block-heading" id="h-step-7-make-test-predictions">Step #7 Make Test Predictions</h3>



<p class="wp-block-paragraph">Finally, we use the logistic regression classifier to generate some test predictions. To try the Bayes classifier instead, swap the model as noted in the code comment. Feel free to try it out! Change the text in the testphrases list and see for yourself how the classifier responds. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">testphrases = ['Mondays just suck!', 'I love this product', 'That is a tree', 'Terrible service']
# Map the numeric class labels back to readable names
label_names = {1: 'Negative', 2: 'Neutral', 3: 'Positive'}
for testphrase in testphrases:
    resultx = model_lgr.predict([testphrase]) # use model_bayes for predictions with the other model
    print(testphrase + '-&gt; ' + label_names[resultx[0]])</pre></div>



<ul class="wp-block-list">
<li>Mondays just suck!-&gt; Negative </li>



<li>I love this product-&gt; Positive </li>



<li>That is a tree-&gt; Neutral </li>



<li>Terrible service-&gt; Negative</li>
</ul>
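<p class="wp-block-paragraph">Besides the predicted label, it can be helpful to look at how confident a model is. The sketch below trains a tiny Naive Bayes pipeline on invented toy phrases and prints the class probabilities for one test phrase. The toy data and the name <code>toy_model</code> are assumptions of this example. With the models from this tutorial, you could call <code>predict_proba</code> on <code>model_bayes</code> or <code>model_lgr</code> in the same way.</p>

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = negative, 2 = neutral, 3 = positive)
texts = ['terrible service', 'this is awful', 'that is a tree',
         'that is a house', 'i love this product', 'what a great day']
labels = [1, 1, 2, 2, 3, 3]
label_names = {1: 'Negative', 2: 'Neutral', 3: 'Positive'}

toy_model = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
]).fit(texts, labels)

phrase = 'I love this product'
probabilities = toy_model.predict_proba([phrase])[0]

# classes_ holds the labels in the same order as the probability columns
for label, probability in zip(toy_model.classes_, probabilities):
    print(label_names[label], round(probability, 3))
```

<p class="wp-block-paragraph">Probabilities that are close to each other indicate that the model is unsure, which is typical for short phrases with few known words.</p>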



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">That&#8217;s it! In this tutorial, you have learned to build a simple sentiment classifier that can detect sentiment expressed through text on a three-class scale. We have trained and tested two standard classification algorithms &#8211; Logistic Regression and Naive Bayes. Finally, we have compared the performance of the two algorithms and made some test predictions. </p>



<p class="wp-block-paragraph">The best way to deepen your knowledge of sentiment analysis is to apply it in practice. I thus want to encourage you to use your knowledge by tackling other NLP challenges. For example, you could build a text classifier that assigns text phrases to topic labels such as sports, fashion, cars, technology, etc. If you are still looking for data you can use for such a project, you will find exciting datasets on Kaggle.com.</p>



<p class="wp-block-paragraph">Let me know if you found this tutorial helpful. I appreciate your feedback!</p>



<p>The post <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/">Training a Sentiment Classifier with Naive Bayes and Logistic Regression in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2007</post-id>	</item>
		<item>
		<title>Classifying Purchase Intention of Online Shoppers with Python</title>
		<link>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/</link>
					<comments>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 11 May 2020 21:42:35 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Classification (two-class)]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Sources]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[AI in Marketing]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Classification Error Metrics]]></category>
		<category><![CDATA[Confusion Matrix]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Whisker Plots]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=982</guid>

					<description><![CDATA[<p>Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers&#8217; purchase intentions. This innovative process can help businesses understand their customers&#8217; behavior and tailor their marketing strategies accordingly. In this article, we ... <a title="Classifying Purchase Intention of Online Shoppers with Python" class="read-more" href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/" aria-label="Read more about Classifying Purchase Intention of Online Shoppers with Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/">Classifying Purchase Intention of Online Shoppers with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Online shopping has become a part of our daily lives, and online stores are continually seeking to improve their sales. One way to achieve this is by using machine learning to predict customers&#8217; purchase intentions. This innovative process can help businesses understand their customers&#8217; behavior and tailor their marketing strategies accordingly.</p>



<p class="wp-block-paragraph">In this article, we will explore the practical side of purchase intention prediction. Our focus is on developing a classification model that predicts whether a visitor will make a purchase or not. We&#8217;ll use the Scikit-learn machine learning library to train a logistic regression model and evaluate its performance. Our ultimate goal is to provide insights into the circumstances under which customers make purchase decisions.</p>



<p class="wp-block-paragraph">Predicting purchase intentions can offer significant benefits to online stores, such as identifying potential customers who are most likely to buy and targeting their marketing efforts accordingly. By understanding the practical application of machine learning for purchase intention prediction, online businesses can gain a competitive edge and increase their revenue.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" target="_blank" rel="noreferrer noopener">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></p>



<h2 class="wp-block-heading">About Modeling Customer Purchase Intentions</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Customer purchase intention prediction is the process of using machine learning algorithms to predict the likelihood that a particular customer will make a purchase. This can be useful for various applications, such as identifying potential customers most likely interested in a particular product or service and targeting marketing and sales efforts accordingly.</p>



<p class="wp-block-paragraph">To make accurate predictions about customer purchase intentions, it is important to have access to high-quality data about the customer, such as their demographic information, purchasing history, and other relevant factors. By analyzing this data and applying appropriate machine learning algorithms, it is possible to identify patterns and trends that can predict the likelihood that a particular customer will make a purchase.</p>



<p class="wp-block-paragraph">There are many different approaches to customer purchase intention prediction, and the specific methods used can vary depending on the application and the data available. Some common techniques for predicting customer purchase intentions include using regression analysis to model the relationship between purchase intentions and other variables and using classification algorithms to classify customers as likely or unlikely to make a purchase. By using these techniques, it is possible to make more accurate and useful predictions about customer purchase intentions.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/predicting-the-customer-churn-of-a-telecommunications-provider/2378/" target="_blank" rel="noreferrer noopener">Customer Churn Prediction &#8211; Understanding Models with Feature Permutation Importance</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="478" height="500" data-attachment-id="12685" data-permalink="https://www.relataly.com/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" data-orig-size="478,500" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="men and woman doing groceries machine learning customer purchase intention prediction relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png" alt="Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with Midjourney." 
class="wp-image-12685" srcset="https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png 478w, https://www.relataly.com/wp-content/uploads/2023/03/men-and-woman-doing-groceries-machine-learning-customer-purchase-intention-prediction-relataly-midjourney-min.png 287w" sizes="(max-width: 478px) 100vw, 478px" /><figcaption class="wp-element-caption">Customer purchase intentions sometimes follow patterns that can be used for predictive purposes. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading">How Modeling Purchase Intentions can Lead to a Better Customer Understanding</h2>



<p class="wp-block-paragraph">Predicting the purchase intentions of online shoppers can help online stores understand their customers better. Predictive models make it possible to draw conclusions about the factors that influence customers&#8217; buying behavior. At what time of day are our customers most inclined to buy? For which products do customers often abandon the purchase process? Such questions are of great interest to marketing departments. Once answered, they enable marketers to optimize their customers&#8217; buying experience and achieve a higher conversion rate. In this way, intention prediction can help online stores target customers with the right products at the right time and thus take a step toward marketing automation.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6828" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-13-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png" data-orig-size="1846,861" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Classifying Purchase Intentions of Online Shoppers with Python" data-image-description="&lt;p&gt;Classifying Purchase Intentions of Online Shoppers with Python&lt;/p&gt;
" data-image-caption="&lt;p&gt;Classifying Purchase Intentions of Online Shoppers with Python&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-13-1024x478.png" alt="A classification model that predicts the buying intention of online shoppers" class="wp-image-6828" width="760" height="355" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-13.png 1846w" sizes="(max-width: 760px) 100vw, 760px" /></figure>



<h2 class="wp-block-heading" id="h-implementing-a-prediction-model-for-purchase-intentions-with-python">Implementing a Prediction Model for Purchase Intentions with Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Logistic regression is a widely used machine learning algorithm that is particularly well suited to two-class classification problems. One of its primary benefits is that the fitted model helps us understand which factors influence its predictions. This interpretability makes logistic regression a popular choice in many real-world applications.</p>



<p class="wp-block-paragraph">In the next steps of our analysis, we will develop a two-class classification model that uses the logistic regression algorithm to predict the purchase intentions of online shoppers. The model analyzes a set of session features that are likely to influence a shopper&#8217;s decision to purchase, such as the number and type of pages visited, bounce and exit rates, and the visitor type, and estimates the likelihood that the shopper completes a purchase. Logistic regression is particularly useful in this case, as it allows us to identify which features are the strongest predictors of purchase intention.</p>
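
<p class="wp-block-paragraph">To illustrate why logistic regression is easy to interpret, the following minimal sketch shows how such a model turns a weighted sum of features into a probability. The weights, the intercept, and the session values are hypothetical numbers chosen for illustration; they are not results from the model trained in this tutorial.</p>

<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import math

def sigmoid(z):
    # Logistic function: maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, intercept, and session values (illustration only)
weights = {'PageValues': 0.08, 'ExitRates': -12.0}
intercept = -1.5
session = {'PageValues': 30.0, 'ExitRates': 0.05}

# Weighted sum of the features, i.e. the log-odds of a purchase
log_odds = intercept + sum(weights[name] * session[name] for name in weights)

# The logistic function converts the log-odds into a probability
purchase_probability = sigmoid(log_odds)
print(round(log_odds, 2), round(purchase_probability, 3))</pre></div>

<p class="wp-block-paragraph">In this sketch, a positive weight raises the predicted purchase probability as the feature value grows, while a negative weight lowers it. Reading the sign and size of the weights is what makes the model interpretable.</p>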



<p class="wp-block-paragraph">The code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_d5d832-9e"><a class="kb-button kt-button button kb-btn_7d1c88-9e kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/02%20Classification/019%20%20Classifying%20Shopper%20Buying%20Intention%20using%20Logistic%20Regression.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_040040-16 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and the required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>. Please make sure that the following packages are installed:</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">Scikit-learn</a> and <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn</a> for visualization. You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the Anaconda package manager)</li>
</ul>
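
<p class="wp-block-paragraph">If you are unsure whether the packages are already installed, a short check along the following lines may help. This is a minimal sketch that assumes the packages were installed under their usual distribution names (for example, <em>scikit-learn</em>); it uses only the Python standard library.</p>

<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from importlib import metadata

# Distribution names of the packages used in this tutorial (assumed names)
required = ['pandas', 'numpy', 'matplotlib', 'scikit-learn', 'seaborn']

# Look up the installed version of each package, or None if it is missing
status = {}
for name in required:
    try:
        status[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        status[name] = None

for name, version in status.items():
    print(name, version if version else 'not installed')</pre></div>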



<h3 class="wp-block-heading" id="h-about-the-dataset">About the Dataset</h3>



<p class="wp-block-paragraph">In this tutorial, we will be working with a public dataset from <a href="https://www.kaggle.com/roshansharma/online-shoppers-intention" target="_blank" rel="noreferrer noopener">Kaggle.com</a>. The data consists of 12,330 shopping sessions, each described by 18 attributes (17 features plus the class label). You can download the data via the link below:</p>



<div class="wp-block-file"><a id="wp-block-file--media-3f304c01-ab35-4462-bda0-88dce356d27e" href="https://www.relataly.com/wp-content/uploads/2020/05/online_shoppers_intention.csv">online_shoppers_intention.csv</a><a href="https://www.relataly.com/wp-content/uploads/2020/05/online_shoppers_intention.csv" class="wp-block-file__button wp-element-button" download aria-describedby="wp-block-file--media-3f304c01-ab35-4462-bda0-88dce356d27e">Download</a></div>



<p class="wp-block-paragraph">The data stems from a large shopping website that recorded its sessions over one year. Each record belongs to a separate shopping session and a different user. This design avoids a bias toward a specific period, user, or day.</p>



<p class="wp-block-paragraph">Below you will find an overview of the features contained in the data (Source: Kaggle.com): </p>



<ul class="wp-block-list">
<li>&#8220;Administrative,&#8221; &#8220;Administrative Duration,&#8221; &#8220;Informational,&#8221; &#8220;Informational Duration,&#8221; &#8220;Product Related,&#8221; and &#8220;Product-Related Duration&#8221; represent the number of different types of pages visited by the visitor in that session and the total time spent in each of these page categories.&nbsp;</li>



<li>The &#8220;Bounce Rate,&#8221; &#8220;Exit Rate,&#8221; and &#8220;Page Value&#8221; features represent the metrics measured by &#8220;Google Analytics&#8221; for each page on the e-commerce site. </li>



<li>The &#8220;Special Day&#8221; feature indicates the closeness of the site visiting time to a specific special day (e.g., Mother&#8217;s Day, Valentine&#8217;s Day).</li>



<li>The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.</li>
</ul>



<p class="wp-block-paragraph">The &#8216;Revenue&#8217; attribute is the class label, called the &#8220;prediction label.&#8221;</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by loading the shopping dataset into a Pandas DataFrame. Afterward, we will print a brief overview of the data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import calendar
import math 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib import cm
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load train data
filepath = &quot;data/classification-online-shopping/&quot;
df_shopping_base = pd.read_csv(filepath + 'online_shoppers_intention.csv') 
df_shopping_base</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType			Weekend	Revenue
0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			Feb		1					1		1		1			Returning_Visitor	False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			Feb		2					2		1		2			Returning_Visitor	False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			Feb		4					1		9		3			Returning_Visitor	False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			Feb		3					2		2		4			Returning_Visitor	False	False
4	0.0				0.0						0.0				0.0						10.0			627.500000				0.02		0.05		0.0			0.0			Feb		3					3		1		4			Returning_Visitor	True	False</pre></div>



<h3 class="wp-block-heading" id="h-step-2-cleaning-the-data">Step #2 Cleaning the Data</h3>



<p class="wp-block-paragraph">Before we can start training our prediction model, we&#8217;ll clean up the data: we convert the categorical columns &#8220;VisitorType&#8221; and &#8220;Month&#8221; to numeric values and drop records with missing values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Encode the VisitorType column as integers
print(df_shopping_base['VisitorType'].unique())
df_shop = df_shopping_base.replace({'VisitorType' : { 'New_Visitor' : 0, 'Returning_Visitor' : 1, 'Other' : 2 }})

# Convert the Month column to numeric values
# (the data uses 'June' instead of the abbreviation 'Jun')
month_abbreviations = list(calendar.month_abbr)
df_shop['Month'] = [month_abbreviations.index(month) for month in df_shop['Month'].replace('June', 'Jun')]

# Delete records with NAs
df_shop.dropna(inplace=True)

df_shop.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">['Returning_Visitor' 'New_Visitor' 'Other']
	Administrative	Administrative_Duration	Informational	Informational_Duration	ProductRelated	ProductRelated_Duration	BounceRates	ExitRates	PageValues	SpecialDay	Month	OperatingSystems	Browser	Region	TrafficType	VisitorType	Weekend	Revenue
  0	0.0				0.0						0.0				0.0						1.0				0.000000				0.20		0.20		0.0			0.0			2		1					1		1		1			1			False	False
1	0.0				0.0						0.0				0.0						2.0				64.000000				0.00		0.10		0.0			0.0			2		2					2		1		2			1			False	False
2	0.0				-1.0					0.0				-1.0					1.0				-1.000000				0.20		0.20		0.0			0.0			2		4					1		9		3			1			False	False
3	0.0				0.0						0.0				0.0						2.0				2.666667				0.05		0.14		0.0			0.0			2		3					2		2		4			1			False	False
4	0.0				0.0						0.0				0.0						10.0			627.50</pre></div>
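
<p class="wp-block-paragraph">The cleaning steps can also be expressed with a lookup table instead of a loop. The following sketch applies them to a small stand-in DataFrame with made-up values (the name <em>df_demo</em> and its contents are assumptions for illustration), so that it runs without the CSV file. It also counts the missing values per column before dropping incomplete rows.</p>

<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import calendar
import pandas as pd

# Small stand-in for the shopping data (values are made up for illustration)
df_demo = pd.DataFrame({
    'Month': ['Feb', 'June', 'Dec', 'May'],
    'VisitorType': ['Returning_Visitor', 'New_Visitor', 'Other', 'New_Visitor'],
    'ExitRates': [0.2, None, 0.05, 0.1],
})

# Lookup table from month abbreviation to month number (Feb maps to 2)
month_numbers = {abbr: num for num, abbr in enumerate(calendar.month_abbr) if abbr}
df_demo['Month'] = df_demo['Month'].replace('June', 'Jun').map(month_numbers)

# Encode the visitor type with the same codes as in the tutorial
visitor_codes = {'New_Visitor': 0, 'Returning_Visitor': 1, 'Other': 2}
df_demo['VisitorType'] = df_demo['VisitorType'].map(visitor_codes)

# Count missing values per column, then drop incomplete rows
missing_per_column = df_demo.isna().sum()
df_clean = df_demo.dropna()
print(missing_per_column)
print(df_clean)</pre></div>

<p class="wp-block-paragraph">Note that <em>map</em> yields a missing value for any entry that is not in the lookup table, whereas the loop in the tutorial code raises an error in that case. Which behavior is preferable depends on how strictly you want to validate the input.</p>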



<h3 class="wp-block-heading" id="h-step-3-exploring-the-data">Step #3 Exploring the Data</h3>



<p class="wp-block-paragraph">Next, we will familiarize ourselves with the data. </p>



<h4 class="wp-block-heading" id="h-3-1-class-labels">3.1 Class Labels</h4>



<p class="wp-block-paragraph">First, we take a look at the class labels to see how balanced they are. Class labels are balanced when each class has approximately the same number of examples in the training data. If the class labels are unbalanced, the model is more likely to be biased towards the more common class, which can lead to poor performance on the less common class. Unbalanced class labels also make it harder to evaluate a model: a classifier that always predicts the majority class can reach a high accuracy without having learned anything useful.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Checking the balance of prediction labels
plt.figure(figsize=(16,2))
fig = sns.countplot(y=&quot;Revenue&quot;, data=df_shop, palette=&quot;muted&quot;)
plt.show()</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="6830" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/output-3-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" data-orig-size="953,154" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" src="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png" alt="" class="wp-image-6830" width="946" height="153" srcset="https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 953w, https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/output-3.png 768w" sizes="(max-width: 946px) 100vw, 946px" /></figure>



<p class="wp-block-paragraph">Our class labels are imbalanced: there are many more sessions with the label &#8220;False&#8221; than with the label &#8220;True.&#8221; The reason is that most visitors leave the shop without buying anything. Imbalanced data can affect the performance of classification models. Now that we are aware of the imbalance in our data, we can choose appropriate evaluation metrics later.</p>
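
<p class="wp-block-paragraph">The degree of imbalance can also be expressed in numbers. The sketch below uses a made-up label column (eight sessions without and two with a purchase, not the actual proportions of the shopping data) to compute the share of each class and the accuracy of a trivial model that always predicts the majority class.</p>

<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import pandas as pd

# Made-up labels for illustration: 8 sessions without and 2 with a purchase
revenue = pd.Series([False] * 8 + [True] * 2, name='Revenue')

# Relative frequency of each class
class_shares = revenue.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
majority_class = class_shares.idxmax()
baseline_accuracy = (revenue == majority_class).mean()
print(class_shares)
print(baseline_accuracy)</pre></div>

<p class="wp-block-paragraph">With real data, you would pass the &#8220;Revenue&#8221; column of the cleaned DataFrame instead of the made-up series. Any model worth using should clearly beat this baseline accuracy, which is why accuracy alone is a weak metric for imbalanced data.</p>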



<h4 class="wp-block-heading" id="h-3-2-feature-correlation">3.2 Feature Correlation</h4>



<p class="wp-block-paragraph">When developing classification models, not all features are equally useful. Ideally, the features are not strongly correlated with each other, because correlated features provide redundant information to a machine learning model. In a linear model such as logistic regression, highly correlated features can make the estimated coefficients unstable. They also make the model harder to interpret, because it is not clear which of the correlated features actually contributes to a prediction.</p>
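
<p class="wp-block-paragraph">One simple way to find redundant features is to compute the pairwise correlation matrix and list the pairs above a threshold. The following sketch does this on synthetic stand-in data. The column names, the random data, and the threshold of 0.9 are assumptions for illustration and not part of the shopping dataset.</p>

<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
import pandas as pd

# Synthetic stand-in data: feature_b is almost a scaled copy of feature_a
rng = np.random.default_rng(0)
feature_a = rng.normal(size=200)
df_demo = pd.DataFrame({
    'feature_a': feature_a,
    'feature_b': feature_a * 2 + rng.normal(scale=0.1, size=200),
    'feature_c': rng.normal(size=200),
})

# Absolute Pearson correlation between all pairs of columns
corr = df_demo.corr().abs()

# Visit each pair once and keep those above a hypothetical threshold
threshold = 0.9
columns = list(corr.columns)
high_pairs = [
    (columns[i], columns[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(columns))
    for j in range(i + 1, len(columns))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)</pre></div>

<p class="wp-block-paragraph">For each reported pair, one option is to keep only one of the two features. Whether that is appropriate depends on the model and on how the features are used.</p>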



<p class="wp-block-paragraph">Let&#8217;s check which of our features are correlated. First, we will create a series of Whiskerplots for the features in our dataset. They help us identify potential outliers and get a better idea of how the data looks.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Whiskerplots
df_shop.drop('Revenue', axis=1).plot(kind='box', 
                                subplots=True, layout=(4,4), 
                                sharex=False, sharey=False, 
                                figsize=(14,14), 
                                title='Whisker plots for input variables')
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="986" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-35-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" data-orig-size="821,893" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-35" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" src="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png" alt="Purchase Intention Prediction, Feature Permutation Importance, Feature Correlation plot" class="wp-image-986" width="664" height="721" srcset="https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 821w, https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 276w, https://www.relataly.com/wp-content/uploads/2020/05/image-35.png 768w" sizes="(max-width: 664px) 100vw, 664px" /><figcaption class="wp-element-caption">Feature Whiskerplots</figcaption></figure>



<p class="wp-block-paragraph">The whisker plots show that there are a couple of outliers in the data. However, they are not extreme enough to require special treatment.</p>



<p class="wp-block-paragraph">Pair plots are another way of visualizing the data. The panels on the diagonal show the distribution of each feature, and the remaining panels show a scatterplot for each pair of features, colored by the prediction label. To create the pair plots, run the code below.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create pair plots for the feature columns, colored by prediction label value
df_plot = df_shop.copy()

# class_columnname = 'Revenue'
sns.pairplot(df_plot, hue=&quot;Revenue&quot;, height=2.5)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="6829" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/shopper-buying-intention/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png" data-orig-size="2560,2485" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Shopper-Buying-Intention pair plots with seaborn" data-image-description="&lt;p&gt;Shopper-Buying-Intention pair plots with seaborn&lt;/p&gt;
" data-image-caption="&lt;p&gt;Shopper-Buying-Intention pair plots with seaborn&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png" src="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention-1024x994.png" alt="Purchase Intention Prediction, Feature Permutation Importance, Feature Correlation plot" class="wp-image-6829" width="1117" height="1085" srcset="https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 2048w, https://www.relataly.com/wp-content/uploads/2022/04/Shopper-Buying-Intention.png 2475w" sizes="(max-width: 1117px) 100vw, 1117px" /></figure>



<p class="wp-block-paragraph">Finally, we create a correlation matrix and visualize it as a heat map. The matrix provides a quick overview of which features are correlated and which are not.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Feature correlation
plt.figure(figsize=(15,4))
f_cor = df_shop.corr()
sns.heatmap(f_cor, cmap=&quot;Blues_r&quot;)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4662" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-50-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" data-orig-size="899,367" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-50" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4662" width="674" height="275" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 899w, https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-50.png 768w" sizes="(max-width: 674px) 100vw, 674px" /></figure>



<p class="wp-block-paragraph">The correlation plot shows that the following pairs of features are highly correlated:</p>



<ul class="wp-block-list">
<li>ProductRelated and ProductRelated_Duration</li>



<li>BounceRates and ExitRates</li>
</ul>
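
<p class="wp-block-paragraph">As a rough sketch, highly correlated pairs can also be listed programmatically instead of being read off the heat map. The helper highly_correlated_pairs, the threshold of 0.8, and the synthetic demo data below are illustrative assumptions and not part of the original notebook. With the real data, you would pass df_shop instead of demo.</p>

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for numeric column pairs with |r| at or above threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic stand-in for df_shop (assumption: made-up values for illustration only)
rng = np.random.default_rng(0)
n = 500
bounce = rng.uniform(0, 0.2, n)
demo = pd.DataFrame({
    "BounceRates": bounce,
    "ExitRates": bounce + rng.normal(0, 0.01, n),  # built to follow BounceRates closely
    "PageValues": rng.exponential(5.0, n),          # independent of the other two
})

pairs = highly_correlated_pairs(demo, threshold=0.8)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.2f}")
```

<p class="wp-block-paragraph">The threshold is a judgment call. A lower value flags more pairs, so it is worth checking the flagged pairs visually, for example with scatterplots.</p>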



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">plt.figure(figsize=(8,5))
sns.scatterplot(x= 'BounceRates',y='ExitRates',data=df_shop,hue='Revenue')
plt.title('Bounce Rate vs. Exit Rate', fontweight='bold', fontsize=15)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4674" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-51-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" data-orig-size="510,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-51" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4674" width="537" height="352" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-51.png 510w, https://www.relataly.com/wp-content/uploads/2021/06/image-51.png 300w" sizes="(max-width: 537px) 100vw, 537px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">plt.figure(figsize=(8,5))
sns.scatterplot(x= 'ProductRelated',y='ProductRelated_Duration',data=df_shop,hue='Revenue')
plt.title('ProductRelated vs. ProductRelated_Duration', fontweight='bold', fontsize=15)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4675" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-52-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" data-orig-size="514,335" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-52" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png" alt="Purchase Intention Prediction, Feature Permutation Importance" class="wp-image-4675" width="528" height="343" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-52.png 514w, https://www.relataly.com/wp-content/uploads/2021/06/image-52.png 300w" sizes="(max-width: 528px) 100vw, 528px" /></figure>



<p class="wp-block-paragraph">When we train our model, we will use only one feature from each of the two pairs.</p>



<h3 class="wp-block-heading" id="h-step-4-data-preprocessing">Step #4 Data Preprocessing </h3>



<p class="wp-block-paragraph">Now that we are familiar with the data, we can prepare it to train the purchase intention classification model. First, we select the features we want to use from the original shopping dataset. Second, we split the data into two separate datasets: 70% of the sessions go into the training set and the remaining 30% into the test set. X_train and X_test contain the features, while y_train and y_test contain the respective prediction labels. Third, we use the MinMaxScaler to scale the numeric features to a range between 0 and 1. The scaler is fitted on the training data only and then applied to the test data. Scaling puts all features on a comparable range, which helps the solver converge and can improve classification performance.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Separate labels from training data
features = ['Administrative', 'Administrative_Duration', 'Informational', 
            'Informational_Duration', 'ProductRelated', 'BounceRates', 'PageValues', 
            'Month', 'Region', 'TrafficType', 'VisitorType']
X = df_shop[features] # Input features
y = df_shop['Revenue'] # Prediction label

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Scale the numeric values
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)</pre></div>
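
<p class="wp-block-paragraph">To illustrate what the scaler does, the following minimal sketch reproduces the min-max formula (x - min) / (max - min) by hand on a tiny made-up array. The arrays X_train_demo and X_test_demo are assumptions for illustration. Because the minimum and maximum come from the training data only, scaled test values can fall outside the range 0 to 1.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up arrays (assumption: illustrative values, not the shopping data)
X_train_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [12.0, 5.0]])

# Fit on the training data only, then apply the same transformation to the test data
scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

# The same transformation by hand, using the training minimum and maximum per column
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)
manual_test_scaled = (X_test_demo - col_min) / (col_max - col_min)

print(test_scaled)
print(manual_test_scaled)
```

<p class="wp-block-paragraph">The second test row lies outside the training range, so its scaled values are above 1 and below 0. This is expected and usually harmless for logistic regression.</p>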



<h3 class="wp-block-heading" id="h-step-5-train-a-purchase-intention-classifier">Step #5 Train a Purchase Intention Classifier</h3>



<p class="wp-block-paragraph">Next, it is time to train our prediction model. Various classification algorithms could be used to solve this problem, for example, decision trees, random forests, neural networks, or support-vector machines. We will use the logistic regression algorithm, a common choice for simple two-class prediction problems. </p>



<p class="wp-block-paragraph">We start the training process using the &#8220;fit&#8221; method of the logistic regression algorithm. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Training a classification model using logistic regression 
logreg = LogisticRegression(solver='lbfgs')
score = logreg.fit(X_train, y_train).decision_function(X_test)</pre></div>



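<p class="wp-block-paragraph">The fit method trains the model. Calling decision_function on the test data then returns one confidence score per session: positive values point to the class &#8220;True&#8221; (purchase), and negative values point to the class &#8220;False.&#8221; These scores are not a performance measure yet. We evaluate how well the model performs on the test dataset in the next step.</p>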
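
<p class="wp-block-paragraph">The following sketch, which uses synthetic data from make_classification as a stand-in for the shopping dataset, shows how the different outputs of a fitted binary logistic regression relate to each other. All variable names are illustrative assumptions.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data (assumption: stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)

# One signed confidence score per test sample
raw_scores = clf.decision_function(X_te)

# Probability of the positive class; for the binary case this is the sigmoid of the raw score
proba_positive = clf.predict_proba(X_te)[:, 1]
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-raw_scores))

# A single performance number: mean accuracy on the test set
test_accuracy = clf.score(X_te, y_te)

print("first raw scores:", np.round(raw_scores[:3], 2))
print("first probabilities:", np.round(proba_positive[:3], 2))
print("test accuracy:", round(test_accuracy, 2))
```

<p class="wp-block-paragraph">In short, decision_function and predict_proba describe individual predictions, while score summarizes the performance over a whole dataset.</p>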



<h3 class="wp-block-heading" id="h-step-6-evaluate-model-performance">Step #6 Evaluate Model Performance</h3>



<p class="wp-block-paragraph">Finally, we will evaluate the performance of our classification model. For this purpose, we first create a confusion matrix. Then we calculate and compare different error metrics.</p>



<h4 class="wp-block-heading" id="h-6-1-confusion-matrix">6.1 Confusion Matrix</h4>



<p class="wp-block-paragraph">The confusion matrix is a compact way to illustrate the results of a classification model. It compares the predicted labels with the actual labels. For a binary classification model, the matrix has 2&#215;2 cells, and each cell shows the number of cases for one combination of actual and predicted label.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create a confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

# create heatmap
%matplotlib inline
class_names=[False, True] # name  of classes
fig, ax = plt.subplots(figsize=(7, 6))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap=&quot;YlGnBu&quot;, fmt='g')
ax.xaxis.set_label_position(&quot;top&quot;)
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="990" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-39-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" data-orig-size="492,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-39" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" src="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png" alt="confusion matrix on the results of our classification model that predicts purchase intentions, purchase intention prediction model" class="wp-image-990" width="374" height="344" srcset="https://www.relataly.com/wp-content/uploads/2020/05/image-39.png 492w, https://www.relataly.com/wp-content/uploads/2020/05/image-39.png 300w" sizes="(max-width: 374px) 100vw, 374px" /></figure>



<p class="wp-block-paragraph">In the upper left (0,0), we see that the model correctly predicted that 3102 online shopping sessions would not lead to a purchase (True negatives). In 30 cases, the model was wrong and predicted a purchase that did not happen (False positives). For 412 buyers, the model predicted that they would not buy anything, even though they made a purchase (False negatives). In the lower right corner, we see that the model correctly identified buyers in only 151 cases (True positives).</p>
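
<p class="wp-block-paragraph">As a small sketch, the four cells can also be unpacked by name. For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]], so ravel() returns the counts in that order. The label arrays below are rebuilt from the counts reported above and are only a stand-in for y_test and y_pred.</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts taken from the confusion matrix described in the text
tn_count, fp_count, fn_count, tp_count = 3102, 30, 412, 151

# Rebuild label arrays that produce exactly these counts (stand-in for y_test and y_pred)
y_true_demo = np.array([False] * (tn_count + fp_count) + [True] * (fn_count + tp_count))
y_pred_demo = np.array([False] * tn_count + [True] * fp_count
                       + [False] * fn_count + [True] * tp_count)

# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```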



<h4 class="wp-block-heading" id="h-6-2-performance-metrics-for-classification-models">6.2 Performance Metrics for Classification Models</h4>



<p class="wp-block-paragraph">Next, let&#8217;s take a brief look at the performance metrics. Four standard metrics that measure the performance of classification models are Accuracy, Precision, Recall, and  f1_score. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_pred)))
print('f1_score: {:.2f}'.format(f1_score(y_test, y_pred)))</pre></div>



<h5 class="wp-block-heading" id="h-accuracy"><strong>Accuracy</strong></h5>



<p class="wp-block-paragraph">The accuracy on the test set shows that 88% of the online shopper sessions were correctly classified. However, our data is imbalanced. That is to say, most labels have the value &#8220;False,&#8221; and only a few target labels are &#8220;True.&#8221; A high accuracy alone is therefore not enough: we must ensure that our model does not simply classify all online shoppers as &#8220;non-buyers&#8221; (label: False) but also correctly predicts the buyers (label: True).</p>
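
<p class="wp-block-paragraph">A quick calculation with the counts from the confusion matrix makes this concrete. The sketch below compares the accuracy of our model with a hypothetical baseline that labels every session as &#8220;False.&#8221; The variable names are illustrative.</p>

```python
# Counts from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

total = tn + fp + fn + tp
actual_non_buyers = tn + fp  # all sessions whose true label is False

# A baseline that always predicts False is right for every actual non-buyer
baseline_accuracy = actual_non_buyers / total
model_accuracy = (tn + tp) / total

print('Baseline accuracy: {:.2f}'.format(baseline_accuracy))  # 0.85
print('Model accuracy: {:.2f}'.format(model_accuracy))        # 0.88
```

<p class="wp-block-paragraph">The model beats the trivial baseline by only about three percentage points, which is why accuracy should not be the only metric we look at.</p>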



<h5 class="wp-block-heading" id="h-precision"><strong>Precision</strong></h5>



<p class="wp-block-paragraph">We calculate the precision as the number of True Positives divided by the sum of True Positives and False Positives. Precision tells us how reliable a predicted purchase is, but it ignores the False Negatives. Therefore, it does not reveal how many buyers the model misses. The precision score for our model is 83%, just a little lower than the accuracy.</p>



<h5 class="wp-block-heading" id="h-recall"><strong>Recall</strong></h5>



<p class="wp-block-paragraph">We calculate the Recall by dividing the number of True Positives by the sum of the True Positives and the False Negatives. The Recall of our model is 27%, which is significantly below accuracy and precision. In our case, Recall is more meaningful than accuracy and precision because it penalizes the large number of buyers that the model fails to identify (False Negatives).</p>



<h5 class="wp-block-heading" id="h-f1-score"><strong>F1-Score</strong></h5>



<p class="wp-block-paragraph">The F1-Score is the harmonic mean of precision and recall. Its formula is 2*((precision*recall)/(precision+recall)). Because the low Recall pulls the harmonic mean down, the F1-Score of our model is only 41%. If we want to optimize our classification model further, we should keep an eye on both F1-Score and Recall.</p>
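
<p class="wp-block-paragraph">As a cross-check, the four metrics can be recomputed by hand from the counts in the confusion matrix. This is a plain-Python sketch of the formulas described above, and the variable names are illustrative.</p>

```python
# Counts from the confusion matrix in section 6.1
tn, fp, fn, tp = 3102, 30, 412, 151

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print('Accuracy: {:.2f}'.format(accuracy))    # 0.88
print('Precision: {:.2f}'.format(precision))  # 0.83
print('Recall: {:.2f}'.format(recall))        # 0.27
print('f1_score: {:.2f}'.format(f1))          # 0.41
```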



<h4 class="wp-block-heading" id="h-6-3-interpretation">6.3 Interpretation</h4>



<p class="wp-block-paragraph">Metrics for classification models can be misleading. We should thus choose them carefully. Depending on which use case we are dealing with, False-negative and False-positive predictions can have different costs. Therefore, model evaluation is not always about exactness (precision and accuracy). Instead, the choice of performance metrics depends on what we want to achieve.</p>



<p class="wp-block-paragraph">The challenge for our model is to correctly classify the smaller group of buyers (True positives). So, optimizing our model would be about raising Recall and F1-Score without giving up too much accuracy.</p>
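
<p class="wp-block-paragraph">One common way to trade precision for Recall, which is not part of the original notebook, is to lower the probability threshold above which a session counts as a purchase. The sketch below uses synthetic data and a hypothetical helper recall_precision_at. The thresholds 0.5 and 0.3 are arbitrary examples.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (assumption: stand-in for the shopping dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8, n_informative=4,
                                     weights=[0.85, 0.15], flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7,
                                          stratify=y_demo, random_state=0)

clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class


def recall_precision_at(threshold):
    """Classify as positive when the probability reaches the threshold."""
    y_hat = (proba >= threshold).astype(int)
    return (recall_score(y_te, y_hat),
            precision_score(y_te, y_hat, zero_division=0))


recall_default, precision_default = recall_precision_at(0.5)
recall_low, precision_low = recall_precision_at(0.3)

print('Threshold 0.5: recall {:.2f}, precision {:.2f}'.format(recall_default, precision_default))
print('Threshold 0.3: recall {:.2f}, precision {:.2f}'.format(recall_low, precision_low))
```

<p class="wp-block-paragraph">Lowering the threshold can only keep or increase Recall, typically at the cost of more False positives. Another option worth trying is the class_weight='balanced' parameter of LogisticRegression.</p>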



<h3 class="wp-block-heading" id="h-step-7-insights-on-customer-purchase-intentions">Step #7 Insights on Customer Purchase Intentions</h3>



<p class="wp-block-paragraph">Finally, we will use permutation feature importance to gain additional insights into our prediction model&#8217;s features. Permutation feature importance measures how much the model&#8217;s score drops when the values of a single feature are randomly shuffled. A large drop (a high positive score) means that the model relies heavily on this feature. In contrast, features with scores close to zero play a lesser role in the predictions, and a negative score means that shuffling the feature did not hurt the score at all.</p>
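
<p class="wp-block-paragraph">To show the idea, here is a simplified hand-written sketch of the technique. It uses made-up data in which only the first feature determines the label. The function manual_permutation_importance is a hypothetical helper for illustration and not scikit-learn&#8217;s implementation.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: the label depends only on feature 0, features 1 and 2 are pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, y_tr = X_demo[:700], y_demo[:700]
X_te, y_te = X_demo[700:], y_demo[700:]
clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when a single column is shuffled."""
    shuffle_rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this feature and the label
            X_perm[:, col] = shuffle_rng.permutation(X_perm[:, col])
            drops.append(baseline - model.score(X_perm, y))
        importances[col] = np.mean(drops)
    return importances


importances = manual_permutation_importance(clf, X_te, y_te)
for col, value in enumerate(importances):
    print(f"feature {col}: mean accuracy drop = {value:.3f}")
```

<p class="wp-block-paragraph">Shuffling the informative feature causes a large drop in accuracy, while shuffling the noise features changes almost nothing.</p>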



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Compute the permutation feature importance of the trained model on the test set
r = permutation_importance(logreg, X_test, y_test, n_repeats=30, random_state=0)

# Plot the barchart
data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
data_im['feature_names'] = X.columns
data_im = data_im.sort_values('feature_permuation_score', ascending=False)

fig, ax = plt.subplots(figsize=(16, 5))
sns.barplot(y=data_im['feature_names'], x=&quot;feature_permuation_score&quot;, data=data_im, palette='nipy_spectral')
ax.set_title(&quot;Logistic Regression Feature Importances&quot;)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="326" data-attachment-id="4684" data-permalink="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/image-56-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png" data-orig-size="1050,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-56" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-56-1024x326.png" alt="online purchase intention prediction - results of the feature permutation importance technique" class="wp-image-4684" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-56.png 1050w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">We can see that the three features with the highest impact are PageValues, BounceRates, and Administrative_Duration.</p>



<ul class="wp-block-list">
<li>The higher the page&#8217;s value, the higher the customer&#8217;s chance to make a purchase. </li>



<li>The higher the average bounce rate of the pages that the customer visits, the higher the chance the customer makes a purchase.</li>



<li>In contrast, the more time a customer spends on administrative settings, the lower the chance the customer completes the purchase.</li>
</ul>
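
<p class="wp-block-paragraph">Note that permutation importance measures how strongly the model relies on a feature, not the direction of its effect. One way to check the direction for a logistic regression model is to look at the sign of the fitted coefficients. The sketch below uses made-up data with two hypothetical features, feature_a and feature_b. With the model from this article, you would inspect logreg.coef_[0] together with the names in the features list instead.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data (assumption): feature_a raises and feature_b lowers the purchase probability
rng = np.random.default_rng(0)
n = 2000
feature_names = ["feature_a", "feature_b"]
X_demo = rng.uniform(0, 1, size=(n, 2))
true_logit = 4.0 * X_demo[:, 0] - 4.0 * X_demo[:, 1]
purchase_probability = 1.0 / (1.0 + np.exp(-true_logit))
y_demo = (purchase_probability > rng.uniform(0, 1, n)).astype(int)

clf = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)

# A positive coefficient raises the predicted probability, a negative one lowers it
coefficients = dict(zip(feature_names, clf.coef_[0]))
for name, coef in coefficients.items():
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: coefficient {coef:+.2f} ({direction} the predicted purchase probability)")
```

<p class="wp-block-paragraph">Coefficients are only comparable across features when the features are on a similar scale, which is one more reason for scaling the inputs before training.</p>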



<p class="wp-block-paragraph">These were just a few sample findings. There is much more to explore in the data, and deeper analysis can uncover much more about the customers&#8217; buying decisions.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has presented customer purchase prediction as an interesting use case for machine learning in e-commerce. After discussing the use case, we have developed a classification model that predicts the purchase intentions of online shoppers. You have learned to preprocess the data, train a logistic regression model and evaluate the model&#8217;s performance. Classifying purchase intentions can help online shops understand their customers better and automate certain online marketing activities. The previous section showed how marketers could use this to gain further insights into their customers&#8217; behavior.</p>



<p class="wp-block-paragraph">Thanks for reading and if you have any questions, let me know in the comments. </p>



<p>The post <a href="https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/">Classifying Purchase Intention of Online Shoppers with Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/predicting-the-purchase-intention-of-online-shoppers/982/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">982</post-id>	</item>
	</channel>
</rss>
