<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Advanced Tutorials Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/tag/advanced-tutorials/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/tag/advanced-tutorials/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:37:46 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Advanced Tutorials Archives - relataly.com</title>
	<link>https://www.relataly.com/tag/advanced-tutorials/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Feature Engineering and Selection for Regression Models with Python and Scikit-learn</title>
		<link>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/</link>
					<comments>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 26 Sep 2022 22:20:29 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Feature Engineering]]></category>
		<category><![CDATA[Feature Permutation Importance]]></category>
		<category><![CDATA[Linear Regression]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Measuring Model Performance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Random Decision Forests]]></category>
		<category><![CDATA[Sales Forecasting]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Simple Regression]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Feature Engineering for Time Series Forecasting]]></category>
		<category><![CDATA[Feature Exploration]]></category>
		<category><![CDATA[Feature Selection]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Price Regression]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=8832</guid>

					<description><![CDATA[<p>Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients &#8211; in this case, carefully selected input features &#8211; you can create a model that&#8217;s both ... <a title="Feature Engineering and Selection for Regression Models with Python and Scikit-learn" class="read-more" href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/" aria-label="Read more about Feature Engineering and Selection for Regression Models with Python and Scikit-learn">Read more</a></p>
<p>The post <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/">Feature Engineering and Selection for Regression Models with Python and Scikit-learn</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Training a machine learning model is like baking a cake: the quality of the end result depends on the ingredients you put in. If your input data is poor, your predictions will be too. But with the right ingredients &#8211; in this case, carefully selected input features &#8211; you can create a model that&#8217;s both accurate and powerful. This is where feature engineering comes in. It&#8217;s the process of exploring, creating, and selecting the most relevant and useful features to use in your model. And just like a chef experimenting with different spices and flavors, the process of feature engineering is iterative and tailored to the problem at hand. In this guide, we&#8217;ll walk you through a step-by-step process using Python and Scikit-learn to create a strong set of features for a regression problem. By the end, you&#8217;ll have the skills to tackle any feature engineering challenge that comes your way.</p>



<p class="wp-block-paragraph">The remainder of this article proceeds as follows: We begin with a brief intro to feature engineering and describe valuable techniques. We then turn to the hands-on part, in which we develop a regression model for car sales. We apply various techniques that show how to handle outliers and missing values, perform correlation analysis, and discover and manipulate features. You will also find information about common challenges and helpful sklearn functions. Finally, we will compare our regression model to a baseline model that uses the original dataset.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/simple-sentiment-analysis-using-naive-bayes-and-logistic-regression/2007/" target="_blank" rel="noreferrer noopener">Sentiment Analysis with Naive Bayes and Logistic Regression in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div style="height:31px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading">What is Feature Engineering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Feature engineering is the process of using domain knowledge of the data to create features (variables) that make machine learning algorithms work. This is an important step in the machine learning pipeline because the choice of good features can greatly affect the performance of the model. The goal is to identify features, tweak them, and select the most promising ones into a smaller feature subset. We can break this process down into several action items. </p>



<p class="wp-block-paragraph">Data Scientists can easily spend 70% to 80% of their time on feature engineering. The time is well spent, as changes to input data have a direct impact on performance. This process is often iterative and requires repeatedly revisiting the various tasks as understanding the data and the problem evolves. Knowing techniques and associated challenges helps in adequate feature engineering.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/mastering-prompt-engineering-for-chatgpt-a-practical-guide-for-businesses/13134/" target="_blank" rel="noreferrer noopener">Mastering Prompt Engineering for ChatGPT for Business Use</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="1024" data-attachment-id="12411" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/engineering-features-python-tutorial-machine-learning/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="engineering-features-python-tutorial-machine-learning" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png" src="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning-1024x1024.png" alt="Engineering features python tutorial machine learning. Image of an engineer working on a technical document. Midjourney. relataly.com" class="wp-image-12411" srcset="https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/engineering-features-python-tutorial-machine-learning.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Feature engineering is about carefully choosing features instead of taking all the features at once. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading">Core Tasks</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The goal of feature engineering is to create a set of features that are representative of the underlying data and that can be used by the machine learning algorithm to make accurate predictions. Several tasks are commonly performed as part of the feature engineering process, including:</p>



<ul class="wp-block-list">
<li><strong>Data discovery</strong>: To solve real-world problems with analytics, it is crucial to understand the data. Once you have gathered your data, describing and visualizing the data are means to familiarize yourself with it and develop a general feel for the data. </li>



<li><strong>Data structuring:</strong> The data needs to be structured into a unified and usable format. Variables may have a wrong datatype, or the data is distributed across different data frames and must first be merged. In these cases, we first need to bring the data together and into the right shape.</li>



<li><strong>Data cleansing:</strong> Besides being structured, data needs to be cleaned. Records may be redundant or contaminated with errors and missing values that can hinder our model from learning effectively. The same goes for outliers that can distort statistics. </li>



<li><strong>Data transformation:</strong> We can increase the predictive power of our input features by transforming them. Activities may include applying mathematical functions, removing specific data, or grouping variables into bins. Or we create entirely new features out of several existing ones. </li>



<li><strong>Feature selection: </strong>Only some may contain valuable information from the many available variables. By sorting variables that are less relevant and selecting the most promising features, we can create models that are less complex and yield better results.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Exploratory Feature Engineering Toolset</h3>



<p class="wp-block-paragraph">Exploratory analysis for identifying and assessing relevant features knows several tools: </p>



<ul class="wp-block-list">
<li>Data Cleansing</li>



<li>Descriptive statistics</li>



<li>Univariate Analysis</li>



<li>Bi-variate Analysis</li>



<li>Multivariate Analysis</li>
</ul>



<h2 class="wp-block-heading">Data Cleansing</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Educational data is often remarkably perfect, without any errors or missing values. However, it is important to recognize that most real-world data has data quality issues. Some reasons for data quality issues are </p>



<ul class="wp-block-list">
<li>Standardization issues because the data was recorded from different peoples, sensor types, etc.</li>



<li>Sensor or system outages can lead to gaps in the data or create erroneous data points.</li>



<li>Human errors</li>
</ul>



<p class="wp-block-paragraph">An important part of feature engineering is to inspect the data and ensure its quality before use. This is what we understand as &#8220;data cleansing.&#8221; It includes several tasks that aim to improve the data quality, remove erroneous data points and bring the data into a more useful form. </p>



<ul class="wp-block-list">
<li>Cleaning errors, missing values, and other issues.</li>



<li>Handling possible imbalanced data </li>



<li>Removing obvious outliers</li>



<li>Standardisation, e.g., dates or adresses </li>
</ul>



<p class="wp-block-paragraph">Accomplishing these tasks requires a good understanding of the data. We, therefore, carry out data cleansing activities closely intertwined with other exploratory tasks, e.g., univariate and bivariate data analysis. Also, remember that visualizations can aid in the process, as they can greatly enhance your ability to analyze and understand the data. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h4 class="wp-block-heading">Descriptive Statistics</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">One of the first steps in familiarizing oneself with a new dataset is to use descriptive statistics. Descriptive statistics help understand the data and how the sample represents the real-world population. We can use several statistical measures to analyze and describe a dataset, including the following:</p>



<ul class="wp-block-list">
<li><strong>Measures of Central Tendency</strong> represent a typical value of the data.
<ul class="wp-block-list">
<li><strong>The mean:</strong> The average-based adds together all values in the sample and divides them by the number of samples.</li>



<li><strong>The median</strong>: The median is the value that lies in the middle of the range of all sample values</li>



<li><strong>The mode: </strong>is the most occurring value in a sample set (for categorical variables)</li>
</ul>
</li>



<li><strong>Measures of Variability</strong> tell us something about the spread of the data.
<ul class="wp-block-list">
<li><strong>Range:</strong> The difference between the minimum and maximum value</li>



<li><strong>Variance:</strong> This is the average of the squared difference of the mean.</li>



<li><strong>Standard Deviation:</strong> The square root of the variance.</li>
</ul>
</li>



<li>and <strong>Measures of Frequency</strong> inform us how often we can expect a value to be present in the data, e.g., value counts</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph"><strong>Univariate Analysis</strong></p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">As &#8220;uni&#8221; suggests, the univariate analysis focuses on a single variable. Rather than examining the relationships between the variables, univariate analysis employs descriptive statistics and visualizations to understand individual columns better.</p>



<p class="wp-block-paragraph">Which illustrations and measures we use depends on the type of the variable.</p>



<p class="wp-block-paragraph"><strong>Categorical variables (incl. binary)</strong></p>



<ul class="wp-block-list">
<li>Descriptive measures include counts in percent and absolute values</li>



<li>Visualizations include pie charts, bar charts (count plots)</li>
</ul>



<p class="wp-block-paragraph"><strong>Continuous variables</strong></p>



<ul class="wp-block-list">
<li>Descriptive measures include min, max, median, mean, variance, standard deviation, and quantiles.</li>



<li>Visualizations include box plots, line plots, and histograms.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="838" height="585" data-attachment-id="9261" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" data-orig-size="838,585" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Normal distribution" data-image-description="&lt;p&gt;Normal distribution, univariate analysis&lt;/p&gt;
" data-image-caption="&lt;p&gt;Normal distribution, univariate analysis&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output.png" alt="" class="wp-image-9261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output.png 838w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output.png 768w" sizes="(max-width: 838px) 100vw, 838px" /><figcaption class="wp-element-caption">Normal distribution, univariate analysis</figcaption></figure>



<figure class="wp-block-image size-full"><img decoding="async" width="751" height="194" data-attachment-id="9293" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-12-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" data-orig-size="751,194" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-12" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png" alt="" class="wp-image-9293" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-12.png 751w, https://www.relataly.com/wp-content/uploads/2022/09/image-12.png 300w" sizes="(max-width: 751px) 100vw, 751px" /></figure>
</div>
</div>



<h4 class="wp-block-heading">Bi-variate Analysis </h4>



<p class="wp-block-paragraph">Bi-variate (two-variate) analysis is a kind of statistical analysis that focuses on the relationship between two variables, for example, between a feature column and the target variable. In the case of machine learning projects, bivariate analysis can help to identify features that are potentially predictive of the label or the regression target. </p>



<p class="wp-block-paragraph">Model performance will benefit from strong linear dependencies. In addition, we are also interested in examining the relationships among the features used to train the model. Different types of relations exist that can be examined using various plots and statistical measures:</p>



<h4 class="wp-block-heading">Numerical/Numerical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Both variables have numerical values. We can illustrate their relation using lineplots or dot plots. We can examine such relations with <a href="https://www.relataly.com/category/data-science/pearson-correlation/" target="_blank" rel="noreferrer noopener">correlation analysis</a>.</p>



<p class="wp-block-paragraph">The ideal feature subset contains features that are not correlated with each other but are heavily correlated with the target variable. We can use dimensionality reduction to reduce a dataset with many features to a lower-dimensional space in which the remaining features are less correlated.</p>



<p class="wp-block-paragraph">Traditional correlation analysis (e.g., Pearson) cannot consider non-linear relations. We can identify such a relation manually by visualizing the data, for example, using line plots. Once we denote a non-linear relation, we could try to apply mathematical transformations to one of the variables to make their relation more linear. </p>



<p class="wp-block-paragraph">For pairwise analysis, we must understand which variables we deal with. We can differentiate between three categories:</p>



<ul class="wp-block-list">
<li>Numerical/Categorical</li>



<li>Numerical/Numerical</li>



<li>Categorical/Categorical</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/09/image-2.png" alt="Heatmaps illustrate the relation between features and a target variable." class="wp-image-9269" width="372" height="328"/><figcaption class="wp-element-caption">Heatmaps illustrate the relation between features and a target variable.</figcaption></figure>
</div>
</div>



<div style="height:36px" aria-hidden="true" class="wp-block-spacer"></div>



<h4 class="wp-block-heading">Numerical/Categorical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Plots that visualize the relationship between a categorical and a numerical variable include barplots and lineplots. </p>



<p class="wp-block-paragraph">Especially helpful are histograms (count plots). They can highlight differences in the distribution of the numerical variable for different categories.</p>



<p class="wp-block-paragraph">A specific subcase is a numerical/date relation. Such relations are typically visualized using line plots. In addition, we want to look out for linear or non-linear dependencies. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9286" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-6-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" data-orig-size="764,406" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-6" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png" alt="the lineplot is useful for feature exploration and engineering" class="wp-image-9286" width="379" height="201" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-6.png 764w, https://www.relataly.com/wp-content/uploads/2022/09/image-6.png 300w" sizes="(max-width: 379px) 100vw, 379px" /><figcaption class="wp-element-caption">Line charts are useful when examining trends.</figcaption></figure>
</div>
</div>



<h4 class="wp-block-heading">Categorical/Categorical</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The relation between two categorical variables can be studied, including density plots, histograms, and bar plots.</p>



<p class="wp-block-paragraph">For example, with car types (attributes: sedan and coupe) and colors (characteristics: red, blue, yellow), we can use a barplot to see if sedans are more often red than coupes. Differences in the distribution of characteristics can be a starting point for attempts to manipulate the features and improve model performance. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9291" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-11-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" data-orig-size="765,396" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png" alt="the barplot is useful for feature exploration and engineering" class="wp-image-9291" width="374" height="194" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-11.png 765w, https://www.relataly.com/wp-content/uploads/2022/09/image-11.png 300w" sizes="(max-width: 374px) 100vw, 374px" /><figcaption class="wp-element-caption">Bar and column charts are a great way to compare numeric values for discrete categories visually.</figcaption></figure>
</div>
</div>



<h4 class="wp-block-heading">Multivariate Analysis</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph"><em>Multivariate</em> analysis encompasses the simultaneous analysis of more than two variables. The approach can uncover multi-dimensional dependencies and is often used in advanced feature engineering. For example, you may find that two variables are weakly correlated with the target variable, but when combined, their relation intensifies. So you might try to create a new feature that uses the two variables as input. Plots that can visualize relations between several variables include dot plots and violin plots.</p>



<p class="wp-block-paragraph">In addition, multivariate analysis refers to techniques to reduce the dimensionality of a dataset. For example, principal component analysis (PCA) or factor analysis can condense the information in a data set into a smaller number of synthetic features.</p>



<p class="wp-block-paragraph">Now that we have a good understanding of what feature selection techniques are available, we can start the practical part and apply them.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9282" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-4-20/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" data-orig-size="738,409" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png" alt="the scatterplot is useful for feature exploration and engineering" class="wp-image-9282" width="377" height="209" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-4.png 738w, https://www.relataly.com/wp-content/uploads/2022/09/image-4.png 300w" sizes="(max-width: 377px) 100vw, 377px" /><figcaption class="wp-element-caption">Scatter charts are useful when you want to compare two numeric quantities and see a relationship or correlation between them.</figcaption></figure>



<figure class="wp-block-image size-full"><img decoding="async" width="743" height="405" data-attachment-id="9294" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-13-7/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" data-orig-size="743,405" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-13" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png" alt="the violin plot is useful for feature exploration and engineering" class="wp-image-9294" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-13.png 743w, https://www.relataly.com/wp-content/uploads/2022/09/image-13.png 300w" sizes="(max-width: 743px) 100vw, 743px" /></figure>
</div>
</div>



<div style="height:100px" aria-hidden="true" class="wp-block-spacer"></div>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/cryptocurrency-price-charts-with-color-overlay-python/2820/" target="_blank" rel="noreferrer noopener">Color-Coded Cryptocurrency Price Charts in Python</a></p>



<h2 class="wp-block-heading" id="h-feature-engineering-for-car-price-regression-with-python-and-scikit-learn">Feature Engineering for Car Price Regression with Python and Scikit-learn</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The value of a car on the market depends on various factors. The distance traveled with the vehicle and the year of manufacture is obvious dependencies. But beyond that, we can use many other factors to train a machine learning model that predicts the selling price of the used car market. The following hands-on Python tutorial will create such a model. We will work with a dataset containing used cars&#8217; characteristics in the following. For marketing, it is crucial to understand what car characteristics determine the price of a vehicle. Our goal is to model the car price from the available independent variables. We aim to build a model that performs well on a small but powerful input subset. </p>



<p class="wp-block-paragraph">Exploring and creating features varies between different application domains. For example, feature engineering in computer vision will differ greatly from feature engineering for regression or classification models or NLP models. So the example provided in this article is just for regression models.</p>



<p class="wp-block-paragraph">We follow an exploratory process that includes the following steps:</p>



<ol class="wp-block-list">
<li>Loading the data</li>



<li>Cleaning the data</li>



<li>Univariate analysis</li>



<li>Bivariate analysis</li>



<li>Selecting features</li>



<li>Data preparation </li>



<li>Model training</li>



<li>Measuring performance</li>
</ol>



<p class="wp-block-paragraph">Finally, we compare the performance of our model, which was trained on a minimal set of features, to a model that uses the original data.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12810" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png" src="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney-512x512.png" alt="Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with Midjourney." class="wp-image-12810" srcset="https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 140w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 768w, https://www.relataly.com/wp-content/uploads/2023/03/Dwarf-blacksmith-machine-learning-python-feature-engineering-relataly-midjourney.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Yes, you can judge by the length of the beard that this guy is a legendary feature engineer. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph">The Python code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_f9d778-26"><a class="kb-button kt-button button kb-btn_d0af05-38 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/11%20Hyperparamter%20Tuning/015%20Hyperparameter%20Tuning%20of%20Regression%20Models%20using%20Random%20Search.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_7b2495-91 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before you proceed, ensure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python</a> environment (3.8 or higher) and the required packages. If you don&#8217;t have an environment, follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>



<li>Seaborn</li>



<li>Scikit-learn</li>
</ul>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading">About the Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we will be working with a dataset containing listings for 111763&nbsp;used cars. The data includes 13 variables, including the dependent target variable</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<ul class="wp-block-list">
<li><strong>prod_date:</strong> The year of production</li>



<li><strong>maker: </strong>The manufacturer&#8217;s name</li>



<li><strong>model: </strong>The car edition</li>



<li><strong>trim: </strong>Different versions of the model</li>



<li><strong>body_type: </strong>The body style of a vehicle</li>



<li><strong>transmission_type: </strong>The way the power is brought to the wheels</li>



<li><strong>state</strong>: The state in which the car is auctioned</li>



<li><strong>condition</strong>: The condition of the cars</li>



<li><strong>odometer</strong>: The distance the car has traveled since manufactured</li>



<li><strong>exterior_color</strong>: Exterior color</li>



<li><strong>interior_color</strong>: Interior color</li>



<li><strong>sale_price (target variable):</strong> The price a car was sold </li>



<li><strong>sale_date: </strong>The date on which the car has been sold</li>
</ul>
</div>
</div>



<p class="wp-block-paragraph">The dataset is available for download from <a href="https://www.kaggle.com/datasets/lepchenkov/usedcarscatalog" target="_blank" rel="noreferrer noopener">Kaggle.com</a>, but you can execute the code below and load the data from the relataly GitHub repository.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="505" height="510" data-attachment-id="12429" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" data-orig-size="505,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png" alt="Car price prediction machine learning python tutorial. Image of different cars cartoon style. Midjourney. relataly.com" class="wp-image-12429" srcset="https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 505w, https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 297w, https://www.relataly.com/wp-content/uploads/2023/02/artishellen_set_of_elements_cars_different_colored_cars_cartoon_87cde816-541c-4e6e-ba8c-cfa530032760-min.png 140w" sizes="(max-width: 505px) 100vw, 505px" /><figcaption class="wp-element-caption">Car price prediction is a solid use case for machine learning. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">We begin by importing the necessary libraries and downloading the dataset from the relataly GitHub repository. Next, we will read the dataset into a pandas DataFrame. In addition, we store the name of our regression target variable to &#8216;price_usd,&#8217; which is one of the columns in the initial dataset. The &#8220;.head ()&#8221; function displays the first records of our DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5
from codecs import ignore_errors
import math
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from pandas.api.types import is_string_dtype, is_numeric_dtype 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.inspection import permutation_importance
from sklearn.model_selection import ShuffleSplit
# Original Data Source: 
# https://www.kaggle.com/datasets/tunguz/used-car-auction-prices
# Load train and test datasets
df = pd.read_csv(&quot;https://raw.githubusercontent.com/flo7up/relataly_data/main/car_prices2/car_prices.csv&quot;)
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker			model		trim		body_type		transmission_type	state	condition	odometer	exterior_color	interior	sellingprice	date
0	2015		Kia				Sorento		LX			SUV				automatic			ca		5.0			16639.0		white			black		21500	2014-12-16
1	2015		Nissan			Altima		2.5 S		Sedan			automatic			ca		1.0			5554.0		gray			black		10900	2014-12-30
2	2014		Audi			A6	3.0T 	Prestige 	quattro	Sedan	automatic			ca		4.8			14414.0		black			black		49750	2014-12-16</pre></div>



<p class="wp-block-paragraph">We now have a dataframe that contains 12 columns and the dependent target variable we want to predict. </p>



<h3 class="wp-block-heading" id="h-step-2-data-cleansing">Step #2 Data Cleansing</h3>



<p class="wp-block-paragraph">Now that we have loaded the data, we begin with the exploratory analysis. First, we will put it into shape. </p>



<h4 class="wp-block-heading" id="h-2-1-check-names-and-datatypes">2.1 Check Names and Datatypes</h4>



<p class="wp-block-paragraph">If the names in a dataset are not self-explaining, it is easy to get confused with all the data. Therefore, will rename some of the columns and provide clearer names. There is no default naming convention, but striving for consistency, simplicity, and understandability is generally a good idea. </p>



<p class="wp-block-paragraph">The following code line renames some of the columns. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># rename some columns for consistency
df.rename(columns={'exterior_color': 'ext_color', 
                   'interior': 'int_color', 
                   'sellingprice': 'sale_price'}, inplace=True)
df.head(1)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	transmission_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			automatic			ca		5.0			16639.0		white		black		21500		2014-12-16</pre></div>



<p class="wp-block-paragraph">Next, we will check and remove possible duplicates.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check and remove dublicates
print(len(df))
df = df.drop_duplicates()
print(len(df))</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">OUT: 111763, 111763</pre></div>



<p class="wp-block-paragraph">There were no duplicates in the data, which is good.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check datatypes
df.dtypes</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year              int64
maker                 object
model                 object
trim                  object
body_type             object
transmission_type     object
state                 object
condition            float64
odometer             float64
ext_color             object
int_color             object
sale_price             int64
date                  object
dtype: object</pre></div>



<p class="wp-block-paragraph">We compare the datatypes to the first records we printed in the previous section. Be aware that categorical variables (e.g., of type &#8220;string&#8221;) are shown as &#8220;objects.&#8221; The data types look as expected.</p>



<p class="wp-block-paragraph">Finally, we define our target variable&#8217;s name, &#8220;sale_price.&#8221; The target variable will be our regression target, and we will use its name often. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># consistently define the target variable
target_name = 'sale_price'</pre></div>



<h4 class="wp-block-heading">2.2 Checking Missing Values</h4>



<p class="wp-block-paragraph">Some machine learning algorithms are sensitive to missing values. Handling missing values is, therefore a crucial step in exploratory feature engineering. </p>



<p class="wp-block-paragraph">Let&#8217;s first gain an overview of null values. With a larger DataFrame, it would be inefficient to review all the rows and columns individually for missing values. Instead, we use the sum function and visualize the results to get a quick overview of missing data in the DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check for missing values
null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
fig = plt.subplots(figsize=(16, 6))
ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)
ax.set_title('Overview of missing values')</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="384" data-attachment-id="9365" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/missing-values-bar-chart-for-car-price-regression/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" data-orig-size="1026,385" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="missing-values-bar-chart-for-car-price-regression" data-image-description="&lt;p&gt;overview of missing values in the car price regression dataset&lt;/p&gt;
" data-image-caption="&lt;p&gt;overview of missing values in the car price regression dataset&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png" src="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression-1024x384.png" alt="overview of missing values in the car price regression dataset" class="wp-image-9365" srcset="https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/missing-values-bar-chart-for-car-price-regression.png 1026w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">The bar chart shows that there are several variables with missing values. Variables with many missing values can negatively affect model performance, which is why we should try to treat them. </p>



<h4 class="wp-block-heading">2.3 Overview of Techniques for Handling Missing Values</h4>



<p class="wp-block-paragraph"> There are various ways to handle missing data. The most common options to handle missing values are:</p>



<ul class="wp-block-list">
<li><strong>Custom substitution value:</strong> Sometimes, the information that a value is missing can be important information to a predictive model. We can substitute missing values with a placeholder value such as &#8220;missing&#8221; or &#8220;unknown.&#8221; The approach works particularly well for variables with many missing values. </li>



<li><strong>Statistical filling: </strong>We can fill in a statistically chosen measure, such as the mean or median for numeric variables, or the mode for categorical variables.</li>



<li><strong>Replace using Probabilistic PCA:</strong> PCA uses a linear approximation function that tries to reconstruct the missing values from the data.</li>



<li><strong>Remove entire rows:</strong> It is crucial to ensure that we only use data we know is correct. In those cases, we can drop an entire row if it contains a missing value. This also solves the problem but comes at the cost of losing potentially important information &#8211; especially if the data quantity is small.</li>



<li><strong>Remove the entire column:</strong> It is another alternative way of resolving missing values. This is typically the least option, as we lose an entire feature. </li>
</ul>



<p class="wp-block-paragraph">How we handle missing values can dramatically affect our prediction results. To find the ideal method, it is often necessary to experiment with different techniques. Sometimes, the information that a value is missing can also be important. This occurs when the missing values are not randomly distributed in the data and show a pattern. In such a case, you should create an additional feature that states whether values are missing.</p>



<h4 class="wp-block-heading">2.4 Handle Missing Values</h4>



<p class="wp-block-paragraph">In this example, we will use the median value to fill in the missing values of our numeric variables and the mode to replace the missing values of categorical variables. When we check again, we can see that odometer and condition have no more missing values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># fill missing values with the mean for numeric columns
for col_name in df.columns:
    if (is_numeric_dtype(df[col_name])) and (df[col_name].isna().sum() &gt; 0):
        df[col_name].fillna(df[col_name].median(), inplace=True) # alternatively you could also drop the columns with missing values using .drop(columns=['engine_capacity']) 
print(df.isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year                0
maker                 2078
model                 2096
trim                  2157
body_type             2641
transmission_type    13135
state                    0
condition                0
odometer                 0
ext_color              173
int_color              173
sale_price               0
date                     0
dtype: int64</pre></div>



<p class="wp-block-paragraph">Next, we handle the missing values of transmission_type by filling them with the mode.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the distribution of missing values for transmission type
print(df['transmission_type'].value_counts())
# fill values with the mode
df['transmission_type'].fillna(df['transmission_type'].mode()[0], inplace=True)
print(df['transmission_type'].isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">automatic    108198
manual         3565
Name: transmission_type, dtype: int64
0</pre></div>



<p class="wp-block-paragraph">We handle body_type analogs as transmission_type and fill the missing values with the mode. The mode is the value that appears most often in the data. The mode of transmission_type is &#8220;Sedan.&#8221; However, this value is not that prevalent, as half of the cars have other body types, e.g., &#8220;SUV.&#8221; Therefore, we will replace the missing values with &#8220;Unknown.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># check the distribution of missing values for body type
print(df['body_type'].value_counts())
# fill values with 'Unknown'
df['body_type'].fillna(&quot;Unknown&quot;, inplace=True)
print(df['body_type'].isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Sedan                 39955
SUV                   23836
sedan                  8377
suv                    4934
Hatchback              4241
                      ...  
cts-v coupe               2
Ram Van                   1
Transit Van               1
CTS Wagon                 1
beetle convertible        1
Name: body_type, Length: 74, dtype: int64
0</pre></div>



<p class="wp-block-paragraph">Now we have handled most of the missing values in our data. However, some variables are still left, with a few missing values. We will make things easy and simply drop all remaining records with missing values. Considering that we have more than 100k records and only a few variables, we can afford to do this without fear of a severe impact on our model performance. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove all other records with missing values
df.dropna(inplace=True)
print(df.isna().sum())</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">prod_year            0
maker                0
model                0
trim                 0
body_type            0
transmission_type    0
state                0
condition            0
odometer             0
ext_color            0
int_color            0
sale_price           0
date                 0
dtype: int64</pre></div>



<p class="wp-block-paragraph">Finally, we check again for missing values and see that everything has been filled. Now, we have a cleansed dataset with 13 columns. </p>



<h4 class="wp-block-heading">2.3 Save a Copy of the Cleaned Data</h4>



<p class="wp-block-paragraph">Before exploring the features, let&#8217;s make a copy of the cleaned data. We will later use this &#8220;full&#8221; dataset to compare the performance of our model with a baseline model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a copy of the dataset with all features for comparison reasons
df_all = df.copy()</pre></div>



<h3 class="wp-block-heading">Step #3 Getting started with Statistical Univariate Analysis</h3>



<p class="wp-block-paragraph">Now it&#8217;s time to analyze the data and explore potential useful features for our subset. Although the process follows a linear flow in this example, you may notice in practice that you must go back and forth between different steps of the feature exploration and engineering process. </p>



<p class="wp-block-paragraph">First, we will look at the variance of the features in the initial dataset. Machine learning models can only learn from variables that have adequate variance. So, low-variance features are often candidates to exclude from the feature subset.</p>



<p class="wp-block-paragraph">We use the .describe() method to display univariate descriptive statistics about the numerical columns in our dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># show statistics for numeric variables
print(df.columns)
df.describe()</pre></div>



<p class="wp-block-paragraph">Next, we check the categorical variables. All variables seem to have a good variance. We can measure the variance with statistical measures or observe it manually using bar charts and scatterplots.</p>



<p class="wp-block-paragraph">We can use histplots to visualize the distributions of the numeric variables. The example below shows the histplot for our target variable sale_price.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Explore the variance of the target variable
variable_name = 'sale_price'
fig, ax = plt.subplots(figsize=(14,5))
sns.histplot(data=df[[variable_name]].dropna(), ax=ax, color='royalblue', kde=True)
ax.get_legend().remove()
ax.set_title(variable_name + ' Distribution')
ax.set_xlim(0, df[variable_name].quantile(0.99))</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9395" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-23-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png" data-orig-size="1051,395" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-23" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-23-1024x385.png" alt="distribution of the target variable in sale price regression; example for feature exploration and preparation with python and sklearn" class="wp-image-9395" width="695" height="261" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/image-23.png 1051w" sizes="(max-width: 695px) 100vw, 695px" /></figure>



<p class="wp-block-paragraph">The histplot shows that sale prices are skewed to the left. This means there are many cheap cars and fewer expensive ones, which makes sense.</p>



<p class="wp-block-paragraph">Next, we create bar plots for categorical values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 3.2 Illustrate the Variance of Numeric Variables 
f_list_numeric = [x for x in df.columns if (is_numeric_dtype(df[x]) and df[x].nunique() &gt; 2)]
f_list_numeric
# box plot design
PROPS = {
    'boxprops':{'facecolor':'none', 'edgecolor':'royalblue'},
    'medianprops':{'color':'coral'},
    'whiskerprops':{'color':'royalblue'},
    'capprops':{'color':'royalblue'}
    }
sns.set_style('ticks', {'axes.edgecolor': 'grey',  
                        'xtick.color': '0',
                        'ytick.color': '0'})
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(len(f_list_numeric) / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(14, nrows*1))
for i, ax in enumerate(fig.axes):
    if i &lt; len(f_list_numeric):
        column_name = f_list_numeric[i]
        sns.boxplot(data=df[column_name], orient=&quot;h&quot;, ax = ax, color='royalblue', flierprops={&quot;marker&quot;: &quot;o&quot;}, **PROPS)
        ax.set(yticklabels=[column_name])
        fig.tight_layout()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9392" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/barplots-to-visualize-the-variance-of-categorical-variables/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png" data-orig-size="1434,425" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="barplots-to-visualize-the-variance-of-categorical-variables" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png" src="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables-1024x303.png" alt="" class="wp-image-9392" width="786" height="232" srcset="https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/barplots-to-visualize-the-variance-of-categorical-variables.png 1434w" sizes="(max-width: 786px) 100vw, 786px" /></figure>



<p class="wp-block-paragraph">We can observe two things: First, the variance of transmission type is low, as most cars have an automatic transmission. So transmission_type is the first variable that we exclude from our feature subset.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop features with low variety
df = df.drop(columns=['transmission_type'])
df.head(2)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	state	condition	odometer	ext_color	int_color	sale_price	date
0	2015		Kia		Sorento	LX		SUV			ca		5.0			16639.0		white		black		21500		2014-12-16
1	2015		Nissan	Altima	2.5 S	Sedan		ca		1.0			5554.0		gray		black		10900		2014-12-30</pre></div>



<p class="wp-block-paragraph">Second, int_color and ext_color have many categorical values. By grouping some of these values that hardly ever occur, we can help the model to focus on the most relevant patterns. However, before we do that, we need to take a closer look at how the target variable differs between the categories. </p>



<h3 class="wp-block-heading">Step #4 Bi-variate Analysis</h3>



<p class="wp-block-paragraph">Now that we have a general understanding of our dataset&#8217;s individual variables, let&#8217;s look at pairwise dependencies. We are particularly interested in the relationship between features and the target variables. Our goal is to keep features whose dependence on the target variable shows some pattern &#8211; linear or non-linear. On the other hand, we want to exclude features whose relationship with the target variable looks arbitrary. </p>



<p class="wp-block-paragraph">Visualizations have to take the datatypes of our variables into account. To illustrate the relation between categorical features and the target, we create boxplots and kdeplots. For numeric (continuous) features, we use scatterplots.</p>



<h4 class="wp-block-heading">4.1 Analyzing the Relation between Features and the Target Variable</h4>



<p class="wp-block-paragraph">We begin by taking a closer look at the int_color and ext_color. We use kdeplots to highlight the distribution of prices depending on different colors. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_kdeplot(column_name):
    fig, ax = plt.subplots(figsize=(20,8))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()
    
make_kdeplot('ext_color')
</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9418" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output-2-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png" data-orig-size="1168,507" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output-2-1024x444.png" alt="Density plots are useful during feature exloration and selection" class="wp-image-9418" width="637" height="275" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output-2.png 768w" sizes="(max-width: 637px) 100vw, 637px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_kdeplot('int_color')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9419" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/output2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/output2.png" data-orig-size="1168,507" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/output2.png" src="https://www.relataly.com/wp-content/uploads/2022/09/output2-1024x444.png" alt="Another density plot that shows the distribution of colors across our car dataset" class="wp-image-9419" width="655" height="283" srcset="https://www.relataly.com/wp-content/uploads/2022/09/output2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/output2.png 1168w" sizes="(max-width: 655px) 100vw, 655px" /></figure>



<p class="wp-block-paragraph">In both cases, a few colors are prevalent and account for most observations. Moreover, distributions of the car price differ for these prevalent colors. These differences look promising as they may help our model to differentiate cheaper cars from more expensive ones. To simplify things, we group the colors that hardly occur into a color category called &#8220;other.&#8221;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Binning features
df['int_color'] = [x if  x in(['black', 'gray', 'white', 'silver', 'blue', 'red']) else 'other' for x in df['int_color']]
df['ext_color'] = [x if  x in(['black', 'gray', 'white', 'silver', 'blue', 'red']) else 'other' for x in df['ext_color']]</pre></div>



<p class="wp-block-paragraph">Next, we create plots for all remaining features. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Vizualising Distributions
f_list = [x for x in df.columns if ((is_numeric_dtype(df[x])) and x != target_name) or (df[x].nunique() &lt; 50)]
f_list_len = len(f_list)
print(f'numeric features: {f_list_len}')
# Adjust plotsize based on the number of features
ncols = 1
nrows = math.ceil(f_list_len / ncols)
fig, axs = plt.subplots(nrows, ncols, figsize=(18, nrows*5))
for i, ax in enumerate(fig.axes):
    if i &lt; f_list_len:
        column_name = f_list[i]
        print(column_name)
        # If a variable has more than 8 unique values draw a scatterplot, else draw a violinplot 
        if df[column_name].nunique() &gt; 100 and is_numeric_dtype(df[column_name]):
            # Draw a scatterplot for each variable and target_name
            sns.scatterplot(data=df, y=target_name, x=column_name, ax = ax)
        else: 
            # Draw a vertical violinplot (or boxplot) grouped by a categorical variable:
            myorder = df.groupby(by=[column_name])[target_name].median().sort_values().index
            sns.boxplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
            #sns.violinplot(data=df, x=column_name, y=target_name, ax = ax, order=myorder)
        ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
        ax.set_title(column_name)
    fig.tight_layout()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="9397" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/boxplots/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png" data-orig-size="1289,2153" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="boxplots" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png" src="https://www.relataly.com/wp-content/uploads/2022/09/boxplots-613x1024.png" alt="boxplots and scatterplots help us to understand the relationship between our features and the target variable" class="wp-image-9397" width="725" height="1211" srcset="https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 613w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 180w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 920w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 1226w, https://www.relataly.com/wp-content/uploads/2022/09/boxplots.png 1289w" sizes="(max-width: 725px) 100vw, 725px" /></figure>



<p class="wp-block-paragraph">Again, for categorical variables, we want to see differences in the distribution of the categories. Based on the boxplot&#8217;s median and the quantiles, we can denote that prod_year, int_color, and condition show adequate variance. The scatterplot for the odometer value also looks good. So we want to keep these features. In contrast, the differences between &#8220;state&#8221; and &#8220;ext_color&#8221; are rather weak. Therefore, we exclude these variables from our subset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop columns with low variance
df.drop(columns=['state', 'ext_color'], inplace=True)</pre></div>



<p class="wp-block-paragraph">Finally, if you want to take a more detailed look at the numeric features, you can use jointplots. These are scatterplots with additional information about the distributions. The example below shows the jointplot for the odometer value vs price. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># detailed univariate and bivariate analysis of 'odometer' using a jointplot 
def make_jointplot(feature_name):
    p = sns.jointplot(data=df, y=feature_name, x=target_name, height=6, ratio=6, kind='reg', joint_kws={'line_kws':{'color':'coral'}})
    p.fig.suptitle(feature_name + ' Distribution')
    p.ax_joint.collections[0].set_alpha(0.3)
    p.ax_joint.set_ylim(df[feature_name].min(), df[feature_name].max())
    p.fig.tight_layout()
    p.fig.subplots_adjust(top=0.95)
make_jointplot ('odometer')
# Alternatively you can use hex_binning
# def make_joint_hexplot(feature_name):
#     p = sns.jointplot(data=df, y=feature_name, x=target_name, height=10, ratio=1, kind=&quot;hex&quot;)
#     p.ax_joint.set_ylim(0, df[feature_name].quantile(0.999))
#     p.ax_joint.set_xlim(0, df[target_name].quantile(0.999))
#     p.fig.suptitle(feature_name + ' Distribution')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11491" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-8-10/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" data-orig-size="425,427" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png" alt="" class="wp-image-11491" width="499" height="502" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-8.png 425w, https://www.relataly.com/wp-content/uploads/2022/12/image-8.png 140w" sizes="(max-width: 499px) 100vw, 499px" /></figure>



<p class="wp-block-paragraph">Here is another example of a jointplot for the variable &#8216;condition.&#8217;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># detailed univariate and bivariate analysis of 'condition' using a jointplot 
make_jointplot('condition')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9423" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/jointplot-condition/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" data-orig-size="425,427" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="jointplot-condition" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" src="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png" alt="Dotplot that shows the relationship between two variables: car condition vs sale price" class="wp-image-9423" width="472" height="475" srcset="https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png 425w, https://www.relataly.com/wp-content/uploads/2022/09/jointplot-condition.png 150w" sizes="(max-width: 472px) 100vw, 472px" /></figure>



<p class="wp-block-paragraph">The graphs show a linear relationship between the price for the condition and the odometer value. </p>



<h4 class="wp-block-heading" id="h-4-2-correlation-matrix">4.2 Correlation Matrix</h4>



<p class="wp-block-paragraph">Correlation analysis is a technique to quantify the dependency between numeric features and a target variable. Different ways exist to calculate the correlation coefficient. For example, we can use Pearson correlation (linear relation), Kendall correlation (ordinal association), or Spearman (monotonic dependence). </p>



<p class="wp-block-paragraph">The example below uses Pearson correlation, which concentrates on the linear relationship between two variables. The Pearson correlation score lies between -1 and 1. General interpretations of the absolute value of the correlation coefficient&nbsp;are:</p>



<ul class="wp-block-list">
<li>.00-.19 &#8220;very weak&#8221;</li>



<li>.20-.39 &#8220;weak&#8221;</li>



<li>.40-.59 &#8220;moderate&#8221;</li>



<li>.60-.79 &#8220;strong&#8221;</li>



<li>.80-1.0 &#8220;very strong&#8221;</li>
</ul>



<p class="wp-block-paragraph">More information on the Pearson correlation can be found <a href="https://www.relataly.com/category/data-science/pearson-correlation/" target="_blank" rel="noreferrer noopener">here</a> and in <a href="https://www.relataly.com/stock-market-correlation-matrix-in-python/103/" target="_blank" rel="noreferrer noopener">this article on the correlation between covid-19 and the stock market</a>.</p>



<p class="wp-block-paragraph">We will calculate a correlation matrix that provides the correlation coefficient for all features in our subset, incl. sale_price.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 4.1 Correlation Matrix
# correlation heatmap allows us to identify highly correlated explanatory variables and reduce collinearity
plt.figure(figsize = (9,8))
plt.yticks(rotation=0)
correlation = df.corr()
ax =  sns.heatmap(correlation, cmap='GnBu',square=True, linewidths=.1, cbar_kws={&quot;shrink&quot;: .82},annot=True,
            fmt='.1',annot_kws={&quot;size&quot;:10})
sns.set(font_scale=0.8)
for f in ax.texts:
        f.set_text(f.get_text())  </pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9400" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-24-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" data-orig-size="646,549" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-24" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png" alt="Heatmap in Python that shows the correlation between selected variables in our car dataset" class="wp-image-9400" width="554" height="471" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-24.png 646w, https://www.relataly.com/wp-content/uploads/2022/09/image-24.png 300w" sizes="(max-width: 554px) 100vw, 554px" /></figure>



<p class="wp-block-paragraph">All our remaining numeric features strongly correlate with price (positive or negative). However, this is not all that matters. Ideally, we want to have features that have a low correlation with each other. We can see that prod_year and condition are moderately correlated (coefficient: 0.5). Because prod_year is more correlated with price (coefficient: 0.6) than condition (coefficient: 0.5), we drop the condition variable. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">df.drop(columns='condition', inplace=True)</pre></div>



<h3 class="wp-block-heading">Step #5 Data Preprocessing </h3>



<p class="wp-block-paragraph">Now our subset contains the following variables:</p>



<ul class="wp-block-list">
<li>prod_year</li>



<li>maker</li>



<li>model</li>



<li>trim</li>



<li>body_type</li>



<li>odometer</li>



<li>int_color</li>



<li>sale_price</li>
</ul>



<p class="wp-block-paragraph">Next, we prepare the data for use as input to train a regression model. Before we train the model, we need to make a few final preparations. For example, we use a label encoder to replace the strong_values of the categorical variables with numeric values.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># encode categorical variables 
def encode_categorical_variables(df):
    # create a list of categorical variables that we want to encode
    categorical_list = [x for x in df.columns if is_string_dtype(df[x])]
    le = LabelEncoder()
    # apply the encoding to the categorical variables
    # because the apply() function has no inplace argument,  we use the following syntax to transform the df
    df[categorical_list] = df[categorical_list].apply(LabelEncoder().fit_transform)
    return df
df_final_subset = encode_categorical_variables(df)
df_all_ = encode_categorical_variables(df_all)
# create a copy of the dataframe but without the target variable
df_without_target = df.drop(columns=[target_name])
df_final_subset.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	prod_year	maker	model	trim	body_type	odometer	int_color	sale_price	date
0	2015		23		594		794		31			16639.0		0			21500		8
1	2015		34		59		98		32			5554.0		0			10900		17
2	2014		2		46		180		32			14414.0		0			49750		8
3	2015		34		59		98		32			11398.0		0			14100		13
4	2015		7		325		789		32			14538.0		0			7200		158</pre></div>



<h3 class="wp-block-heading" id="h-step-6-splitting-the-data-and-training-the-model">Step #6 Splitting the Data and Training the Model</h3>



<p class="wp-block-paragraph">To ensure that our regression model does not know the target variable, we separate car price (y) from features (x). Last, we split the data into separate datasets for training and testing. The result is four different data sets: x_train, y_train, x_test, and y_test.</p>



<p class="wp-block-paragraph">Once the split function has prepared the datasets, we the regression model. Our model uses the Random Decision Forest algorithm from the scikit learn package. As a so-called ensemble model, the Random Forest is a robust Machine Learning algorithm. It considers predictions from a set of multiple independent estimators. </p>



<p class="wp-block-paragraph">The Random Forest algorithm has a wide range of hyperparameters. While we could optimize our model further by testing various configurations (hyperparameter tuning), this is not the focus of this article. Therefore, we will use the default hyperparameters for our model as defined by <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier" target="_blank" rel="noreferrer noopener">scikit-learn</a>. Please visit one of my recent articles on <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">hyperparameter tuning</a>, if you want to learn more about this topic.</p>



<p class="wp-block-paragraph">For comparison reasons, we train two models—one model with our subset of selected features. The second model uses all features, cleansed but without any further manipulations. </p>



<p class="wp-block-paragraph">We use shuffled cross-validation (cv=5) to evaluate our model&#8217;s performance on different data folds.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def splitting(df, name):
    # separate labels from training data
    X = df.drop(columns=[target_name])
    y = df[target_name] #Prediction label
    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
    # print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
    print(name + '')
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test
# train the model
def train_model(X, y, X_train, y_train):
    estimator = RandomForestRegressor() 
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    scores = cross_val_score(estimator, X, y, cv=cv)
    estimator.fit(X_train, y_train)
    return scores, estimator
# train the model with the subset of selected features
X_sub, y_sub, X_train_sub, X_test_sub, y_train_sub, y_test_sub = splitting(df_final_subset, 'subset')
scores_sub, estimator_sub = train_model(X_sub, y_sub, X_train_sub, y_train_sub)
    
# train the model with all features
X_all, y_all, X_train_all, X_test_all, y_train_all, y_test_all = splitting(df_all_, 'fullset')
scores_all, estimator_all = train_model(X_all, y_all, X_train_all, y_train_all)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">subset
train:  (76592, 8) (76592,)
test:  (32826, 8) (32826,)</pre></div>



<h3 class="wp-block-heading" id="h-step-7-comparing-regression-models">Step #7 Comparing Regression Models</h3>



<p class="wp-block-paragraph">Finally, we want to see how the model performs and how its performance compares against the model that uses all variables. </p>



<h4 class="wp-block-heading" id="h-7-1-model-scoring">7.1 Model Scoring</h4>



<p class="wp-block-paragraph">We use different regression metrics to measure the performance. Then we create a barplot that compares the performance scores across the different validation folds (due to cross-validation). </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 7.1 Model Scoring 
def create_metrics(scores, estimator, X_test, y_test, col_name):
    scores_df = pd.DataFrame({col_name:scores})
    # predict on the test set
    y_pred = estimator.predict(X_test)
    y_df = pd.DataFrame(y_test)
    y_df['PredictedPrice']=y_pred
    # Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test, y_pred)
    print('Mean Absolute Error (MAE): ' + str(np.round(MAE, 2)))
    # Mean Absolute Percentage Error (MAPE)
    MAPE = mean_absolute_percentage_error(y_test, y_pred)
    print('Mean Absolute Percentage Error (MAPE): ' + str(np.round(MAPE*100, 2)) + ' %')
    
    # calculate the feature importance scores
    r = permutation_importance(estimator, X_test, y_test, n_repeats=30, random_state=0)
    data_im = pd.DataFrame(r.importances_mean, columns=['feature_permuation_score'])
    data_im['feature_names'] = X_test.columns
    data_im = data_im.sort_values('feature_permuation_score', ascending=False)
    
    return scores_df, data_im
scores_df_sub, data_im_sub = create_metrics(scores_sub, estimator_sub, X_test_sub, y_test_sub, 'subset')
scores_df_all, data_im_all = create_metrics(scores_all, estimator_all, X_test_all, y_test_all, 'fullset')
scores_df = pd.concat([scores_df_sub, scores_df_all],  axis=1)
# visualize how the two models have performed in each fold
fig, ax = plt.subplots(figsize=(10, 6))
scores_df.plot(y=[&quot;subset&quot;, &quot;fullset&quot;], kind=&quot;bar&quot;, ax=ax)
ax.set_title('Cross validation scores')
ax.set(ylim=(0, 1))
ax.tick_params(axis=&quot;x&quot;, rotation=0, labelsize=10, length=0)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Mean Absolute Error (MAE): 1643.39
Mean Absolute Percentage Error (MAPE): 24.36 %
Mean Absolute Error (MAE): 1813.78
Mean Absolute Percentage Error (MAPE): 25.23 %</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="9436" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/image-29-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" data-orig-size="746,468" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" src="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png" alt="barplot that visualizes cross validation for a car price regression model" class="wp-image-9436" width="494" height="310" srcset="https://www.relataly.com/wp-content/uploads/2022/09/image-29.png 746w, https://www.relataly.com/wp-content/uploads/2022/09/image-29.png 300w" sizes="(max-width: 494px) 100vw, 494px" /></figure>



<p class="wp-block-paragraph">The subset model achieves an absolute percentage error of around 24%, which is not so bad. But more importantly, our model performs better than the model that uses all features. However, the subset model is less complex as it only uses eight features instead of 12. So it is easier to understand and less costly to train.</p>



<h4 class="wp-block-heading">7.2 Feature Permutation Importance Scores</h4>



<p class="wp-block-paragraph">Next, we calculate feature importance scores. In this way, we can determine which features attribute the most to the predictive power of our model. Feature importance scores are a useful tool in the feature engineering process, as they provide insights into how the features in our subset contribute to the overall performance of our predictive model. Features with low importance scores can be eliminated from the subset or replaced with other features.</p>



<p class="wp-block-paragraph">Again we will compare our subset model to the model that uses all available features from the initial dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># compare the feature importance scores of the subset model to the fullset model
fig, axs = plt.subplots(1, 2, figsize=(20, 8))
sns.barplot(data=data_im_sub, y='feature_names', x=&quot;feature_permuation_score&quot;, ax=axs[0])
axs[0].set_title(&quot;Feature importance scores of the subset model&quot;)
sns.barplot(data=data_im_all, y='feature_names', x=&quot;feature_permuation_score&quot;, ax=axs[1])
axs[1].set_title(&quot;Feature importance scores of the fullset model&quot;)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="421" data-attachment-id="9437" data-permalink="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/cross-validation-scores-1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png" data-orig-size="1200,493" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cross-validation-scores-1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png" src="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1-1024x421.png" alt="Barplots that compare feature importance between the full dataset model and the subset model" class="wp-image-9437" srcset="https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 1024w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 300w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 768w, https://www.relataly.com/wp-content/uploads/2022/09/cross-validation-scores-1.png 1200w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">In the subset model, most features are relevant to the model&#8217;s performance. Only date and int_color do not seem to have a significant impact. For the full set model, five out of 12 features hardly contribute to the model performance (date, int_color, ext_color, state, transmission_type). </p>



<p class="wp-block-paragraph">Once you have a strong subset of features, you can automate the feature selection process using different techniques, e.g., forward or backward selection. Automated feature selection techniques will test different model variants with varying feature combinations to determine the best input dataset. This step is often done at the end of the feature engineering process. However, this is something for another article. </p>



<h2 class="wp-block-heading" id="h-conclusions">Conclusions</h2>



<p class="wp-block-paragraph">That&#8217;s it for now! This tutorial has presented an exploratory approach to feature exploration, engineering, and selection. You have gained an overview of tools and graphs that are useful in identifying and preparing features. The second part was a Python hands-on tutorial. We followed an exploratory feature engineering process to build a regression model for car prices. We used various techniques to discover and sort features and make a vital feature subset. These techniques include data cleansing, descriptive statistics, and univariate and bivariate analysis (incl. correlation). We also used some techniques for feature manipulation, including binning. Finally, we compared our subset model to one that uses all available data. </p>



<p class="wp-block-paragraph">If you take away one learning from this article, remember that in machine learning, less is often more. So training classic machine learning models on carefully curated feature subsets likely outperforms models that use all available information. </p>



<p class="wp-block-paragraph">I hope this article was helpful. I am always trying to improve and learn from my audience. So, if you have any questions or suggestions, please write them in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3eD49Kv" target="_blank" rel="noreferrer noopener">Zheng and Casari (2018) Feature Engineering for Machine Learning</a></li>



<li><a href="https://amzn.to/3TrBdDY" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3T38bLe" target="_blank" rel="noreferrer noopener">Chip Huyen (2022) Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications</a></li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph">Stock-market prediction is a typical regression problem. To learn more about feature engineering for stock-market prediction, check out <a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener">this article on multivariate stock-market forecasting</a>.</p>
<p>The post <a href="https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/">Feature Engineering and Selection for Regression Models with Python and Scikit-learn</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/exploratory-feature-preparation-for-regression-with-python-and-scikit-learn/8832/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8832</post-id>	</item>
		<item>
		<title>Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</title>
		<link>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/</link>
					<comments>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 02 May 2022 18:34:02 +0000</pubDate>
				<category><![CDATA[Affinity Propagation (Clustering)]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Coinmarketcap API]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Covariance]]></category>
		<category><![CDATA[Crypto Exchange APIs]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Dimensionality Reduction]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Cryptocurrencies]]></category>
		<category><![CDATA[Financial Analysis]]></category>
		<category><![CDATA[Stock Market Cluster Analysis]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=8114</guid>

					<description><![CDATA[<p>Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos. By analyzing historical price data, affinity propagation groups coins into clusters based on their past ... <a title="Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python" class="read-more" href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" aria-label="Read more about Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/">Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos.</p>



<p class="wp-block-paragraph">By analyzing historical price data, affinity propagation groups coins into clusters based on their past price fluctuations. Such a cluster analysis enables crypto investors to identify promising entry and exit points, ultimately helping them make smarter investment decisions.</p>



<p class="wp-block-paragraph">To use this technique effectively, it&#8217;s important to understand essential concepts such as covariance, lasso regression, and affinity propagation. Once you understand these concepts, you can apply them to analyze price time series data and identify hidden patterns.</p>



<p class="wp-block-paragraph">Finally, visualizing the results in two and three dimensions can better understand the relationships between coins and their respective clusters. The resulting crypto market map can be a powerful tool for investors to gain insight into the market&#8217;s structure and make informed investment decisions.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<div class="wp-block-kadence-infobox kt-info-box_317393-a1"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-infobox-textcontent"><h2 class="kt-blocks-info-box-title">Disclaimer</h2><p class="kt-blocks-info-box-text">This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.</p></div></span></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">What is Stock Market Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Clustering stock markets refers to grouping stocks based on their similarities or common characteristics. This can be done using various clustering algorithms, which analyze the data and assign each stock market to a cluster based on its similarity to other stock markets in the same cluster. In this article, we will run a cluster analysis on historical time series data. This approach involves grouping stocks into clusters based on their historical performance over a certain period of time. </p>



<p class="wp-block-paragraph">Clustering stock market data can be useful for a variety of purposes, such as identifying patterns or trends in the data, comparing the performance of different stocks or sectors, or generating investment recommendations. However, it&#8217;s important to keep in mind that clustering is just one tool among many for analyzing stock market data, and it&#8217;s important to consider a range of factors when making investment decisions. It can also be used to compare the performance of different stock markets and identify potential risks or correlations between them.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/cryptocurrency-price-charts-with-color-overlay-python/2820/" target="_blank" rel="noreferrer noopener">Color-Coded Cryptocurrency Price Charts in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="509" height="506" data-attachment-id="12694" data-permalink="https://www.relataly.com/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" data-orig-size="509,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="neural network machine learning python affinity propagation midjourney relataly crypto-min" data-image-description="&lt;p&gt;neural network machine learning python affinity propagation midjourney relataly crypto-min&lt;/p&gt;
" data-image-caption="&lt;p&gt;neural network machine learning python affinity propagation midjourney relataly crypto-min&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" alt="neural network machine learning python midjourney relataly crypto market map" class="wp-image-12694" srcset="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 140w" sizes="(max-width: 509px) 100vw, 509px" /><figcaption class="wp-element-caption">We can use a crypto market map to illustrate the price correlation between cryptocurrencies.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What&#8217;s the Problem with Prototype-based Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Clustering is an unsupervised learning technique that groups similar objects into clusters and separates them from different ones. One of the most popular clustering techniques is <a href="https://www.relataly.com/category/machine-learning-algorithms/k-means/" target="_blank" rel="noreferrer noopener">k-means</a>. K-means belongs to the so-called prototype-based clustering techniques, which divide data points into a predefined number of groups (in the case of k-means, the groups are of equal variance). </p>



<p class="wp-block-paragraph">The prototype-based clustering approach works great if the number of clusters in a dataset is known and the clusters have similar despair. However, when we deal with real-world problems, we often encounter more complex data for which the optimal number of clusters is unknown and difficult or even impossible to guess. In such a case, affinity propagation has a significant advantage because it can automatically estimate the number of clusters. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Affinity Propagation: What it is and How it Works </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The idea of affinity propagation is to identify clusters by measuring the similarity of data points relative to one another. The algorithm chooses data points as cluster centers that best represent other data points near them. </p>



<p class="wp-block-paragraph">We can imagine the process of identifying these representative data points as an election. Each data point (i) is a voter who casts votes and a candidate (k) who can receive votes from other voters. Votes are a measure of the similarity of data points. A voter who gives many votes to a candidate expresses that this data point is similar to him and therefore is suitable for representing him as a cluster center. The voting process continues until the algorithm reaches a consensus and selects a set number of cluster candidates.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="383" data-attachment-id="8208" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png" data-orig-size="1584,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-5-1024x383.png" alt="Affinity Propagation: Data points cast votes for candidates and receive votes from other data points " class="wp-image-8208" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1584w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Affinity Propagation: Data points cast votes for candidates and receive votes from other data points </figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The clustering process involves many separate steps (<a href="https://towardsdatascience.com/unsupervised-machine-learning-affinity-propagation-algorithm-explained-d1fef85f22c8" target="_blank" rel="noreferrer noopener">This article</a> provides a detailed description of the steps involved) and works with several matrices: </p>



<ul class="wp-block-list">
<li>The similarity matrix assesses the suitability of data points (candidates) to act as cluster centers.</li>



<li>The availability matrix (or responsibility matrix) collects the support of the data points for the candidates (potential cluster centers) and their suitability to represent them.</li>



<li>The criterion matrix sums up the results and defines the clusters. Data points with equal scores in the criterion matrix are considered part of the same cluster.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="691" height="334" data-attachment-id="8274" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" data-orig-size="691,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-9" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" alt="Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster" class="wp-image-8274" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png 691w, https://www.relataly.com/wp-content/uploads/2022/05/image-9.png 300w" sizes="(max-width: 691px) 100vw, 691px" /><figcaption class="wp-element-caption">Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-time-series-clustering-using-affinity-propagation-visualizing-cryptocurrency-market-structures-in-python">Time Series Clustering using Affinity Propagation &#8211; Visualizing Cryptocurrency Market Structures in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Ready to implement affinity propagation in Python to analyze the crypto market structure and create a visual representation of price similarity? Let&#8217;s dive in!</p>



<p class="wp-block-paragraph">First, we define a portfolio of cryptocurrencies and download their historical price quotes from coinmarketcap. We then visualize the time series on separate line charts to ensure that the data has been loaded successfully. After preparing and cleaning the data, we can move on to clustering the cryptocurrencies into groups with similar price movements using Affinity Propagation.</p>



<p class="wp-block-paragraph">Unlike other clustering algorithms, we don&#8217;t set the number of clusters in advance. Instead, we let affinity propagation determine the optimal number of clusters for our portfolio. Finally, we calculate the covariance matrix between clusters and arrange the cryptocurrencies on a 2D map into clusters. We create a network overlay based on covariance to better understand the relationships between different clusters.</p>



<p class="wp-block-paragraph">With affinity propagation, we can identify hidden patterns in the crypto market and group coins into clusters based on their past price fluctuations. This process allows us to identify promising entry and exit points, ultimately helping us make smarter investment decisions. Plus, the 2D map and network overlay help us visualize the relationships between different clusters and coins.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="456" height="509" data-attachment-id="10401" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-33-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" data-orig-size="456,509" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-33" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" alt="exemplary price correlation map created with the help of the affinity propagation clustering algorithm, python scikit-learn" class="wp-image-10401" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png 456w, https://www.relataly.com/wp-content/uploads/2022/12/image-33.png 269w" sizes="(max-width: 456px) 100vw, 456px" /><figcaption class="wp-element-caption">We can use affinity propagation to cluster financial assets and visualize them on a map.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The Python code for this tutorial is available in the relataly repository on GitHub.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_03b447-31"><a class="kb-button kt-button button kb-btn_03610b-ed kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/00%20Data%20Visualization/042%20Vizualizing%20Stock%20Market%20Structures%20using%20Cluster%20Analysis%20in%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_81f956-88 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. Consider <a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda </a>if you don&#8217;t have a Python environment set up yet. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></li>



<li><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></li>



<li><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></li>



<li><a href="https://seaborn.pydata.org/api.html" target="_blank" rel="noreferrer noopener">Seaborn</a></li>
</ul>



<p class="wp-block-paragraph">Please also make sure you have the <a href="https://pypi.org/project/cryptocmd/" target="_blank" rel="noreferrer noopener">Cmcscaper</a> package installed. We will be using it to download past crypto prices from coinmarketcap.</p>



<p class="wp-block-paragraph">You can install these packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-stock-market-data">Step #1: Load the Stock Market Data</h3>



<p class="wp-block-paragraph">We start by loading historical crypto price data from Coinmarketcap. To download the data, we use Cmcscraper, a Python library that allows us to collect Coinmarketcap data without signing up for the official API.</p>



<p class="wp-block-paragraph">The download returns a dataframe with daily price quotes (Close, Open, Avg) for cryptocurrencies between 2016 and today. You can use the dictionary (&#8220;symbol_dict&#8221;) to control which cryptos you want to include in the data. We limit the data we use in our cluster analysis to the last 50 days. In this way, we let the correlation consider earlier price developments. But it&#8217;s up to you to specify a different period. In addition, instead of using absolute price values, we will use daily percentage fluctuations.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/streaming-crypto-prices-via-the-gate-io-api-with-python/3982/" target="_blank" rel="noreferrer noopener">Requesting Crypto Price Data from the Gate.io REST API in Python</a></p>



<p class="wp-block-paragraph">Loading the data can take several minutes, depending on how many cryptocurrencies we include in the request. So it makes sense not to load the data every time you run the code. Therefore, the code below stores the historical prices in a CSV file. </p>



<p class="wp-block-paragraph">The script will check if the data already exists if you run the code below. If it does, it will use the data from the CSV file. Otherwise, it will load a fresh copy of the data from coinmarketcap.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5

from cryptocmd import CmcScraper
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn import cluster, covariance, manifold
import requests
import json


#get a dictionary of the top 100 coin symbols and names from an API
def get_symbol_dict():
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&amp;limit=50&amp;sortBy=market_cap&amp;sortType=desc&amp;convert=USD&amp;cryptoType=all&amp;tagType=all&amp;audited=false'
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['data']['cryptoCurrencyList'])

    # exclude stable coins
    df = df[~df['symbol'].isin(['USDT', 'USDC', 'BUSD', 'DAI', 'TUSD', 'PAX', 'GUSD', 'HUSD', 'USDK', 'USDS', 'USDP', 'USDN', 'USDSB', 'USDX', 'USD++', 'BIDR', 'IDRT', 'VAI', 'BGBP'])]
    df = df[['symbol', 'name']]
    df = df.set_index('symbol')
    df = df.to_dict()
    df = df['name']
    return df

symbol_dict = get_symbol_dict()


# Download historic crypto prices via CmcScraper
def load_fresh_data_and_save_to_disc(symbol_dict, save_path):
    # Extract symbols and names from the symbol_dict
    symbols, names = np.array(sorted(symbol_dict.items())).T
    
    # Initialize an empty DataFrame for storing the prices
    df_crypto = pd.DataFrame()

    # Download and process the price data for each symbol
    for symbol in symbols:
        print(f&quot;Fetching prices for {symbol}...&quot;)
        
        # Download the price data using CmcScraper
        scraper = CmcScraper(symbol)
        df_coin_prices = scraper.get_dataframe()

        # Process the price data and add it to df_crypto
        df = pd.DataFrame({
            f&quot;{symbol}_Open&quot;: df_coin_prices[&quot;Open&quot;],
            f&quot;{symbol}_Close&quot;: df_coin_prices[&quot;Close&quot;],
            f&quot;{symbol}_Avg&quot;: (df_coin_prices[&quot;Close&quot;] + df_coin_prices[&quot;Open&quot;]) / 2,
            f&quot;{symbol}_p&quot;: (df_coin_prices[&quot;Open&quot;] - df_coin_prices[&quot;Close&quot;]) / df_coin_prices[&quot;Open&quot;]
        })
        df_crypto = pd.concat([df_crypto, df], axis=1)

    # Save the price data to a CSV file
    X_df_filtered = df_crypto.filter(like=&quot;_p&quot;)
    X_df_filtered.to_csv(save_path + &quot;historical_crypto_prices.csv&quot;)

    return names, symbols, X_df_filtered
        

# If set to False the data will only be downloaded when you execute the code
# Set to True, if you want a fresh copy of the data.  
fetch_new_data = True 
save_path = '' # path where the price data will be stored in a csv file

# Fetch fresh data via the scraping package, or use data from the csv file on disk
if fetch_new_data == False:
    try:
        print('loading from disk')
        X_df_filtered = pd.read_csv(save_path + 'historical_crypto_prices.csv')
        if 'Unnamed: 0' in X_df_filtered.columns: 
            X_df_filtered = X_df_filtered.drop(['Unnamed: 0'], axis=1)
            symbols, names = np.array(sorted(symbol_dict.items())).T
        print(list(X_df_filtered.columns))
    except:
        print('no existing price data found - loading fresh data from coinmarketcap and saving them to disk')
        names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
        print(list(symbols))
else:
       print('loading fresh data from coinmarketcap and saving them to disk')
       names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
       print(list(symbols))

# Limit the price data to the last t days
t= 14 # in days
X_df_filtered = X_df_filtered[:t]
X_df_filtered.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	ACM_p		ADA_p		ARK_p		ATM_p		ATOM_p		AVAX_p		BAT_p		BCH_p		BLZ_p		BNB_p		...	THETA_p		UNI_p		USDT_p		VET_p		WAVES_p		XLM_p		XMR_p		XRP_p		ZIL_p		ZRX_p
0	0.031987	-0.037645	-0.005702	0.030928	-0.005897	-0.012404	-0.012262	-0.022529	0.008072	-0.007111	...	-0.021994	-0.023758	-0.000103	-0.021024	-0.015416	-0.004096	-0.022988	-0.027397	-0.016659	-0.012255
1	0.028192	0.065034	0.122306	0.010310	0.093558	0.106811	0.082863	0.075567	0.062105	0.054733	...	0.067264	0.081040	0.000136	0.077203	0.092987	0.078562	0.111519	0.071696	0.076484	0.085094
2	0.040771	0.016097	-0.133345	0.018963	0.011304	-0.033328	-0.007616	0.011458	-0.019993	0.005134	...	-0.005104	-0.024190	0.000077	0.002218	0.008920	0.004139	-0.031822	-0.012107	-0.003906	-0.021170
3	-0.027698	0.005129	-0.031516	-0.002639	0.022235	-0.008117	0.003969	0.019119	0.015403	0.005920	...	0.007992	0.027203	0.000003	0.000701	0.010739	0.005324	-0.007914	0.007168	0.004556	-0.003786
4	-0.021129	-0.019053	0.003273	-0.008121	0.002883	-0.004927	0.002548	-0.000599	0.028492	-0.012181	...	0.000198	-0.025817	-0.000047	-0.002800	-0.051515	-0.004861	0.015134	-0.000596	-0.010343	0.004530</pre></div>



<p class="wp-block-paragraph">The data looks good, so let&#8217;s continue.</p>



<h3 class="wp-block-heading">Step #2 Plotting Crypto Price Charts</h3>



<p class="wp-block-paragraph">Now that the data is available, we can visualize it in various line graphs. The visualization helps us better understand what kind of data we are dealing with and check if the download was successful.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create Prices Charts for all Cryptocurrencies
list_length = X_df_filtered.shape[1]
ncols = 10
nrows = int(round(list_length / ncols, 0))
height = list_length/3 if list_length &gt; 30 else 4
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True, figsize=(20, height))
for i, ax in enumerate(fig.axes):
        if i &lt; list_length:
            sns.lineplot(data=X_df_filtered, x=X_df_filtered.index, y=X_df_filtered.iloc[:, i], ax=ax)
            ax.set_title(X_df_filtered.columns[i])
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8232" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/price-charts-stock-price-prediction/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png" data-orig-size="1176,790" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="price-charts-stock-price-prediction" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png" src="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction-1024x688.png" alt="Clustering Crypto Market Structures with Affinity Propagation: Daily Price Quotes for different Cryptocurrencies" class="wp-image-8232" width="937" height="629" srcset="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 1176w" sizes="(max-width: 937px) 100vw, 937px" /></figure>



<p class="wp-block-paragraph">We can see the lineplots for all cryptocurrencies and everything looks as expected.</p>



<h3 class="wp-block-heading" id="h-step-3-clustering-cryptocurrencies-using-affinity-propagation">Step #3 Clustering Cryptocurrencies using Affinity Propagation</h3>



<p class="wp-block-paragraph">Next, we must prepare the data and run the affinity propagation algorithm. For some cryptocurrencies, we may encounter data that contains NaN values. Because clustering is sensitive to missing values, we must ensure good data quality. In addition, the Python code below will convert the DataFrame into a NumPy array and transpose it into a form where we have crypto assets as records and the days as columns.</p>



<p class="wp-block-paragraph">Running the code below returns a dictionary of clusters with the cryptocurrencies assigned to them by the affinity propagation algorithm.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop NaN values
X_df = pd.DataFrame(np.array(X_df_filtered)).dropna()
# Transpose the data to structure prices along columns
X = X_df.copy()
X /= X.std(axis=0)
X = np.array(X)
# Define an edge model based on covariance
edge_model = covariance.GraphicalLassoCV()
# Standardize the time series
edge_model.fit(X)
# Group cryptos to clusters using affinity propagation
# The number of clusters will be determined by the algorithm
cluster_centers_indices , labels = cluster.affinity_propagation(edge_model.covariance_, random_state=1)
cluster_dict = {}
n_labels = labels.max()
print(f&quot;{n_labels} Clusters&quot;)
for i in range(n_labels + 1):
    clusters = ', '.join(names[labels == i])
    print('Cluster %i: %s' % ((i + 1), clusters))
    cluster_dict[i] = (clusters)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">9 Clusters
Cluster 1: Binance Coin, Cake Defi
Cluster 2: Bitcoin Cash, Bitcoin, BitTorrent, Decred, EOS, Ethereum Classic, Ethereum, Ampleforth, Komodo, Solana, Sys Coin, DOT
Cluster 3: Celsius
Cluster 4: Doge Coin
Cluster 5: Cardano, ATOM, Avalance, Enjin, Internet Computer, Link, Loopring, Polygon, IOTA, NEO, Synthetix, Theta, Vechain
Cluster 6: Litecoin
Cluster 7: ACM Token, Atletico Madrid Token, Chilliz, Juventus Turin Token, PSG Token
Cluster 8: LRC
Cluster 9: Tether
Cluster 10: ARK, Battoken, BLZ, Digibyte, AS Rom Token, WAVES, Stellar Lumen, Monero, Ripple, Zilliqa, Zer0</pre></div>



<p class="wp-block-paragraph">We can see that the algorithm has identified 13 different clusters in the data and a couple of clusters with only a single member. You will most likely encounter different results depending on when you run it. </p>



<h3 class="wp-block-heading">Step #4 Create a 2D Positioning Model based on the Graph Structure</h3>



<p class="wp-block-paragraph">In addition to clusters, we want to show the covariance between cryptocurrencies in our Crypto Market map. We need a graph-like structure that contains the covariance and position data of the cryptocurrencies for each crypto pair.</p>



<p class="wp-block-paragraph">In addition, we use a node position model that calculates their relative position on a 2D plane from the covariance of the cryptocurrencies. However, the positions are only relative, so the absolute axes have no meaning.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a node_position_model that find the best position of the cryptos on a 2D plane
# The number of components defines the dimensions in which the nodes will be positioned
node_position_model = manifold.LocallyLinearEmbedding(n_components=2, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Create an edge_model that represents the partial correlations between the nodes
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
# Only consider partial correlations above a specific threshold (0.02)
non_zero = (np.abs(np.triu(partial_correlations, k=1)) &gt; 0.02)
# Convert the Positioning Model into a DataFrame
data = pd.DataFrame.from_dict({&quot;embedding_x&quot;:embedding[0],&quot;embedding_y&quot;:embedding[1]})
# Add the labels to the 2D positioning model
data[&quot;labels&quot;] = labels
print(data.shape)
data.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(48, 3)
	embedding_x	embedding_y	labels
0	0.400590	-0.136473	6
1	-0.081908	-0.086039	4
2	-0.033982	-0.038526	9
3	0.416745	0.076849	6
4	-0.041938	0.031966	4</pre></div>



<p class="wp-block-paragraph">The next step is to create a graph of the partial correlations. </p>



<h3 class="wp-block-heading">Step #5 Visualize the Crypto Market Structure</h3>



<p class="wp-block-paragraph">Our goal is to visualize differences in the covariance between crypto pairs by varying the connection strengths. We calculate the line strength by normalizing the covariance of the crypto pairs. In addition, we visualize the distribution of the covariance. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create an array with the segments for connecting the data points
start_idx, end_idx = np.where(non_zero) 
segments = [[np.array([embedding[:, start], embedding[:, stop]]).T, start, stop] for start, stop in zip(start_idx, end_idx)]
# Create a normalized representation of partial correlation between crypto currencies
# We can later use covariance to vizualize the strength of the connections
pc = np.abs(partial_correlations[non_zero])
normalized = (pc-min(pc))/(max(pc)-min(pc))
# plot the distribution of covariance between the cryptocurrencies
sns.histplot(pc)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8251" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/covariance-histogram/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" data-orig-size="382,248" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="covariance-histogram" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" src="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" alt="cryptocurrency market structure visualization with affinity propagation, histogram of covariance between historical crypto prices" class="wp-image-8251" width="540" height="351" srcset="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png 382w, https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png 300w" sizes="(max-width: 540px) 100vw, 540px" /></figure>



<p class="wp-block-paragraph">The hist plot shows that the covariance between the crypto pairs is mostly below 0.005.</p>



<p class="wp-block-paragraph">Finally, it is time to map cryptocurrencies on a 2D plane. To do this, we first define the cryptocurrencies using their relative position data with a scatterplot. We set the color of the points based on their clusters so that points in the same cluster are colored the same. Subsequently, we connect the points to the data from the edge model. The covariance between the crypto pairs determines the strength of their connections.</p>



<p class="wp-block-paragraph">We also define the color of the connections as follows. </p>



<ul class="wp-block-list">
<li>The map only shows connections with a covariance greater than 0.002.</li>



<li>Connections with a covariance greater than 0.05 are colored red. </li>



<li>Otherwise, connections between points within a cluster are shown in the cluster&#8217;s color. </li>



<li>We color connections in grey that are between points of different clusters.</li>
</ul>



<p class="wp-block-paragraph">Last but not least, we add the labels of the cryptocurrencies.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualization
plt.figure(1, facecolor='w', figsize=(20, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])

# Plot the nodes using the coordinates of our embedding
sc = sns.scatterplot(
    data=data,
    x=&quot;embedding_x&quot;,
    y=&quot;embedding_y&quot;,
    zorder=1,
    s=350 * d ** 2,
    c=labels,
    cmap=plt.cm.nipy_spectral,
    alpha=.9,
    #palette=&quot;muted&quot;,
)

# Plot the covariance edges between the nodes (scatter points)
line_strength = 3.2
    
for index, ((x, y), start, stop) in enumerate(segments):     
    norm_partial_correlation = normalized[index]
    if list(data.iloc[[start]]['labels'])[0] == list(data.iloc[[stop]]['labels'])[0]:
        if norm_partial_correlation &gt; 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = plt.cm.nipy_spectral(list(data.iloc[[start]]['labels'])[0] / float(n_labels)); linestyle='solid'
    else:
        if norm_partial_correlation &gt; 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = 'grey'; linestyle='dashed'
    # Plot the edges
    # if x and y larger than 0
    if x[0] &gt; 0 and y[0] &gt; 0:
        plt.plot(x, y, alpha=.4, zorder=0, linewidth=normalized[index]*line_strength, color=color, linestyle=linestyle)

    
# Labels the nodes and position the labels to avoid overlap with other labels
for name, label, (x, y) in zip(names, labels, embedding.T):
    color = plt.cm.nipy_spectral(label / float(n_labels))
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
        fontsize=10,
        color='black',
        bbox=dict(facecolor='w', edgecolor=&quot;w&quot;, alpha=.0),
     )</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8229" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/cryptocurrency-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png" data-orig-size="1473,590" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cryptocurrency-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map-1024x410.png" alt="Visualizing the Crypto Market Structure - Clusters of Cryptocurrencies determined by Affinity Propagation, Connections between cryptocurrencies defined by partial covariance in daily price fluctuations" class="wp-image-8229" width="1177" height="471" srcset="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 1473w" sizes="(max-width: 1177px) 100vw, 1177px" /></figure>



<p class="wp-block-paragraph">Note that you will likely see a different map when you run the code on your machine. Differences result from changes in market prices and covariance that lead to other graph structures. </p>



<p class="wp-block-paragraph">Let&#8217;s see what the crypto market map tells us.</p>



<h4 class="wp-block-heading">Interpreting the Cryptomarket Map</h4>



<p class="wp-block-paragraph">The 2D crypto market map tells us several things:</p>



<ul class="wp-block-list">
<li>Most cryptos fall into the light green and dark green clusters corresponding to different types of crypto (Decentralized Finance Coins, NFT/Metaverse Coins).</li>



<li>There is a significant covariance between large-cap players in the crypto space, such as Cardano and Loopring and Ethereum and Bitcoin, which is plausible considering recent price movements. Some results are surprising, for example, the partial correlation between NEO and Ethereum Classic. </li>



<li>Some clusters are isolated and contain only a single member, for example, Tether, Komodo, AC Milan token, Wave token, and Dogecoin). The reason is that the prices of these coins/tokens have developed independently of the market.<ul><li> Tether is a stablecoin that does not change in price. It, therefore, strongly differs from the other cryptocurrencies on our map. </li></ul>
<ul class="wp-block-list">
<li>Komodo has been trading sideways without following the general market trend. </li>



<li>And the MCM token is a soccer token that has recently outperformed the market.</li>
</ul>
</li>



<li>Soccer tokens are colored in dark blue. These tokens&#8217; prices correlate with how the soccer clubs performed during the current season. It, therefore, makes perfect sense that these tokens are grouped into a cluster. An exception is the AC Milan token, which recently performed better than the other soccer tokens.</li>
</ul>



<h3 class="wp-block-heading">Step #6 Creating a 3D Representation</h3>



<p class="wp-block-paragraph">Instead of a 2D representation of the data points, we can also use a 3D node positioning model. For this purpose, the node positioning model distributes the affinity values over three dimensions.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Find the best position of the cryptos on a 3D plane
node_position_model = manifold.LocallyLinearEmbedding(n_components=3, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) &gt; 0.02)
data = pd.DataFrame.from_dict({&quot;embedding_x&quot;:embedding[0],&quot;embedding_y&quot;:embedding[1],&quot;embedding_z&quot;:embedding[1]})
data[&quot;labels&quot;] = labels
data[&quot;names&quot;] = names
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')
xs = data[&quot;embedding_x&quot;]
ys = data[&quot;embedding_y&quot;]
zs = data[&quot;embedding_z&quot;]
sc = ax.scatter(xs, ys, zs, c=labels, s=100)
    
for i in range(len(data)):
    x = xs[i]
    y = ys[i]
    z = zs[i]
    label = data[&quot;names&quot;][i]
    ax.text(x, y, z, label)
    
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8243" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/3d-cluster-representation-affinity-propagation/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png" data-orig-size="1210,1101" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="3d-cluster-representation-affinity-propagation" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png" src="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation-1024x932.png" alt="3d representation of the cryptocurrency market structure, affinity propagation relataly" class="wp-image-8243" width="891" height="811" srcset="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 1210w" sizes="(max-width: 891px) 100vw, 891px" /></figure>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Affinity propagation is a powerful technique for clustering items when the optimal number of clusters is unknown. In this article, we&#8217;ve demonstrated how to apply affinity propagation to analyze the cryptocurrency market and identify groups of assets based on similar price fluctuations.</p>



<p class="wp-block-paragraph">In our example, we identified 13 groups of cryptocurrencies without specifying the number of clusters in advance. We also visualized the market structure on a 2D and 3D map using a node distribution technique. This approach can be extended to analyze and cluster stock markets, highlighting complex price patterns among multiple financial assets.</p>



<p class="wp-block-paragraph">Once you&#8217;ve identified clusters, you can dive deeper into individual groups. Sometimes, outliers that temporarily break out of their usual pattern indicate interesting investment opportunities. These outliers can eventually return to the price pattern of their group, or they may represent forerunners of their group, indicating broader market movements.</p>



<p class="wp-block-paragraph">By using affinity propagation, we can visualize financial assets in a new and exciting way. If you have any questions or comments about this approach, please let me know.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">This article modifies some of the code from <a href="https://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py" target="_blank" rel="noreferrer noopener">Scikit-learn and adapts it from the stock market</a> to cryptocurrencies.</p>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li>Images are created using Midjourney, an AI that creates images from text.</li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/">Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8114</post-id>	</item>
		<item>
		<title>Leveraging Distributed Computing for Weather Analytics with PySpark</title>
		<link>https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/</link>
					<comments>https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 03 Apr 2022 13:49:00 +0000</pubDate>
				<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[PySpark]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Weather Analytics]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[BigData Analytics]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2739</guid>

					<description><![CDATA[<p>Apache Spark is a popular distributed computing framework for Big Data processing and analytics. In this tutorial, we will work hands-on with PySpark, Spark&#8217;s Python-specific interface. We built on the conceptual knowledge gained in a previous tutorial: Introduction to BigData Analytics with Apache Spark, in which we learned about the essential concepts behind Apache Spark ... <a title="Leveraging Distributed Computing for Weather Analytics with PySpark" class="read-more" href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/" aria-label="Read more about Leveraging Distributed Computing for Weather Analytics with PySpark">Read more</a></p>
<p>The post <a href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/">Leveraging Distributed Computing for Weather Analytics with PySpark</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Apache Spark is a popular distributed computing framework for Big Data processing and analytics. In this tutorial, we will work hands-on with PySpark, Spark&#8217;s Python-specific interface. We built on the conceptual knowledge gained in a previous tutorial: <a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark/6324/" target="_blank" rel="noreferrer noopener"><em>Introduction to BigData Analytics with Apache Spark</em></a>, in which we learned about the essential concepts behind Apache Spark and its distributed architecture. </p>



<p class="wp-block-paragraph">The PySpark library provides access to Apache Spark APIs and other cool features like SQL, DataFrame, Streaming, Spark Core, and MLlib for machine learning. Several of these features will help us to prepare and analyze a historical dataset gathered by a weather station in Zurich. You will gain an overview of essential PySpark functions for transforming and querying data in a local computing environment.</p>



<p class="wp-block-paragraph">The rest of this tutorial proceeds as follows: In the first sections, we will ingest, join, clean, and transform the data using a broad set of PySpark&#8217;s DataFrame functions. In addition, we will touch on the topic of user-defined functions to perform custom transformations. Finally, we will analyze and visualize the data using the Seaborn Python library. As part of the analysis, we work with PySpark SQL, which allows us to register datasets as tables and query them using PySpark SQL syntax.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-kadence-image kb-image_a0bd8f-64 size-large is-resized"><img decoding="async" data-attachment-id="6442" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/image-10-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/image-10.png" data-orig-size="1474,547" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-10" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/image-10.png" src="https://www.relataly.com/wp-content/uploads/2022/03/image-10-1024x380.png" alt="Do plot on historical weather data from Zurich, created using PySpark and Seaborn" class="kb-img wp-image-6442" width="693" height="257" srcset="https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 1474w" sizes="(max-width: 693px) 100vw, 693px" /><figcaption>We can use PySpark and Seaborn to create dot plots like the one above. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Weather Analytics?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Weather analytics is the process of using data and analytics to understand and predict weather patterns and their impacts. This can involve analyzing historical weather data to identify trends and patterns, using mathematical models to simulate and forecast future weather conditions, and using machine learning algorithms to make more accurate predictions. Weather analytics can be used for a wide range of applications, such as predicting the likelihood of severe weather events, improving agricultural planning and decision-making, and managing energy consumption and supply. By providing insights into future weather conditions, weather analytics can help businesses and organizations to make more informed decisions and mitigate the potential impacts of extreme weather.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph"></p>



<h2 class="wp-block-heading" id="h-analyzing-historical-weather-data-using-pyspark">Analyzing Historical Weather Data using PySpark</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Weather and climate data are interesting resources, not only for weather forecasting but also for dozens of other purposes across different industries. A central area is certainly understanding climate change. Data scientists analyze climate data collected from weather stations around the globe to forecast how climate trends and events unfold. Because the amounts of data are often large, climate analytics is a common application area for distributed computing.</p>



<p class="wp-block-paragraph">This tutorial guides us through a good use case for distributed data processing. We will process and analyze a set of historical weather data collected from a weather station in Zurich. We aim to explore the relations between climate variables such as temperate, wind, snowfall, and precipitation. However, the primary purpose of this tutorial is to familiarize ourselves with the essential functions of PySpark. Some topics covered in this part are:</p>



<ul class="wp-block-list">
<li>reading and writing DataFrames</li>



<li>selecting, renaming, and manipulating columns</li>



<li>filtering, dropping, sorting, and aggregating rows</li>



<li>joining and transforming data</li>



<li>working with UDFs and Spark SQL functions</li>



<li>While we implement these functions, we will also look into what PySpark does under the hood. </li>
</ul>



<p class="wp-block-paragraph">PySpark can run on a simulated distributed environment on your local machine. So there is no need for expensive hardware.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_03b447-31"><a class="kb-button kt-button button kb-btn_7ddab7-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/09%20Distributed%20Analytics/300%20Distributed%20Computing%20-%20Analyzing%20Zurich%20Weather%20Data%20using%20PySpark.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_236915-75 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="514" height="511" data-attachment-id="12434" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/view-of-zurich-spark-machine-learning-python-tutorial/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png" data-orig-size="514,511" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="view-of-zurich-spark-machine-learning-python-tutorial" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png" src="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png" alt="" class="wp-image-12434" srcset="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png 514w, https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png 140w" sizes="(max-width: 514px) 100vw, 514px" /><figcaption class="wp-element-caption">Zurich weather knows all seasons.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">This tutorial assumes that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment. You can follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a> to prepare your Anaconda environment if you have not set it up yet. I recommend using Anaconda or Visual Studio Code, but any other Python environment will do.</p>



<p class="wp-block-paragraph">It is also assumed that you have the following packages installed:&nbsp;<em>pandas, <em>matplotlib, and</em><a href="https://seaborn.pydata.org/#" target="_blank" rel="noreferrer noopener"> seaborn</a></em> for visualization. You can install the packages with the console command:&nbsp;<em>pip install &lt;package name&gt;</em>&nbsp;or, if you use the anaconda packet manager,&nbsp;<em>conda install &lt;package name&gt;</em>. In addition, you need to install the PySpark library. </p>



<p class="wp-block-paragraph"><em>pip install &lt;package name&gt;</em> <em>conda </em></p>



<p class="wp-block-paragraph"><em>install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</p>



<h3 class="wp-block-heading" id="h-download-the-weather-data">Download the Weather Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">We will train our model with a public weather dataset that contains daily weather information from Zurich, Switzerland. The data was collected at a weather station between 1979-01-01 and 2021-01-01 and has been divided into two sets, &#8220;A&#8221; and &#8220;B.&#8221; After downloading the files, place them under the following directory: <em>root-folder/data/weather/</em></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="511" height="482" data-attachment-id="12445" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/weather-analytics-data-machine-learning-python-spark/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png" data-orig-size="511,482" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="weather-analytics-data-machine-learning-python-spark" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png" src="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png" alt="" class="wp-image-12445" srcset="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png 511w, https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png 300w" sizes="(max-width: 511px) 100vw, 511px" /><figcaption class="wp-element-caption">We will work with a set of weather data from Zurich. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_9a621c-00"><a class="kb-button kt-button button kb-btn_4b1def-4b kt-btn-size-standard kt-btn-width-type-fixed kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-false wp-block-button__link wp-block-kadence-singlebtn" href="https://www.relataly.com/wp-content/uploads/2022/03/zurich-weather-part-a.csv" download="" target="_blank" rel="noreferrer noopener"><span class="kt-btn-inner-text">Download File A </span></a></div>



<ul class="wp-block-list">
<li>date in format YYYY-mm-dd</li>



<li>the wind direction in degrees</li>



<li>max snow depth in mm</li>



<li>precipitation in mm/m²</li>



<li>air pressure in hPa</li>



<li>max wind speed in km/h</li>



<li>average wind speed in km/h</li>



<li>daily sunshine in minutes</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_e65253-c2"><a class="kb-button kt-button button kb-btn_a200bc-4e kt-btn-size-standard kt-btn-width-type-fixed kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-false wp-block-button__link wp-block-kadence-singlebtn" href="https://www.relataly.com/wp-content/uploads/2022/03/zurich-weather-part-b.csv" download="" target="_blank" rel="noreferrer noopener"><span class="kt-btn-inner-text">Download File B </span></a></div>



<ul class="wp-block-list">
<li>date string in format YYYY-MM-DD</li>



<li>the minimum temperature in °C</li>



<li>maximum temperature in °C</li>



<li>avergage temperature in °C</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"></div>
</div>



<h3 class="wp-block-heading">Step #1 Initialize SparkSession</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The first step of any Spark application is to create a SparkSession. Older versions of Spark were using contexts for accessing Spark&#8217;s different modules. Since Spark 2.0, SparkSession has replaced SparkContext and has become the unified entry point for accessing Spark APIs and communicating with the cluster.</p>



<p class="wp-block-paragraph">We can create new SparkSessions by using the getOrCreate() method. This method checks if there is an existing&nbsp;<a href="https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession" target="_blank" rel="noreferrer noopener">SparkSession</a>, and if it does not find a current Session, the process creates a new instance. Since we are working with a local PySpark installation, this will make a new SparkSession. If you have a SparkCluster available, you can use this cluster by providing the function with the cluster address.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="6431" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/picture1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg" data-orig-size="611,360" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Picture1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg" src="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg" alt="PySpark is the Python-specific interface of Apache Spark" class="wp-image-6431" width="240" height="142" srcset="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg 611w, https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg 300w" sizes="(max-width: 240px) 100vw, 240px" /><figcaption class="wp-element-caption">PySpark is the Python-specific interface of Apache Spark</figcaption></figure>
</div>
</div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, isnan, when, count, udf, year, month, to_date, mean
import pyspark.sql.functions as F
import seaborn as sns
import matplotlib.pyplot as plt

# Create my_spark
spark = SparkSession.builder.getOrCreate()
print(spark)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">&lt;pyspark.sql.session.SparkSession object at 0x000001F8C0D10B20&gt;</pre></div>



<h3 class="wp-block-heading">Step #2 Load the Data to a Spark DataFrame</h3>



<p class="wp-block-paragraph">Next, we want to load two CSV files by invoking the PySpark read function from the SparkSession instance. The function returns a Spark DataFrame, an abstraction layer for interacting with tabular data. </p>



<h4 class="wp-block-heading">2.1 Reading File A</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Let&#8217;s read the first file by invoking the read method on the dataframe. Spark will evaluate the read function lazily when we run the read function in the code below. So the read operation does not trigger computation and waits until we call an action. </p>



<p class="wp-block-paragraph">After running the printSchema method in the code below, the Spark engine performs the &#8220;read&#8221; operation and shows us the result. We would not see any computations if we only used the read function (a transformation) and not any actions. However, we also call summary() to compute some basic statistics on the DataFrame. These action methods trigger Spark to read the data and create a new RDD that contains the data.</p>



<p class="wp-block-paragraph">Spark supports many formats for reading data, including CSV, JSON, Delta, Parquet, TXT, AVRO, etc. Our data files are in CSV format, separated by a comma. Therefore, we use the .csv function. The code below reads a CSV file into a Spark DataFrame.</p>



<figure class="wp-block-table is-style-regular"><table><thead><tr><th>DataFrame Action</th><th>Description</th></tr></thead><tbody><tr><td>show()</td><td>Displays the top n rows of a DataFrame</td></tr><tr><td>count()</td><td>Counts the number of rows in a DataFrame</td></tr><tr><td>collect()</td><td>Collects the data from the worker nodes and returns them in an array</td></tr><tr><td>summary(), describe()</td><td>Show statistics for string and numeric columns</td></tr><tr><td>first(), head()</td><td>Returns the first row, or several first rows</td></tr><tr><td>take()</td><td>Collects the n first rows and returns them in an array</td></tr></tbody></table><figcaption class="wp-element-caption">Actions Methods on the Spark DataFrame API</figcaption></figure>
</div>
</div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Read File A
spark_weather_df_a = spark.read \
    .option(&quot;header&quot;, False) \
    .option(&quot;sep&quot;, &quot;,&quot;) \
    .option(&quot;inferSchema&quot;, True) \
    .csv(path=f'data/weather/zurich_weather_a')
    
spark_weather_df_a.printSchema()
spark_weather_df_a.describe().show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: string (nullable = true)

+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+
|summary|               _c0|       _c1|              _c2|               _c3|              _c4|              _c5|               _c6|               _c7|               _c8| _c9|
+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+
|  count|             15384|     15384|            15363|             15384|            15243|              885|               897|               897|               897|   0|
|   mean|182.61966978679146|      null|9.552613421857709|3.0395995839833567|7.952896411467559|156.8146892655367| 7.334894091415833|27.502229654403596|1017.9076923076926|null|
| stddev|105.74046684963878|      null|7.476849234845438| 6.527361918363559|31.45995738637225|96.54941299923699|4.8985850971655704|16.091181159085917| 8.236341299054192|null|
|    min|                 0|1979-01-01|            -18.0|               0.0|              0.0|              0.0|               2.2|               7.4|             984.7|null|
|    max|               366|2021-01-01|             28.0|              97.8|            550.0|            358.0|              40.4|             113.0|            1042.8|null|
+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/04/image-5-1024x349.png" alt="" class="wp-image-6734"/></figure>



<h4 class="wp-block-heading" id="h-2-2-reading-file-b">2.2 Reading File B</h4>



<p class="wp-block-paragraph">In the code below, you can see the option for header, separate, and inferShema. We are using Spark&#8217;s schema inference option to read the CSV data. However, we must use this function carefully because inferring schema can affect performance. Spark is more sensitive to data types, and inferring the schema does not always work. When we are unsure how the data looks, it is a better solution to make the data types explicit to Spark. In our case, inferSchema has been tested before to ensure it works.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Read File B
spark_weather_df_b = spark.read \
    .option(&quot;header&quot;, False) \
    .option(&quot;sep&quot;, &quot;,&quot;) \
    .option(&quot;inferSchema&quot;, True) \
    .csv(path=f'data/weather/zurich_weather_b')

spark_weather_df_b.printSchema()
spark_weather_df_b.describe().show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)

+-------+------------------+----------+------------------+-----------------+
|summary|               _c0|       _c1|               _c2|              _c3|
+-------+------------------+----------+------------------+-----------------+
|  count|             15383|     15383|             15374|            15368|
|   mean|182.63121627770917|      null|13.705762976453755| 6.06695731389901|
| stddev|105.73420482794731|      null|  8.83091914731549|6.620985806040894|
|    min|                 0|1979-01-01|             -15.2|            -20.8|
|    max|               366|2021-01-01|              36.0|             22.5|
+-------+------------------+----------+------------------+-----------------+</pre></div>



<h3 class="kt-adv-heading_ab6af5-fa wp-block-kadence-advancedheading" data-kb-block="kb-adv-heading_ab6af5-fa">Step #3 Join the Data</h3>



<p class="wp-block-paragraph">Next, we will join the data from the two files and merge them into a single Spark DataFrame. Whenever we run a transformation on a DataFrame, whether via the SQL or the DataFrame API, we submit a query over to the query plan of the Spark query execution engine. The execution engine will then optimize and execute the query plan. The result is a new Resilient Distributed Dataset (RDD) with the transformed data. Keep this in mind, and imagine what happens underneath the hood as we proceed.</p>



<p class="wp-block-paragraph">We now have two files containing specific columns of our weather data. To make sense of the data and prepare them for analytics, we will merge the two files into a single DataFrame. In addition, we perform several transformations to bring the data into shape.</p>



<ul class="wp-block-list">
<li>Join the two datasets using the Join function</li>



<li>Remove some columns that we won&#8217;t need via the drop functions</li>



<li>Define new column names</li>



<li>Select and rename columns using Select() with Alias()</li>
</ul>



<p class="wp-block-paragraph">First, we drop column_c0 because we will not need it.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop unused Columns
spark_weather_df_a = spark_weather_df_a.drop(&quot;_c0&quot;)
spark_weather_df_b = spark_weather_df_b.drop(&quot;_c0&quot;)</pre></div>



<p class="wp-block-paragraph">There are different ways how we can rename columns in PySpark. The first (A) is withColumnRenamed. It requires a separate function call for each column we want to rename. The second is with a custom function and a dictionary. In this way, we can rename multiple columns at once. </p>



<p class="wp-block-paragraph">After renaming the columns, we use the join function to merge the two datasets based on the row column.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># method A: rename individual columns
spark_weather_df_b_renamed = spark_weather_df_b.withColumnRenamed(&quot;_c1&quot;, &quot;date&quot;) \
    .withColumnRenamed(&quot;_c2&quot;, &quot;max_temp&quot;) \
    .withColumnRenamed(&quot;_c3&quot;, &quot;min_temp&quot;) 
        
# method B: rename multiple columns at once   
def rename_multiple_columns(df, columns):
    if isinstance(columns, dict):
        return df.select(*[F.col(col_name).alias(columns.get(col_name, col_name)) for col_name in df.columns])
    else:
        raise ValueError(&quot;columns need to be in dict format {'existing_name_a':'new_name_a', 'existing_name_b':'new_name_b'}&quot;)

dict_columns = {&quot;_c1&quot;: &quot;date2&quot;, 
                &quot;_c2&quot;: &quot;avg_temp&quot;, 
                &quot;_c3&quot;: &quot;precip&quot;,
                &quot;_c4&quot;: &quot;snow&quot;,
                &quot;_c5&quot;: &quot;wind_dir&quot;,
                &quot;_c6&quot;: &quot;wind_speed&quot;,
                &quot;_c7&quot;: &quot;wind_power&quot;,
                &quot;_c8&quot;: &quot;air_pressure&quot;,
                &quot;_c9&quot;: &quot;sunny_hours&quot;,}
spark_weather_df_a_renamed = rename_multiple_columns(spark_weather_df_a , dict_columns)

# Join the dataframes
spark_weather_df = spark_weather_df_a_renamed.join(spark_weather_df_b_renamed, spark_weather_df_a_renamed.date2 == spark_weather_df_b_renamed.date, &quot;inner&quot;)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-gain-a-quick-overview-of-the-data">Step #4 Gain a Quick Overview of the Data</h3>



<p class="wp-block-paragraph">When we perform transformations on a new dataset, we typically want to understand the outcome and print out several functions to gain an overview of the data. We can make our lives easier by writing a small function for this purpose, which we can call as needed. Running the custom function below gives us a quick overview of the data. It takes a DataFrame as an argument and performs the following steps:</p>



<ul class="wp-block-list">
<li>Return the two first records </li>



<li>Counting NAN values for all columns that have double or float datatype</li>



<li>Print duplicate values based on the date column. To display the result, we can use the Limit function in combination with the toPandas function</li>



<li>Prints the schema</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def quick_overview(df):
   # display the spark dataframe
   print(&quot;FIRST RECORDS&quot;)
   print(df.limit(2).sort(col(&quot;date&quot;), ascending=True).toPandas())

   # count null values
   print(&quot;COUNT NULL VALUES&quot;)
   print(df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c, y in df.dtypes if y in [&quot;double&quot;, &quot;float&quot;]]
      ).toPandas())

   # print(&quot;DESCRIBE STATISTICS&quot;)
   # print(df.describe().toPandas())
   # # Alternatively to get the max value, we could use max_value = df.agg({&quot;precipitation&quot;: &quot;max&quot;}).collect()[0][0]

   # check for dublicates
   dublicates = df.groupby(spark_weather_df.date) \
    .count() \
    .where('count &gt; 1') \
    .limit(5).toPandas()
   print(dublicates)

   # print schema
   print(&quot;PRINT SCHEMA&quot;)
   print(df.printSchema())

quick_overview(spark_weather_df)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FIRST RECORDS
        date2  avg_temp  precip  snow  wind_dir  wind_speed  wind_power  \
0  2020-01-01      -1.8     0.0   0.0      63.0         4.2        14.8   
1  2020-01-01      -1.8     0.0   0.0      63.0         4.2        14.8   

   air_pressure sunny_hours        date  max_temp  min_temp  
0        1034.6        None  2020-01-01      -0.9      -2.9  
1        1034.6        None  2020-01-01      -0.9      -2.9  
COUNT NULL VALUES
   avg_temp  precip  snow  wind_dir  wind_speed  wind_power  air_pressure  \
0        21       0   143     14577       14565       14565         14565   

   max_temp  min_temp  
0         9        15  
         date  count
0  1986-01-01      4
1  2011-01-01      4
2  2005-01-01      4
3  1985-01-01      4
4  2006-01-01      4
PRINT SCHEMA
root
 |-- date2: string (nullable = true)
 |-- avg_temp: double (nullable = true)
 |-- precip: double (nullable = true)
...
 |-- max_temp: double (nullable = true)
 |-- min_temp: double (nullable = true)

None</pre></div>



<p class="wp-block-paragraph">As we can see, our data has duplicates and missing values. But don&#8217;t worry; we will handle these data quality issues in the next step.</p>



<h3 class="wp-block-heading">Step #5 Clean the Data using Chained Operations</h3>



<p class="wp-block-paragraph">Spark uses a design pattern that allows us to chain multiple operations. In the next step, we use this feature when we clean and filter the data. On execution, Spark will decide the order in which it performs the transformations as part of an optimized execution plan.</p>



<p class="wp-block-paragraph">We begin cleaning the data by replacing Null values. Afterward, we will reduce the data size by selecting specific columns in a separate section.</p>



<h4 class="wp-block-heading">5.1 Replacing NA Values with Means</h4>



<p class="wp-block-paragraph">As we saw in the previous step, there are several NULL values in the temperature and wind columns. Having complete datasets without gaps is preferable when performing historical data analytics. Therefore, we will replace the missing values with the average value. Running the code below, first, calculate the mean values. Subsequently, we use these mean values to replace the Null values. In the second step, we can utilize Spark&#8217;s method-chaining functionality and perform the replace action for several columns in a single line of code. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># calculate mean values
avg = spark_weather_df.filter(spark_weather_df.avg_temp.isNotNull())\
    .select(mean(col('min_temp')).alias('mean_min'), 
            mean(col('max_temp')).alias('mean_max'), 
            mean(col('wind_speed')).alias('mean_wind')).collect()

mean_min = avg[0]['mean_min']
mean_max = avg[0]['mean_max']
mean_wind = avg[0]['mean_wind']

# replace na values with mean values
spark_weather_df = spark_weather_df \
    .na.fill(value=mean_min, subset=[&quot;min_temp&quot;]) \
    .na.fill(value=mean_max, subset=[&quot;max_temp&quot;]) \
    .na.fill(value=mean_wind, subset=[&quot;wind_speed&quot;]) \
    .na.fill(value=0, subset=[&quot;snow&quot;]) 
    
</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Empty DataFrame Columns: [date, count] Index: []</pre></div>



<h4 class="wp-block-heading">5.2 Removing Duplicates</h4>



<p class="wp-block-paragraph">We still need to remove duplicate values now that we have a complete dataset. In addition, we will use the following DataFrame methods:</p>



<ul class="wp-block-list">
<li><strong>orderBy</strong>: To sort the data.</li>



<li><b>With column</b>: Select columns and convert data types (to Date type). We can find an overview of the data types supported by PySpark here.</li>



<li><strong>drop</strong>: Used to eliminate individual columns from a dataset.</li>



<li><strong>dropDublicates</strong>: Used to eliminate duplicate records from the dataset.</li>
</ul>



<p class="wp-block-paragraph">Again, we will use PySpark&#8217;s method chaining feature and invoke all transformations in a single line of code.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove dublicates and drop column date2, convert date to datatype &quot;date&quot;, Sort the Data by Date
# drop date2 column, convert date column from string to date type, order by date
spark_cleaned_df = spark_weather_df.dropDuplicates()\
    .drop(col(&quot;date2&quot;))\ # only for demonstration, actually not necessary because we are using select at the end
    .withColumn(&quot;date&quot;, to_date(col(&quot;date&quot;),&quot;yyyy-MM-dd&quot;)) \
    .orderBy(col(&quot;date&quot;)) \
    .select(col(&quot;date&quot;),col(&quot;avg_temp&quot;),col(&quot;min_temp&quot;),col(&quot;max_temp&quot;),col(&quot;wind_speed&quot;),col(&quot;snow&quot;),col(&quot;precip&quot;)) # select columns

quick_overview(spark_cleaned_df)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FIRST RECORDS
         date  avg_temp  min_temp  max_temp  wind_speed  snow  precip
0  1979-01-01      -5.3     -12.2      -2.5    7.326082   0.0     1.0
1  1979-01-02     -10.0     -12.2      -4.2    7.326082  10.0     1.7
COUNT NULL VALUES
   avg_temp  min_temp  max_temp  wind_speed  snow  precip
0         0         0         0           0     0       0</pre></div>



<h3 class="wp-block-heading">Step #6 Transform the Data using User Defined Functions (UDFs)</h3>



<p class="wp-block-paragraph">In addition to the standard methods, PySpark offers the possibility to execute custom Python code in the Spark cluster. Such functions are called User Defined Functions (UDF). When we use UDFs, we should be aware that we won&#8217;t achieve the same performance level as standard PySpark functions.</p>



<p class="wp-block-paragraph">We want to create several bucks containing several temperature classes in the following. We can use UDFs to create these buckets. However, sometimes we can only achieve a task by using UDFs. By running the code below, we first create a normal Python function called &#8220;binner&#8221; and then hand over this function to a UDF called udf_binner_temp. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># More Infos on User Defined Functions: https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/
# create bucket column for temperature
def binner(min_temp, max_temp):
        if (min_temp is None) or (max_temp is None):
            return &quot;unknown&quot;
        else:
            if min_temp &lt; -10:
                return &quot;freezing cold&quot;
            elif min_temp &lt; -5:
                return &quot;very cold&quot;
            elif min_temp &lt; 0:
                return &quot;cold&quot;
            elif max_temp &lt; 10:
                return &quot;normal&quot;
            elif max_temp &lt; 20:
                return &quot;warm&quot;
            elif max_temp &lt; 30:
                return &quot;hot&quot;
            elif max_temp &gt;= 30:
                return &quot;very hot&quot;
        return &quot;normal&quot;


udf_binner_temp = udf(binner, StringType() )
spark_cleaned_df = spark_cleaned_df.withColumn(&quot;temp_buckets&quot;, udf_binner_temp(col(&quot;min_temp&quot;), col(&quot;max_temp&quot;)))
spark_cleaned_df.limit(10).toPandas()</pre></div>



<p class="wp-block-paragraph">In the same way, we create buckets to cover different levels of precipitation.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create new columns for bucket percipitation and for month and year
udf_binner_precip = udf(lambda x: &quot;very rainy&quot; if x &gt; 50 else (&quot;rainy&quot; if x &gt; 0 else &quot;dry&quot;), StringType())
spark_cleaned_df = spark_cleaned_df \
    .withColumn(&quot;precip_buckets&quot;, udf_binner_precip(&quot;precip&quot;)) \
    .withColumn(&quot;month&quot;, month(spark_cleaned_df.date)) \
    .withColumn(&quot;year&quot;, year(spark_cleaned_df.date)) 
spark_cleaned_df.limit(5).toPandas()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	date		avg_temp	min_temp	max_temp	wind_speed	snow		precip		temp_buckets
0	1979-01-01	-5.3		-12.2		-2.5		7.326082	0.0			1.0			freezing cold
1	1979-01-02	-10.0		-12.2		-4.2		7.326082	10.0		1.7			freezing cold
2	1979-01-03	-5.8		-7.9		-3.9		7.326082	110.0		0.0			very cold
3	1979-01-04	-8.4		-10.5		-4.4		7.326082	100.0		0.0			freezing cold
4	1979-01-05	-10.0		-11.2		-7.5		7.326082	70.0		0.0			freezing cold
5	1979-01-06	-6.4		-10.1		-2.8		7.326082	50.0		0.0			freezing cold
6	1979-01-07	-3.8		-5.1		-2.9		7.326082	40.0		0.0			very cold
7	1979-01-08	-2.4		-4.5		1.4			7.326082	40.0		5.1			cold
8	1979-01-09	1.4			-0.4		3.8			7.326082	20.0		3.2			cold
9	1979-01-10	0.4			-1.4		2.8			7.326082	10.0		5.4			cold</pre></div>



<p class="wp-block-paragraph">These steps complete the data preparation. Next, we will focus on analyzing the data.</p>



<h3 class="wp-block-heading">Step #7 Exemplary Data Analysis using Seaborn</h3>



<p class="wp-block-paragraph">We can analyze the dataset now that we have a clean and consistent set of historical weather data. To aid our analysis, we will visualize the data. Since PySpark does not yet provide its functions for visualization, we will use the Python library Seaborn for visualization.</p>



<p class="wp-block-paragraph">PySpark provides two different ways to analyze data. The first is standard PySpark methods executed on a DataFrame (sums, counts, average, groupby). The second is PySpark SQL, which allows us to query the data using basic SQL syntax. </p>



<p class="wp-block-paragraph">The PySpark transformation functions and PySpark SQL are optimized for distributed, parallelized computations on large data sets. In the following, we will explore both ways. This section shows how we can now use the data to create different types of plots:</p>



<ul class="wp-block-list">
<li>Temperature scatterplot with colored dots to highlight historical developments</li>



<li>Heatmap on mean max temperature</li>



<li>Scatterplots that illustrate three-dimensional relationships</li>
</ul>



<h4 class="wp-block-heading">7.1 Temperature Scatterplot with Colored Dots</h4>



<p class="wp-block-paragraph">In this section, we create a line plot colored by temperature buckets.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the dimensions for the scatterplot
fig, ax = plt.subplots(figsize=(28,8))
sns.scatterplot(hue=&quot;temp_buckets&quot;, y=&quot;avg_temp&quot;, x=&quot;date&quot;, data=spark_cleaned_df.toPandas(), palette=&quot;Spectral_r&quot;)

# plot formatting 
ax.tick_params(axis=&quot;x&quot;, rotation=30, labelsize=10, length=0)

# title formatting
mindate = str(spark_cleaned_df.agg({'date': 'min'}).collect()[0]['min(date)'])
maxdate = str(spark_cleaned_df.agg({'date': 'max'}).collect()[0]['max(date)'])
ax.set_title(&quot;average temperature in Zurich: &quot; + mindate + &quot; - &quot; + maxdate)
plt.xlabel(&quot;Year&quot;)
plt.show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	year  	 month   mean(avg_temp)  mean(max_temp)  mean(min_temp)
0  1979      5       19.180000       26.940000       12.900000
1  1979      6       19.200000       25.842857       14.385714
2  1979      7       20.828571       27.071429       15.357143
3  1979      8       20.200000       27.000000       14.216667
4  1979      9       19.000000       25.400000       13.700000</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="319" data-attachment-id="11743" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/image-2-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-2.png" data-orig-size="1619,504" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-2-1024x319.png" alt="" class="wp-image-11743" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 1536w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 1619w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">The outliers to the bottom and top are interesting. While the upward temperature peaks remain relatively stable, the lower temperatures have decreased in the past decades.</p>



<h4 class="wp-block-heading">7.2 Heatmap on Mean Max Temperature</h4>



<p class="wp-block-paragraph">Next, we want to create a heatmap illustrating how maximum temperature develops over years and months. We first use PySpark to compute the mean of the maximum temperature over years and months. Then we use seaborn to create a heatmap.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">spark_cleaned_df_agg = spark_cleaned_df.select(col(&quot;year&quot;),col(&quot;month&quot;),col(&quot;max_temp&quot;)) \
    .filter(spark_cleaned_df.year &lt; 2021)\
    .groupby(col(&quot;year&quot;),col(&quot;month&quot;))\
    .agg(mean(&quot;max_temp&quot;).alias(&quot;mean_max_temp&quot;)) \
    .orderBy(col(&quot;year&quot;)).toPandas()

plt.figure(figsize=(24,6))
avg_temp_df = spark_cleaned_df_agg.pivot(&quot;month&quot;, &quot;year&quot;, &quot;mean_max_temp&quot;)
sns.heatmap(avg_temp_df, cmap=&quot;Spectral_r&quot;, linewidths=.5)</pre></div>



<figure class="wp-block-image size-large is-resized is-style-default"><img decoding="async" data-attachment-id="6145" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/output-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/output-3.png" data-orig-size="1219,371" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/output-3.png" src="https://www.relataly.com/wp-content/uploads/2022/03/output-3-1024x312.png" alt="PySpark big data analytics tutorial - weather heatmap" class="wp-image-6145" width="933" height="284" srcset="https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 1219w" sizes="(max-width: 933px) 100vw, 933px" /></figure>



<h4 class="wp-block-heading">7.3 Scatterplots</h4>



<p class="wp-block-paragraph">Next, the goal is to understand the relationship between temperature and weather effects, such as wind or snow. In the following, we will create different scatterplots to display the relationship between other variables. We will show the relationship between min (x-axis) and max temperature (y-axis). In addition, we color the dots to visualize relationships with an additional variable. The code below creates three such scatter plots. The plots display the relationships between the following variables:</p>



<ul class="wp-block-list">
<li>The first plots should illustrate the relationship between temperature and snowfall</li>



<li>We want to demonstrate how wind speed depends on the daily min and max temperature</li>



<li>We want to create a plot that visualizes how daily min and max temperature change over the month</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">fig, axes= plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(30, 10))
fig.subplots_adjust(hspace=0.5, wspace=0.2)

palette = sns.color_palette(&quot;ch:start=.2,rot=-.3&quot;, as_cmap=True)
sns.scatterplot(ax = axes[0], hue=&quot;snow&quot;, size=&quot;snow&quot;, y=&quot;max_temp&quot;, x=&quot;min_temp&quot;, data=spark_cleaned_df.toPandas(), alpha=1.0, palette=palette)
axes[0].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[0].set_title(&quot;min - max temperature seperated by snow&quot;)

sns.scatterplot(ax = axes[1], hue=&quot;wind_speed&quot;, size=&quot;wind_speed&quot;, y=&quot;max_temp&quot;, x=&quot;min_temp&quot;, data=spark_cleaned_df.toPandas(), alpha=1.0, palette='rocket_r')
axes[1].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[1].set_title(&quot;min - max temperature seperated by wind speed&quot;)

sns.scatterplot(ax = axes[2], hue=&quot;month&quot;, y=&quot;max_temp&quot;, x=&quot;min_temp&quot;, data=spark_cleaned_df.toPandas(), alpha=1.0, palette='Spectral_r', hue_norm=(1,12), legend=&quot;full&quot;)
axes[2].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[2].set_title(&quot;min - max temperature seperated by month&quot;)
</pre></div>



<figure class="wp-block-image size-large is-resized is-style-default"><img decoding="async" data-attachment-id="6101" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/output/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/output.png" data-orig-size="1448,497" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/output.png" src="https://www.relataly.com/wp-content/uploads/2022/03/output-1024x351.png" alt="PySpark big data analytics tutorial - weather dot plots" class="wp-image-6101" width="941" height="324" srcset="https://www.relataly.com/wp-content/uploads/2022/03/output.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/output.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/output.png 1448w" sizes="(max-width: 941px) 100vw, 941px" /></figure>



<p class="wp-block-paragraph">There are various things we can learn from these plots:</p>



<ol class="wp-block-list">
<li>It is not surprising that most of the snowfall happens at low temperatures. We can also see that most snowfall is around zero degrees Celsius. We also observe some outliers in deep temperature and higher temperature areas. </li>



<li>The highest wind speeds are located in the middle of the plot, where we can also observe the most significant daily difference between the minimum and the maximum temperatures.</li>



<li>The last chart reflects the different seasons. Through the colored areas, we can distinguish the temperature ranges of the individual months.</li>
</ol>



<h3 class="wp-block-heading" id="h-step-8-pyspark-sql">Step #8 PySpark SQL</h3>



<p class="wp-block-paragraph">PySpark SQL combines classic SQL syntax with distributed computations, which results in increased query performance. Spark SQL is similar to conventional SQL dialects such as MySQL or Microsoft Server. However, Spark provides a dedicated set of functions for everyday data operations such as string transformations, type conversions, etc. The Spark SQL <a href="https://spark.apache.org/docs/latest/sql-ref.html" target="_blank" rel="noreferrer noopener">reference guide</a> offers more information on this topic. </p>



<p class="wp-block-paragraph">This section aims to extend our analysis by using PySpark SQL. We will display how the daily temperature averages (for minimum, maximum, and mean) have changed. </p>



<p class="wp-block-paragraph">First, we have to register our dataset as a temporary view. Once we have done that, we can run ad-hoc queries against the View using Spark SQL commands. As you can see, we can use traditional SQL syntax here. The SQL function returns the results in the form of a DataFrame. Running the code below will perform these actions and visualize the results in a line plot.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Data Analysis using PySpark.SQL

# register the dataset as a temp view, so we can use sq
spark_cleaned_df.createOrReplaceTempView(&quot;waeather_data_temp_view&quot;)

# see how event numbers have evolved over years
events_over_years_df = spark.sql( \
        'SELECT year, month, mean(avg_temp), mean(max_temp), mean(min_temp) \
        FROM waeather_data_temp_view \
        WHERE max_temp &gt; 25 \
        GROUP BY month, year \
        ORDER BY year, month')
print(events_over_years_df.limit(5).toPandas())

plt.figure(figsize=(16,6))
fig = sns.lineplot(y=&quot;mean(avg_temp)&quot;, x=&quot;year&quot;, data=events_over_years_df.toPandas(), color= &quot;orange&quot;)
fig = sns.lineplot(y=&quot;mean(max_temp)&quot;, x=&quot;year&quot;, data=events_over_years_df.toPandas(), color= &quot;red&quot;)
fig = sns.lineplot(y=&quot;mean(min_temp)&quot;, x=&quot;year&quot;, data=events_over_years_df.toPandas(), color= &quot;blue&quot;)
plt.grid()
plt.show()
</pre></div>



<figure class="wp-block-image size-full is-resized is-style-default"><img decoding="async" data-attachment-id="6146" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/output-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png" data-orig-size="951,371" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png" src="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png" alt="PySpark big data analytics tutorial 
- Temperature curves created" class="wp-image-6146" width="995" height="388" srcset="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png 951w, https://www.relataly.com/wp-content/uploads/2022/03/output-4.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/output-4.png 768w" sizes="(max-width: 995px) 100vw, 995px" /></figure>



<p class="wp-block-paragraph">We can express the same Spark query with the DataFrame API. The result is the same.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Using PySpark standard functions
events_over_years_df = spark_cleaned_df\
    .filter(spark_cleaned_df.year &lt; 2021)\
    .groupby(col(&quot;year&quot;))\
    .agg(mean(&quot;avg_temp&quot;).alias(&quot;mean_avg_temp&quot;), 
         mean(&quot;min_temp&quot;).alias(&quot;mean_min_temp&quot;), 
         mean(&quot;max_temp&quot;).alias(&quot;mean_max_temp&quot;))\
    .orderBy(col(&quot;year&quot;)).toPandas()

plt.figure(figsize=(15,6))</pre></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">Congratulation, you have made your first steps with distributed computing! This tutorial has demonstrated PySpark &#8211; a distributed computing framework for big data processing and analytics. After initializing the Spark session, we prepared a weather dataset and tested several PySpark functions to process and analyze it. Among the primary data operations were:</p>



<ul class="wp-block-list">
<li>Ingestion of CSV files with the read method</li>



<li>Joining PySpark DataFrames</li>



<li>Cleaning the data by treating duplicates and Null values</li>



<li>Filtering the data and selecting specific columns</li>



<li>Performing custom computations with UDFs</li>



<li>Running analytics and creating plots</li>
</ul>



<p class="wp-block-paragraph">Although we have used a wide range of functions, we touched only upon the surface of PySpark&#8217;s capabilities. We can do many cool things with PySpark, including real-time data streaming and efficient training of big data machine learning models. We will cover these topics in separate tutorials.</p>



<p class="wp-block-paragraph">Thanks for reading. I am always happy to receive feedback on my articles. If you have any questions, please let me know in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3gcImKf" target="_blank" rel="noreferrer noopener">Wenig Damji Das and Lee (2020) Learning Spark: Lightning-fast Data Analytics</a></li><li><a href="https://amzn.to/3exFyHb" target="_blank" rel="noreferrer noopener">Chambers and Zaharu (2018) Spark: The Definitive Guide: Big data processing made simple</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">In-Memory Cluster Computing.pdf</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph">Articles</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-relataly-com wp-block-embed-relataly-com"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="5mNHJ6qyTd"><a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/">Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</a></blockquote><iframe loading="lazy" class="wp-embedded-content" sandbox="allow-scripts" security="restricted"  title="&#8220;Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture&#8221; &#8212; relataly.com" src="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/embed/#?secret=jTuO4a19jc#?secret=5mNHJ6qyTd" data-secret="5mNHJ6qyTd" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>
<p>The post <a href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/">Leveraging Distributed Computing for Weather Analytics with PySpark</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2739</post-id>	</item>
		<item>
		<title>Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</title>
		<link>https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/</link>
					<comments>https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Tue, 22 Mar 2022 22:56:52 +0000</pubDate>
				<category><![CDATA[PySpark]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[BigData Analytics]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=6324</guid>

					<description><![CDATA[<p>Apache Spark is an absolute powerhouse when it comes to open-source Big Data processing and analytics. It&#8217;s used all over the place for everything from data processing to machine learning to real-time stream processing. Thanks to its distributed architecture, it can parallelize workloads like nobody&#8217;s business, making it a lean, mean data processing machine when ... <a title="Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture" class="read-more" href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/" aria-label="Read more about Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture">Read more</a></p>
<p>The post <a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/">Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Apache Spark is an absolute powerhouse when it comes to open-source Big Data processing and analytics. It&#8217;s used all over the place for everything from data processing to machine learning to real-time stream processing. Thanks to its distributed architecture, it can parallelize workloads like nobody&#8217;s business, making it a lean, mean data processing machine when it comes to tackling those truly massive data sets.</p>



<p class="wp-block-paragraph">So, what&#8217;s the deal with Spark? In this article, we&#8217;re going to give you a rundown of its key concepts and architecture. First things first, we&#8217;ll provide a high-level overview of Spark, including its standout features. We&#8217;ll even throw in a comparison to Apache Hadoop, another popular open-source Big Data processing engine, just for good measure.</p>



<p class="wp-block-paragraph">From there, we&#8217;ll dive into the nitty-gritty of Spark&#8217;s architecture. You&#8217;ll learn all about how it divides up tasks and data among a cluster of machines, all optimized for top-notch performance when dealing with massive workloads. This is the stuff you need to know to get the most out of Spark, and we&#8217;re going to deliver it straight to you in plain language that anyone can understand.</p>



<p class="wp-block-paragraph">By the end of this article, you&#8217;ll be well-equipped to get started with Apache Spark, armed with a solid understanding of its core concepts and architecture. You&#8217;ll know exactly how it works and how to use it to process and analyze those gargantuan data sets that would make other Big Data processing engines run for the hills. So let&#8217;s get started and unleash the power of Apache Spark!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">What Makes Apache Spark so Successful? </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Spark has been able to catch a lot of attention in recent years. Due to growing data volumes, more companies are using Apache Spark to handle their steadily increasing data. Companies that use Apache Spark include Amazon, Netflix, Disney, and even NASA. NASA uses Spark to process a data flow of many TB/s that comes from all of their earth monitoring instruments and space observatories around the globe (more on the NASA <a href="https://www.nasa.gov/directorates/heo/scan/services/networks/deep_space_network/about" target="_blank" rel="noreferrer noopener">webpage</a>). Pretty cool, right?</p>



<p class="wp-block-paragraph">There are various reasons for Spark&#8217;s success. One reason is scalability. Spark has been consistently optimized for parallel computation in scalable distributed systems. Its architecture scales with the workload, thus allowing it to handle different workloads efficiently. This is important as companies want to scale their usage of resources in the cloud depending on use case-specific needs.</p>



<p class="wp-block-paragraph">Another reason is Spark&#8217;s versatility of use. Spark supports diverse workloads, including batch processing and streaming, machine learning, and SQL analytics. In addition, Spark supports various data types and storage formats, further increasing its versatility and use in different domains. The unique combination of several features and capabilities makes Apache Spark an exciting piece of technology and creates a strong demand for Spark knowledge in the market.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/" target="_blank" rel="noreferrer noopener">Leveraging Distributed Computing for Weather Analytics with PySpark</a> </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized is-style-default"><img decoding="async" data-attachment-id="5939" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/image-123/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/11/image.png" data-orig-size="300,168" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/11/image.png" src="https://www.relataly.com/wp-content/uploads/2021/11/image.png" alt="Apache Spark is a powerful library for processing big data in a distributed fashion" class="wp-image-5939" width="252" height="141"/><figcaption class="wp-element-caption">Apache Spark &#8211; one of the best open-source engines for Big Data processing and analysis</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Spark Features</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Spark was initially launched in 2009 by Matei Zaharia at UC Berkeley&#8217;s AMPLab. It was later released as open-source under a BSD license in 2010. Since 2013, Spark has been part of the Apache Software Foundation, which currently (2022) lists it as a top-priority project. Key features of the Spark ecosystem are:</p>



<ul class="wp-block-list">
<li><strong>In-Memory Processing</strong>: Spark is designed for in-memory computations to avoid writing data to disk and achieve high speed.</li>



<li><strong>Open Source</strong>: The development and promotion of Spark are driven by a growing community of practitioners and researchers.</li>



<li><strong>Reusability:</strong> Spark promotes code reuse through standard functions and method piling. </li>



<li><strong>Fault Tolerance:</strong> Spark comes with built-in fault tolerance, making Spark resilient against outtakes of cluster nodes. </li>



<li><strong>Real-time Streaming:</strong> Spark provides enhanced capabilities for real-time streaming of big data.</li>



<li><strong>Compatibility:</strong> Spark integrates well with existing big-data technologies such as Hadoop and offers support for various programming languages, incl. R, Python, Scala, and Java. </li>



<li><strong>Built-in Optimization:</strong> Spark has a built-in optimization engine. It is based on the lazy evaluation paradigm and uses acyclic-directed graphs (DAG) to create execution plans that optimize transformations.</li>



<li><strong>Analytics-ready:</strong> Spark provides powerful libraries for querying structured data inside Spark, using either SQL or DataFrame API. In addition, Spark comes with ML Flow, an integrated library for machine learning that can execute large ML workloads in memory. </li>
</ul>



<p class="wp-block-paragraph">Spark combines these features and provides a one-stop-shop solution, thus making it extremely powerful and versatile.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph">. </p>



<h2 class="wp-block-heading" id="h-local-vs-distributed-computing">Local vs. Distributed Computing</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">To understand how Spark works, it&#8217;s helpful to compare a local system to a distributed computing system. While a local system can handle smaller workloads, it has several limitations when it comes to Big Data processing. </p>



<p class="wp-block-paragraph">One limitation is the number of cores. Modern CPUs typically have at least two processing cores, but the maximum number of cores in a single machine is around 16 to 32. This means that the number of cores does not scale, which is a problem when working with large data sets.</p>



<p class="wp-block-paragraph">Another limitation is the amount of RAM that a single computer can handle. While it&#8217;s theoretically possible to increase RAM up to a Terrabyte (TB), this is not cost-efficient. Additionally, running extensive computations on a single machine can be risky, as a single error can require starting the entire job over again.</p>



<p class="wp-block-paragraph">To overcome these limitations, it makes sense to move toward a distributed architecture. In a distributed system, a network of machines can work on several tasks in parallel. This is where Spark shines, as it is designed to run best on a computing cluster.</p>



<p class="wp-block-paragraph">By leveraging a distributed architecture, Spark can parallelize workloads and optimize performance to handle massive data sets. With a network of machines working together, Spark can tackle complex computations in a robust and efficient way. In summary, the move to a distributed architecture is critical for overcoming the limitations of a local system and unlocking the full potential of Spark.</p>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="581" data-attachment-id="6725" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-4-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-4.png" data-orig-size="1494,847" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-4.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-4-1024x581.png" alt="Local System vs Distributed System, Spark Tutorial" class="wp-image-6725" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 1494w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">A local system, in comparison with a distributed system</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Cluster Computations</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">When it comes to Big Data, you want to parallelize as much as possible, as this can significantly reduce the time for processing workloads. If we speak about processing Terrabytes or Petabytes of data, we need to distribute the workload among a whole cluster of computers with hundreds or even thousands of cores. Here we run into another limitation: parallelization requires us to distribute tasks among cluster nodes, split the data, and store smaller chunks of data in each node. </p>



<p class="wp-block-paragraph">Distributing the data across a cluster means transferring the data to the machines and either writing it to the hard disk or storing it in the Random Access Memory (RAM). Both choices have their pros and cons. While disk space is typically cheap, write operations are generally slow and thus expensive in terms of performance. Keeping the data in memory (RAM) is much faster. However, RAM is typically expensive. Compared to disk, there is less RAM space available to store the data, which requires breaking them up into a more significant number of smaller partitions.</p>



<h2 class="wp-block-heading">Spark vs. Hadoop</h2>



<p class="wp-block-paragraph">Cluster computing frameworks existed long before Apache Spark. For example, Apache Hadoop was developed long before Apache Spark and used by many enterprises to process large workloads for some time.</p>



<h3 class="wp-block-heading">A Matter of Persistence</h3>



<p class="wp-block-paragraph">Hadoop distributes the data across the nodes of a computing cluster. It then runs local calculations on the cluster nodes and collects the results. For this purpose, Hadoop relies on MapReduce, which persists the data to a distributed file system. An example is the <a href="https://databricks.com/de/glossary/hadoop-distributed-file-system-hdfs" target="_blank" rel="noreferrer noopener">Hadoop Distributed File System (HDFS)</a>. The advantage is that users don&#8217;t have to worry about task distributions or fault tolerance. Hadoop handles all of this under the hood. </p>



<p class="wp-block-paragraph">The downside of MapReduce is that reads and writes take place on the local table storage, which is typically slow. However, modern analytics use cases often rely on the swift processing of large data volumes. Or in the case of machine learning and ad-hoc analytics, they require frequent iterative changes to the data. For these use cases, Spark provides a faster solution. Apache spark overcomes the costly persistence of data to the local table storage by keeping it in the RAM of the worker nodes. Spark will only spill data to the disk if the memory is full. Keeping data in memory can improve the performance of large workloads by an order of magnitude. The speed advantage can be as much as 100x.</p>



<h3 class="wp-block-heading">When to Use Spark?</h3>



<p class="wp-block-paragraph">Spark&#8217;s in-memory approach offers a significant advantage if speed is crucial and the data fits into RAM. Especially when use cases require real-time analytics, Spark offers superior performance over Hadoop MapReduce. Such cases are, for example:</p>



<ul class="wp-block-list">
<li><strong>Credit card fraud detection</strong>: Analysis of large volumes of transactions in real-time.</li>



<li><strong>Predictive Maintenance: </strong>Prediction of sensor data anomalies and potential machine failures with real-time analysis of IoT sensor data from machines and production equipment.</li>
</ul>



<p class="wp-block-paragraph">However, Spark is not always the best solution. For example, Hadoop may still be a great choice when processing speed is not critical. Sometimes the data may be too big for in-memory storage, even on a large computing cluster. In such cases, using Hadoop on a cluster optimized for this purpose can be more cost-effective. So, there are also cases where Hadoop offers advantages over Spark.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-spark-architecture">Spark Architecture</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The secret to Spark&#8217;s performance is massive parallelization. Parallelization is achieved by decomposing a complex application into smaller jobs executed in parallel. Understanding this process requires looking at the distributed cluster upon which Spark will run its jobs. </p>



<p class="wp-block-paragraph">A Spark cluster is a collection of virtual machines that allows us to run different workloads, including ETL and machine learning. Each computer typically has its RAM and multiple cores, allowing the system to speed up computations massively. Of course, cluster sizes may vary, but the general architecture remains generic.</p>



<p class="wp-block-paragraph">A spark computing cluster has a driver node and several worker nodes. The driver node contains a driver program that sends tasks via a cluster manager to the worker nodes. Each worker node has an executer and holds only a fraction of the total data in its cache. </p>



<p class="wp-block-paragraph">After job completion, the worker nodes send back the results, where the driver program aggregates them. This way, Spark avoids storing the data in the local disk storage and achieves swift execution. </p>



<p class="wp-block-paragraph">Another advantage of this architecture is easy scaling. In a cluster, you can add machines to the network to increase computational power or remove them again when they are not needed.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="6366" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-15-11/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/image-15.png" data-orig-size="1445,974" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-15" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/image-15.png" src="https://www.relataly.com/wp-content/uploads/2022/03/image-15-1024x690.png" alt="Interplay between the different components of a Spark computing cluster, Spark Tutorial" class="wp-image-6366" width="417" height="281" srcset="https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 1445w" sizes="(max-width: 417px) 100vw, 417px" /><figcaption class="wp-element-caption">The interplay between the different components of a Spark computing cluster<br>Source: based on https://spark.apache.org/docs/latest/cluster-overview.html</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading">Spark Cluster Components</h3>



<p class="wp-block-paragraph">Clusters consist of a single driver node, multiple worker nodes, and several other key components:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading">Driver Node</h4>



<p class="wp-block-paragraph">The driver node has the role of an orchestrator in the computing cluster. It runs a java process that decomposes complex transformations (transformation and action) into the smallest units of work called tasks. It distributes them to the worker nodes for parallel execution as part of an execution plan (more on this topic later). </p>



<p class="wp-block-paragraph">The driver program also launches the workers who read the input data from a distributed file system, decompose it into smaller partitions, and persist them in memory.</p>



<p class="wp-block-paragraph">Spark has been designed to optimize task planning and execution automatically. In addition, the driver creates the Spark context (and Spark Session) that provides access to and deployment of cluster information, functions, and configurations.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading">Worker Node</h4>



<p class="wp-block-paragraph">Worker nodes are the entities that get the actual work done. Each worker hosts an executor on which the individual tasks run. Workers have a fixed number of executors that contain free slots for several tasks. </p>



<p class="wp-block-paragraph">Workers also have their dedicated cache memory to store the data during processing. However, when the data is too big to store in memory, the executor will spill it onto the disk. The number of slots and ram available to store the data depends on the cluster configuration and its underlying hardware setup.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading">Cluster Manager</h4>



<p class="wp-block-paragraph">A cluster manager is the core process in Spark responsible for launching executors and managing the optimal distribution of tasks and data across a distributed computation cluster. </p>



<p class="wp-block-paragraph">Spark supports different disk storage formats, including HDSF and Cassandra. The storage format is essential because it describes how Spark distributes the data across a network. For example, HDFS partitions the data into blocks of 128 MB and replicates these blocks three times &#8211; in this way, creating duplicate data to achieve fault tolerance. The fact that Spark can handle different storage formats makes it very flexible.</p>
</div>
</div>



<h4 class="wp-block-heading" id="h-executor">Executor</h4>



<p class="wp-block-paragraph">The executors are processes launched in coordination with the cluster manager. Once the executor is launched, it runs as long as the Spark application runs. They are responsible for executing tasks and returning the results to the driver. Each executor holds a fraction (a Spark Partition) of the data to be processed. The executor persists the data in the cache memory of the worker node unless the cache is full. </p>



<h2 class="wp-block-heading">The Resilient Distributed Dataset (RDD) </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The backbone of Apache Spark is the Resilient Distributed Dataset (RDD). The RDD is a distributed memory abstraction that provides functions for in-memory computations on large computing clusters. </p>



<p class="wp-block-paragraph">Imagine you want to carry out a transformation on a set of historical temperature data. Running this on Spark will create an RDD. The RDD partitions the data and distributes the partitions to different worker nodes in the cluster.</p>



<p class="wp-block-paragraph">The RDD has four main features:</p>



<ul class="wp-block-list">
<li><strong>Partitioning</strong>: An RDD is a distributed collection of data elements partitioned across nodes in a cluster. Partitions are sets of records stored on one physical machine in the cluster.</li>



<li><strong>Resilience:</strong> RDDs achieve fault tolerance against failures of individual nodes through redundant data replication. </li>



<li><strong>Interface</strong>: The RDD provides a low-level API that allows performing transformations and tasks parallel to the data.</li>



<li><strong>Immutability:</strong> An RDD is an immutable replication of data, which means that changing the data requires creating a new RDD. </li>
</ul>



<p class="wp-block-paragraph">Since Spark 2.0, the Spark syntax orients towards the DataFrame object. As a result, you won&#8217;t interact with the RDD object anymore when writing code. However, the RDD is still how Spark stores data in memory and the RDD continues to excel under the hood.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="6722" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-3.png" data-orig-size="1773,1440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-3.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-3-1024x832.png" alt="Distribution of Partitioned Data across a Spark Cluster with RDD, Apache Spark Tutorial" class="wp-image-6722" width="413" height="335" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 1773w" sizes="(max-width: 413px) 100vw, 413px" /><figcaption class="wp-element-caption">Partitioning and distributing data across nodes in the process of creating an RDD</figcaption></figure>
</div></div>
</div>



<h3 class="wp-block-heading">Performance Optimization</h3>



<p class="wp-block-paragraph">Processing data in distributed systems leads to an increased need for optimization. Spark already takes care of several optimizations, and new Spark versions constantly get better at improving performance. However, there are still cases where it is necessary to finetune things manually. </p>



<p class="wp-block-paragraph">Essential concepts that allow Spark to distribute the computation cluster&#8217;s work efficiently are &#8220;lazy code evaluation&#8221; and &#8220;data shuffling.&#8221; </p>



<h3 class="wp-block-heading">Lazy Code Evaluation</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">An essential pillar of Spark is the so-called lazy evaluation of transformations. The underlying premise is that Spark does not directly modify the data but only when the user calls an action. </p>



<p class="wp-block-paragraph">Examples of transformations are join, select, groupBy, or partitionBy. When the user calls an &#8220;action,&#8221; the execution of transformations creates a new RDD. In contrast, actions collect results from the worker nodes and return them to the driver node. Examples of actions are &#8220;first,&#8221; &#8220;max&#8221;, &#8220;count&#8221;, or &#8220;collect&#8221;. For a complete list of transformations and actions, visit the Spark website.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sparkDF.withColumn('split_text', F.explode(F.split(F.lower('Word'), ''))) \
         .filter(&quot;split_text != ''&quot;) \
         .groupBy(&quot;split_text&quot;, &quot;language&quot;)\
         .pivot(&quot;split_text&quot;) \
         .agg(count(&quot;split_text&quot;)
     ).drop(&quot;Word_new&quot;, &quot;Word&quot;, &quot;split_text&quot;).orderBy([&quot;language&quot;], ascending=[0, 0]).na.fill(0)
</pre></div>
</div>
</div>



<p class="wp-block-paragraph">Spark can look at all transformations from a holistic perspective by lazily evaluating the code and carrying out optimizations. Before executing the transformation, Spark will generate an execution plan that considers all the operations and beneficially arranges them. </p>



<p class="wp-block-paragraph">It is worth mentioning that Spark supports methods piling so that programmers can group multiple transformations without worrying about their order. The code example on the right illustrates this approach for the Python-specific API of Spark called PySpark.</p>



<h3 class="wp-block-heading">Shuffle Operations</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Shuffle is a technique for re-distributing data in a computation cluster. After a shuffle operation, the partitions are grouped differently among the cluster nodes. The illustration demonstrates this procedure. </p>



<h4 class="wp-block-heading">An Expensive Operation</h4>



<p class="wp-block-paragraph">Shuffle typically requires copying data between executors and machines, thus making it an expensive operation. It requires data transfer between worker nodes, resulting in disk I/O and network-I/O. For this reason, we should avoid shuffle operations, if possible. However, several common transformations will invoke shuffle. Examples are join-operations, repartition-operations, and groupBy-operations. These operations will not work without shuffling the data because they often require a complete dataset. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="6720" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-2.png" data-orig-size="1092,924" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-2-1024x866.png" alt="Spark Shuffle Operation Between Two Worker Nodes, Apache Spark Tutorial" class="wp-image-6720" width="407" height="344" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 1092w" sizes="(max-width: 407px) 100vw, 407px" /><figcaption class="wp-element-caption">Spark shuffle operation between two worker nodes</figcaption></figure>
</div></div>
</div>



<h4 class="wp-block-heading">Avoiding Shuffle</h4>



<p class="wp-block-paragraph">Spark has several built-in features that help Spark avoid shuffling. However, some of these activities, such as scanning large datasets, will also cost performance. If the user already knows the scope and structure of the data, they should provide this information directly to Spark. Spark offers various parameters in its APIs for this purpose.</p>



<p class="wp-block-paragraph">Avoiding shuffle operations may require manual optimization. A typical example is working with data sets that differ significantly in size (skewed data). For example, when doing join operations between a large and a small dataset, we can avoid shuffle operations by providing each worker node with a full copy of the smaller dataset. For this purpose, Spark has a broadcast function. The broadcast function copies a dataset to all worker nodes in the cluster, reducing the need to exchange partitions between nodes.</p>



<p class="wp-block-paragraph">There are many more topics around optimizations. But since it&#8217;s a complex topic, I&#8217;ll cover it in a separate article.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">This article has introduced Apache Spark, a modern framework for processing and analyzing Big Data on distributed systems. We have covered the basics of Spark&#8217;s architecture and how Spark distributes data and tasks among a computation cluster. Finally, we have looked at how Spark optimizes computations in distributed systems and why it is often necessary to improve things manually. </p>



<p class="wp-block-paragraph">Now that you have a decent understanding of the essential concepts behind Spark, we can gather hands-on experience. The following article will work with PySpark, the Python-specific interface for Apache Spark. In this tutorial, we will process and analyze weather data using PySpark. </p>



<p class="wp-block-paragraph"><a href="https://www.relataly.com/analyzing-zurich-weather-data-with-python-and-pyspark/2739/">Analyzing Zurich Weather Data with PySpark</a></p>



<p class="wp-block-paragraph">Thanks for reading, and if you have any questions, please let me know in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3gcImKf" target="_blank" rel="noreferrer noopener">Wenig Damji Das and Lee (2020) Learning Spark: Lightning-fast Data Analytics</a></li><li><a href="https://amzn.to/3exFyHb" target="_blank" rel="noreferrer noopener">Chambers and Zaharu (2018) Spark: The Definitive Guide: Big data processing made simple</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">In-Memory Cluster Computing.pdf</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/">Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6324</post-id>	</item>
		<item>
		<title>Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python</title>
		<link>https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/</link>
					<comments>https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 01 Jun 2020 17:20:46 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Keras]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Recurrent Neural Networks]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Tensorflow]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Multivariate Models]]></category>
		<category><![CDATA[Stock Market Prediction]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=1815</guid>

					<description><![CDATA[<p>Regression models based on recurrent neural networks (RNN) can recognize patterns in time series data, making them an exciting technology for stock market forecasting. What distinguishes these RNNs from traditional neural networks is their architecture. It consists of multiple layers of long-term, short-term memory (LSTM). These LSTM layers allow the model to learn patterns in ... <a title="Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python" class="read-more" href="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/" aria-label="Read more about Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/">Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Regression models based on recurrent neural networks (RNN) can recognize patterns in time series data, making them an exciting technology for stock market forecasting. What distinguishes these RNNs from traditional neural networks is their architecture. It consists of multiple layers of long-term, short-term memory (LSTM). These LSTM layers allow the model to learn patterns in a time series that occur over different periods and are often difficult for human analysts to detect. We can train such models with one feature (<a href="https://www.relataly.com/univariate-stock-market-forecasting-using-a-recurrent-neural-network/122/" target="_blank" rel="noreferrer noopener">univariate forecasting models</a>) or multiple features (multivariate models). Multivariate Models can take more data into account, and if we provide them with relevant features, they can make better predictions. This tutorial uses Python and Keras to implement a multivariate RNN for stock price prediction. We define the architecture of our regression model and then train this model to predict the NASDAQ index.</p>



<p class="wp-block-paragraph">The remainder of this tutorial proceeds in two parts: We start with a brief intro in which we compare modeling univariate and multivariate time series data. Then we turn to the hands-on part, in which we prepare the multivariate time series data and use it to train a neural network in Python. The model is a recurrent neural network with LSTM layers that forecasts the NASDAQ stock market index. Finally, we evaluate the performance of our model and make a forecast for the next day.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<div class="wp-block-kadence-infobox kt-info-box_317393-a1"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-infobox-textcontent"><h2 class="kt-blocks-info-box-title">Disclaimer</h2><p class="kt-blocks-info-box-text">This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.</p></div></span></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large"><img decoding="async" width="1024" height="1024" data-attachment-id="12505" data-permalink="https://www.relataly.com/business-use-cases-for-openai-gpt-models-chatgpt-davinci/12200/blockchain-bull-cryptocurrencies-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="blockchain bull cryptocurrencies-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min-1024x1024.png" alt="" class="wp-image-12505" srcset="https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/blockchain-bull-cryptocurrencies-min.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Stock market forecasting has become an exciting application for recurrent neural networks.</figcaption></figure>
</div></div>
</div>



<h2 class="wp-block-heading" id="h-univariate-vs-multivariate-time-series-models">Univariate vs. Multivariate Time Series Models</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Multivariate models and univariate models differ in the number of their input features. While univariate models consider only a single feature, multivariate models use several input variables (features). In stock market forecasting, we can create additional features from price history. Examples are performance indicators such as moving averages, the RSI, or the Sales Volume. We can also include features from other sources, for example, social media sentiment, weather forecasts, etc. Multivariate models that have additional relevant information available have a chance to outperform univariate models. However, this is only true if the features are relevant and are indicative of future price movements.</p>



<p class="wp-block-paragraph">Preparing data for training univariate models is more straightforward than for multivariate models. If you are new to time series prediction, you might want to look at my earlier articles. These explain how to develop and evaluate univariate time series models:</p>



<ul class="wp-block-list">
<li><a href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/003%20Time%20Series%20Forecasting%20-%20Univariate%20Model%20using%20Recurrent%20Neural%20Networks.ipynb" target="_blank" rel="noreferrer noopener">Stock Market Forecasting using Univariate Models and Python</a></li>



<li><a href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/005%20Time%20Series%20Forecasting%20-%20Multi-step%20Rolling%20Forecasting.ipynb" target="_blank" rel="noreferrer noopener">Multi-step Time Series Forecasting with Python: Step-by-Step Guide</a></li>



<li><a href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/004%20Time%20Series%20Forecasting%20-%20Adjusting%20Prediction%20Intervals.ipynb" target="_blank" rel="noreferrer noopener">Stock Market Prediction – Adjusting Time Series Prediction Intervals</a></li>



<li><a href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/009%20Time%20Series%20Forecasting%20-%20Measuring%20Model%20Performance.ipynb" target="_blank" rel="noreferrer noopener">Evaluating Time Series Forecasting Models with Python</a></li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h4 class="wp-block-heading" id="h-univariate-prediction-models">Univariate Prediction Models</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In time series regression, the standard approach is to train a model using past values from the time series that need to be predicted. The assumption is that the value of a time series at time t is closely related to the previous time steps t-1, t-2, t-3, and so on. This approach is similar to chart analysis, which involves identifying patterns in a price chart that can indicate future movements. Both approaches rely on the ability to identify recurring patterns in the data and make accurate predictions based on them. The performance of the model or analysis depends on the ability to identify these patterns and draw the right conclusions from them.</p>



<p class="wp-block-paragraph">Several techniques can be used to improve the performance of time series regression models, including feature engineering, hyperparameter optimization, and ensemble methods. In addition to these techniques, it is also important to carefully evaluate the performance of the model using appropriate metrics, such as mean squared error or mean absolute error, and to continuously monitor the model&#8217;s performance to ensure it remains accurate over time.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7379" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/image-19-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-19.png" data-orig-size="768,530" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-19" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-19.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-19.png" alt="univariate time series modelling, recurrent neural networks, keras, python, tutorials, stock market prediction" class="wp-image-7379" width="347" height="240" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-19.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-19.png 300w" sizes="(max-width: 347px) 100vw, 347px" /><figcaption class="wp-element-caption">Univariate Time Series Prediction</figcaption></figure>
</div>
</div>



<h4 class="wp-block-heading" id="h-multivariate-prediction-models">Multivariate Prediction Models</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Predicting the price of a financial asset is a challenging task due to the numerous variables that can influence it, including economic cycles, political events, unforeseen occurrences, psychological factors, market sentiment, and even the weather. These variables are often interdependent, which makes statistical modeling even more complex. While multivariate models can take into account several factors, they are still a simplification of reality and may not fully capture the complexity of the market. On the other hand, univariate models only consider a single dependent variable, ignoring the other dimensions.</p>



<p class="wp-block-paragraph">Even with good features, predicting financial prices can be difficult because patterns and market rules may change frequently. As a result, models may make mistakes. However, as <a href="https://en.wikipedia.org/wiki/All_models_are_wrong" target="_blank" rel="noreferrer noopener">Georg Box</a> famously said, &#8220;All models are wrong, but some are useful.&#8221; Despite their limitations, multivariate models can provide a more detailed representation of reality compared to univariate models, and can still be useful in forecasting financial prices.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="7378" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/image-15/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-15.png" data-orig-size="1076,524" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-15" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-15.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-15-1024x499.png" alt="multivariate time series modelling, recurrent neural networks, keras, python, tutorials, stock market prediction" class="wp-image-7378" width="356" height="174" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-15.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-15.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-15.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-15.png 1076w" sizes="(max-width: 356px) 100vw, 356px" /><figcaption class="wp-element-caption">Multivariate Time Series Prediction</figcaption></figure>
</div>
</div>



<div style="height:8px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-implementing-a-multivariate-time-series-prediction-model-in-python">Implementing a Multivariate Time Series Prediction Model in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Now that we have a solid understanding of multivariate time series forecasting, it&#8217;s time to put our knowledge into practice by building a model using Python and TensorFlow. Specifically, we will create a multivariate recurrent neural network (RNN) to predict the NASDAQ stock market index. RNNs are well-suited for time series forecasting because they can process sequential data, considering the dependencies between past and future events. </p>



<p class="wp-block-paragraph">To build our RNN model, we will need to go through several essential steps. </p>



<ol class="wp-block-list">
<li>Creating features and scaling: This involves loading and preparing the time series data for modeling, including selecting the time period and relevant features and scaling the data.</li>



<li>Splitting the data: We split the data into train and test sets.</li>



<li>Sliding window approach: The time series data is sliced into mini-batches using the sliding window approach.</li>



<li>Model design and training: The appropriate architecture for the RNN model is chosen, and an optimization algorithm is used to adjust the model&#8217;s weights and biases to minimize prediction error.</li>



<li>Model validation and predictions: We evaluate the model&#8217;s performance by comparing the predicted values to the actual values of the NASDAQ index. In addition, we will use the model to make predictions about future events. </li>



<li>Unscaling the predictions: The predictions are unscaled to bring them back to their original scale.</li>
</ol>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_332b7d-0a"><a class="kb-button kt-button button kb-btn_46b0bc-a4 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/01%20Time%20Series%20Forecasting%20%26%20Regression/007%20Multivariate%20Time%20Series%20Forecasting%20with%20RNNs.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_559184-98 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly Github Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="7446" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/image-20/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-20.png" data-orig-size="1807,1911" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-20" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-20.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-20-968x1024.png" alt="Six essential steps of training a multivariate recurrent neural network for time series prediction, stock market forecasting, Python, Keras, splitting, slicing, RNN architecture, multivariate time series modelling" class="wp-image-7446" width="735" height="777" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-20.png 968w, https://www.relataly.com/wp-content/uploads/2022/04/image-20.png 284w, https://www.relataly.com/wp-content/uploads/2022/04/image-20.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-20.png 1452w, https://www.relataly.com/wp-content/uploads/2022/04/image-20.png 1807w" sizes="(max-width: 735px) 100vw, 735px" /><figcaption class="wp-element-caption">Six Essential Steps for Developing a Multivariate Time Series Model</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before starting the coding part, make sure that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment and required packages. If you don&#8217;t have a Python environment, follow the steps in&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><a href="https://docs.python.org/3/library/math.html" target="_blank" rel="noreferrer noopener">math</a></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using <em><a href="https://keras.io/" target="_blank" rel="noreferrer noopener">Keras&nbsp;</a></em>(2.0 or higher) with <a href="https://www.tensorflow.org/" target="_blank" rel="noreferrer noopener"><em>Tensorflow</em> </a>backend, the machine learning library <a href="https://scikit-learn.org/stable/" target="_blank" rel="noreferrer noopener">sci-kit-learn</a>, and the <a href="https://pandas-datareader.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener">pandas-DataReader</a>. </p>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-step-1-load-the-time-series-data">Step #1 Load the Time Series Data</h3>



<p class="wp-block-paragraph">Let&#8217;s start by loading price data on the <a href="https://en.wikipedia.org/wiki/Nasdaq" target="_blank" rel="noreferrer noopener">NASDAQ</a> composite index <strong>(symbol: ^IXIC)</strong> from <a href="https://finance.yahoo.com/" target="_blank" rel="noreferrer noopener">yahoo.finance.com</a> into our Python project. To download the data, we use <a href="https://pandas-datareader.readthedocs.io/en/latest/" target="_blank" rel="noreferrer noopener">Pandas DataReader</a> &#8211; a popular Python library that provides functions to extract data from various sources on the web. Alternatively, you can also use the &#8220;yfinance&#8221; library.</p>



<p class="wp-block-paragraph">We provide the technical symbol for the NASDAQ index, &#8220;^IXIC.&#8221; Alternatively, you could use other asset symbols, for example, BTC-USD, to get price quotes for Bitcoin. In addition, we limit the data in the API request to the timeframe between 2010-01-01 and the current date.</p>



<p class="wp-block-paragraph">Running the code below will load the data into a new DataFrame object. Be aware that input data and predictions will vary depending on when you execute the code.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Time Series Forecasting - Multivariate Time Series Models for Stock Market Prediction

import math # Mathematical functions 
import numpy as np # Fundamental package for scientific computing with Python
import pandas as pd # Additional functions for analysing and manipulating data
from datetime import date, timedelta, datetime # Date Functions
from pandas.plotting import register_matplotlib_converters # This function adds plotting functions for calender dates
import matplotlib.pyplot as plt # Important package for visualization - we use this to plot the market data
import matplotlib.dates as mdates # Formatting dates
import tensorflow as tf
from sklearn.metrics import mean_absolute_error, mean_squared_error # Packages for measuring model performance / errors
from tensorflow.keras import Sequential # Deep learning library, used for neural networks
from tensorflow.keras.layers import LSTM, Dense, Dropout # Deep learning classes for recurrent and regular densely-connected layers
from tensorflow.keras.callbacks import EarlyStopping # EarlyStopping during model training
from sklearn.preprocessing import RobustScaler, MinMaxScaler # This Scaler removes the median and scales the data according to the quantile range to normalize the price data 
import seaborn as sns # Visualization
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# check the tensorflow version and the number of available GPUs
print('Tensorflow Version: ' + tf.__version__)
physical_devices = tf.config.list_physical_devices('GPU')
print(&quot;Num GPUs:&quot;, len(physical_devices))

# Setting the timeframe for the data extraction
end_date =  date.today().strftime(&quot;%Y-%m-%d&quot;)
start_date = '2010-01-01'

# Getting NASDAQ quotes
stockname = 'NASDAQ'
symbol = '^IXIC'

# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=start_date, end=end_date, data_source=&quot;yahoo&quot;)

import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=start_date, end=end_date)

# Create a quick overview of the dataset
df.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Tensorflow Version: 2.5.0
Num GPUs: 1
[*********************100%***********************]  1 of 1 completed
			Open		High		Low			Close		Adj Close	Volume
Date						
2009-12-31	2292.919922	2293.590088	2269.110107	2269.149902	2269.149902	1237820000
2010-01-04	2294.409912	2311.149902	2294.409912	2308.419922	2308.419922	1931380000
2010-01-05	2307.270020	2313.729980	2295.620117	2308.709961	2308.709961	2367860000
2010-01-06	2307.709961	2314.070068	2295.679932	2301.090088	2301.090088	2253340000
2010-01-07	2298.090088	2301.300049	2285.219971	2300.050049	2300.050049	2270050000</pre></div>



<p class="wp-block-paragraph">The data looks as expected and has the following columns:</p>



<ul class="wp-block-list">
<li>High &#8211; the daily high</li>



<li>Low &#8211; the daily low</li>



<li>Open &#8211; the opening price</li>



<li>Close &#8211; the closing price </li>



<li>Volume &#8211; the daily trading volume</li>



<li>Adj Close &#8211; the adjacent closing price</li>
</ul>



<h3 class="wp-block-heading" id="h-step-2-explore-the-data">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">Let&#8217;s first familiarize ourselves with the data before processing them further. Line plots are an excellent choice to gain a quick overview of time series data. By running the code below, we loop over the columns to plot a line chart for each column of the dataframe. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot line charts
df_plot = df.copy()

ncols = 2
nrows = int(round(df_plot.shape[1] / ncols, 0))

fig, ax = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, figsize=(14, 7))
for i, ax in enumerate(fig.axes):
        sns.lineplot(data = df_plot.iloc[:, i], ax=ax)
        ax.tick_params(axis=&quot;x&quot;, rotation=30, labelsize=10, length=0)
        ax.xaxis.set_major_locator(mdates.AutoDateLocator())
fig.tight_layout()
plt.show()</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="1000" height="496" data-attachment-id="11767" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/image-18/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-18.png" data-orig-size="1000,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-18" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-18.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-18.png" alt="chart of the NASDAQ index created with python" class="wp-image-11767" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-18.png 1000w, https://www.relataly.com/wp-content/uploads/2022/12/image-18.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-18.png 768w" sizes="(max-width: 1000px) 100vw, 1000px" /></figure>



<p class="wp-block-paragraph">The line plots look as expected. We continue with preprocessing and feature engineering. </p>



<h3 class="wp-block-heading" id="h-step-3-feature-selection-and-scaling">Step #3 Feature Selection and Scaling</h3>



<p class="wp-block-paragraph">Before we can train the neural network, we need to transform the data into a processable shape. In this section, we perform the following tasks:</p>



<ul class="wp-block-list">
<li>Selecting features</li>



<li>Scaling the data to a standard value range</li>
</ul>



<h4 class="wp-block-heading">3.1 Selecting Features</h4>



<p class="wp-block-paragraph">First, we will select the features upon which we want to train our neural network. The selection and engineering of relevant feature variables is a complex topic. We could also create additional features such as moving averages, but I want to keep things simple. Therefore, we select features that are already present in our data. To learn more about feature engineering for stock market prediction, check out the<a href="https://www.relataly.com/feature-engineering-for-multivariate-time-series-models-with-python/1813/" target="_blank" rel="noreferrer noopener"> relataly feature engineering tutorial</a>.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="1886" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/feature-selection/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png" data-orig-size="1596,759" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Feature-Selection" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png" src="https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection-1024x487.png" alt="Illustration how we preprocess the stock market data, before we use them to train a multivariate time series regression model" class="wp-image-1886" width="695" height="330" srcset="https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png 1024w, https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png 300w, https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png 768w, https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png 1536w, https://www.relataly.com/wp-content/uploads/2020/05/Feature-Selection.png 1596w" sizes="(max-width: 695px) 100vw, 695px" /><figcaption class="wp-element-caption">Feature Selection of Multivariate Time Series Models</figcaption></figure>



<p class="wp-block-paragraph">Running the code below selects the features. We add a dummy column to our record called &#8220;Predictions,&#8221; which will help us later when we need to reverse the scaling of our data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Indexing Batches
train_df = df.sort_values(by=['Date']).copy()

# List of considered Features
FEATURES = ['High', 'Low', 'Open', 'Close', 'Volume'
            #, 'Month', 'Year', 'Adj Close'
           ]

print('FEATURE LIST')
print([f for f in FEATURES])

# Create the dataset with features and filter the data to the list of FEATURES
data = pd.DataFrame(train_df)
data_filtered = data[FEATURES]

# We add a prediction column and set dummy values to prepare the data for scaling
data_filtered_ext = data_filtered.copy()
data_filtered_ext['Prediction'] = data_filtered_ext['Close']

# Print the tail of the dataframe
data_filtered_ext.tail()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FEATURE LIST
['High', 'Low', 'Open', 'Close', 'Volume']
			High			Low				Open			Close			Volume		Prediction
Date						
2022-05-09	11990.610352	11574.940430	11923.030273	11623.250000	5911380000	11623.250000
2022-05-10	11944.940430	11566.280273	11900.339844	11737.669922	6199090000	11737.669922
2022-05-11	11844.509766	11339.179688	11645.570312	11364.240234	6120860000	11364.240234
2022-05-12	11547.330078	11108.759766	11199.250000	11370.959961	6647400000	11370.959961
2022-05-13	11856.709961	11510.259766	11555.969727	11805.000000	5868610000	11805.000000</pre></div>



<h4 class="wp-block-heading" id="h-3-2-scaling-the-multivariate-input-data">3.2 Scaling the Multivariate Input Data</h4>



<p class="wp-block-paragraph">Another necessary step in data preparation for neural networks is scaling the input data. Scaling will increase training times and improve model accuracy. The scikit-learn package offers different scaling approaches. We use the MinMaxScaler to scale the input data to a range between 0 and 1.</p>



<p class="wp-block-paragraph">A model that is trained on scaled data will also produce scaled predictions. Therefore, when we make predictions later with our model, we must not forget to scale the predictions back. The scaler_model will adapt to the shape of the data (6-dimensional). However, our predictions will be one-dimensional. Because the scaler has a fixed input shape, we cannot simply reuse it for unscaling our model predictions. To unscale the predictions later, we create an additional scaler that works on a single feature column (scaler_pred).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the number of rows in the data
nrows = data_filtered.shape[0]

# Convert the data to numpy values
np_data_unscaled = np.array(data_filtered)
np_data = np.reshape(np_data_unscaled, (nrows, -1))
print(np_data.shape)

# Transform the data by scaling each feature to a range between 0 and 1
scaler = MinMaxScaler()
np_data_scaled = scaler.fit_transform(np_data_unscaled)

# Creating a separate scaler that works on a single column for scaling predictions
scaler_pred = MinMaxScaler()
df_Close = pd.DataFrame(data_filtered_ext['Close'])
np_Close_scaled = scaler_pred.fit_transform(df_Close)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Out: (2619, 6)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-transforming-the-multivariate-data">Step #4 Transforming the Multivariate Data</h3>



<p class="wp-block-paragraph">Next, we train our multivariate regression model based on a three-dimensional data structure. The first dimension is the sequences, the second dimension is the time steps (mini-batches), and the third dimension is the features. The illustration below shows the steps to bring the multivariate data into a shape our neural model can process during training. We must keep this form and perform the same steps when using the model to create a forecast. </p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">An essential step in the preparation process is slicing the data into multiple input data sequences with associated target values. We write a simple Python script that uses a &#8220;sliding window.&#8221; This approach moves a window through the time series data, adding a sequence of multiple data points to the input data with each step. The target value (e.g., Closing Price) follows this sequence, and we store it in a separate target dataset. Then we push the window one step further and repeat these activities. This process results in a data set with many input sequences (mini-batches), each with a corresponding target value in the target record. This process applies both to the training and the test data.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="1943" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/image-4-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/06/image-4.png" data-orig-size="852,422" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/06/image-4.png" src="https://www.relataly.com/wp-content/uploads/2020/06/image-4.png" alt="Sliding window approach to partition multivariate data for time series forecasting" class="wp-image-1943" width="469" height="232" srcset="https://www.relataly.com/wp-content/uploads/2020/06/image-4.png 852w, https://www.relataly.com/wp-content/uploads/2020/06/image-4.png 300w, https://www.relataly.com/wp-content/uploads/2020/06/image-4.png 768w" sizes="(max-width: 469px) 100vw, 469px" /><figcaption class="wp-element-caption">Sliding Window</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph">We will apply the sliding window approach to our data. The result is a training set (x_train) containing 2258 input sequences, each with 50 steps and six features. The related target dataset (y_train) has 2258 target values. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Set the sequence length - this is the timeframe used to make a single prediction
sequence_length = 50

# Prediction Index
index_Close = data.columns.get_loc(&quot;Close&quot;)

# Split the training data into train and train data sets
# As a first step, we get the number of rows to train the model on 80% of the data 
train_data_len = math.ceil(np_data_scaled.shape[0] * 0.8)

# Create the training and test data
train_data = np_data_scaled[0:train_data_len, :]
test_data = np_data_scaled[train_data_len - sequence_length:, :]

# The RNN needs data with the format of [samples, time steps, features]
# Here, we create N samples, sequence_length time steps per sample, and 6 features
def partition_dataset(sequence_length, data):
    x, y = [], []
    data_len = data.shape[0]
    for i in range(sequence_length, data_len):
        x.append(data[i-sequence_length:i,:]) #contains sequence_length values 0-sequence_length * columsn
        y.append(data[i, index_Close]) #contains the prediction values for validation,  for single-step prediction
    
    # Convert the x and y to numpy arrays
    x = np.array(x)
    y = np.array(y)
    return x, y

# Generate training data and test data
x_train, y_train = partition_dataset(sequence_length, train_data)
x_test, y_test = partition_dataset(sequence_length, test_data)

# Print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# Validate that the prediction value and the input match up
# The last close price of the second input sample should equal the first prediction value
print(x_train[1][sequence_length-1][index_Close])
print(y_train[0])</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(2474, 50, 5) (2474,)
(630, 50, 5) (630,)
0.02049456793614579
0.02049456793614579</pre></div>



<h3 class="wp-block-heading" id="h-step-5-train-the-multivariate-prediction-model">Step #5 Train the Multivariate Prediction Model</h3>



<p class="wp-block-paragraph">Once we have the data prepared and ready, we can train our model. The architecture of our neural network consists of the following four layers:</p>



<ul class="wp-block-list">
<li>An LSTM layer, which takes our mini-batches as input and returns the whole sequence</li>



<li>Another LSTM layer that takes the sequence from the previous layer but only returns five values</li>



<li>Dense layer with five neurons</li>



<li>A final dense layer that outputs the predicted value</li>
</ul>



<p class="wp-block-paragraph">The number of neurons in the first layer must equal the size of a minibatch of the input data. Each minibatch in our dataset consists of a matrix with 50 steps and six features. Thus, the input layer of our recurrent neural network consists of 300 neurons. Keeping this architecture in mind is essential because, later, we need to bring the data into the same shape when we want to predict a new dataset. Running the code below creates the model architecture and compiles the model.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Configure the neural network model
model = Sequential()

# Model with n_neurons = inputshape Timestamps, each with x_train.shape[2] variables
n_neurons = x_train.shape[1] * x_train.shape[2]
print(n_neurons, x_train.shape[1], x_train.shape[2])
model.add(LSTM(n_neurons, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2]))) 
model.add(LSTM(n_neurons, return_sequences=False))
model.add(Dense(5))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mse')</pre></div>



<p class="wp-block-paragraph">Running the code below starts the training process.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Training the model
epochs = 50
batch_size = 16
early_stop = EarlyStopping(monitor='loss', patience=5, verbose=1)
history = model.fit(x_train, y_train, 
                    batch_size=batch_size, 
                    epochs=epochs,
                    validation_data=(x_test, y_test)
                   )
                    
                    #callbacks=[early_stop])</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Output exceeds the size limit. Open the full output data in a text editor
Epoch 1/50
155/155 [==============================] - 6s 19ms/step - loss: 6.7374e-04 - val_loss: 0.0011
Epoch 2/50
155/155 [==============================] - 2s 13ms/step - loss: 6.6207e-05 - val_loss: 8.4700e-04
Epoch 3/50
155/155 [==============================] - 2s 14ms/step - loss: 5.0667e-05 - val_loss: 6.6467e-04
Epoch 4/50
155/155 [==============================] - 2s 14ms/step - loss: 5.8446e-05 - val_loss: 6.8575e-04
Epoch 5/50
155/155 [==============================] - 2s 14ms/step - loss: 4.8430e-05 - val_loss: 8.4892e-04
Epoch 6/50
155/155 [==============================] - 2s 14ms/step - loss: 7.1283e-05 - val_loss: 8.2255e-04
Epoch 7/50
155/155 [==============================] - 2s 15ms/step - loss: 6.0554e-05 - val_loss: 8.0583e-04
Epoch 8/50
155/155 [==============================] - 2s 15ms/step - loss: 5.5977e-05 - val_loss: 4.5830e-04
Epoch 9/50
155/155 [==============================] - 2s 15ms/step - loss: 4.2453e-05 - val_loss: 6.1866e-04
Epoch 10/50
155/155 [==============================] - 2s 14ms/step - loss: 3.5722e-05 - val_loss: 4.5288e-04
Epoch 11/50
155/155 [==============================] - 2s 14ms/step - loss: 4.1409e-05 - val_loss: 8.5975e-04
Epoch 12/50
155/155 [==============================] - 3s 18ms/step - loss: 7.0007e-05 - val_loss: 5.0300e-04
Epoch 13/50
...
Epoch 49/50
155/155 [==============================] - 2s 13ms/step - loss: 2.7064e-05 - val_loss: 2.8202e-04
Epoch 50/50
155/155 [==============================] - 2s 13ms/step - loss: 2.9009e-05 - val_loss: 2.5486e-04</pre></div>



<p class="wp-block-paragraph">Let&#8217;s take a quick look at the loss curve. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Plot training &amp; validation loss values
fig, ax = plt.subplots(figsize=(16, 5), sharex=True)
sns.lineplot(data=history.history[&quot;loss&quot;])
plt.title(&quot;Model loss&quot;)
plt.ylabel(&quot;Loss&quot;)
plt.xlabel(&quot;Epoch&quot;)
ax.xaxis.set_major_locator(plt.MaxNLocator(epochs))
plt.legend([&quot;Train&quot;, &quot;Test&quot;], loc=&quot;upper left&quot;)
plt.grid()
plt.show()</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="521" data-attachment-id="5051" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/image-74-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-74.png" data-orig-size="1188,604" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-74" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-74.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-74-1024x521.png" alt="Loss curve after training the recurrent neural network for stock market prediction" class="wp-image-5051" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-74.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-74.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-74.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-74.png 1188w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">The loss drops quickly to a lower plateau, which signals that the model has improved throughout the training process. </p>



<h3 class="wp-block-heading" id="h-step-6-evaluate-model-performance">Step #6 Evaluate Model Performance</h3>



<p class="wp-block-paragraph">Once we have trained the neural network regression model, we want to measure its performance. As mentioned in section 3, we first have to reverse the scaling of the predictions. Afterward, we calculate different error metrics, MAE, MAPE, and MDAPE. Then we will compare the predictions in a line plot with the actual values. For more information on measuring the performance of regression models, see <a href="https://www.relataly.com/regression-error-metrics-python/923/" target="_blank" rel="noreferrer noopener">this relataly article</a>.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the predicted values
y_pred_scaled = model.predict(x_test)

# Unscale the predicted values
y_pred = scaler_pred.inverse_transform(y_pred_scaled)
y_test_unscaled = scaler_pred.inverse_transform(y_test.reshape(-1, 1))

# Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test_unscaled, y_pred)
print(f'Median Absolute Error (MAE): {np.round(MAE, 2)}')

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(y_test_unscaled, y_pred)/ y_test_unscaled)) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Median Absolute Error (MAE): 175.28
Mean Absolute Percentage Error (MAPE): 1.48 %
Median Absolute Percentage Error (MDAPE): 1.16 %</pre></div>



<p class="wp-block-paragraph">The MAPE is 22.15, which means that the mean of our predictions deviates from the actual values by 3.12%. The MDAPE is 2.88 % and a bit lower than the mean, thus indicating there are some outliers among the prediction errors. 50% of the predictions deviate by more than 2.88%, and 50% differ by less than 2.88% from the actual values.</p>



<p class="wp-block-paragraph">Next, we create a line plot showing the forecast and compare it to the actual values. Adding a bar plot to the chart helps highlight the deviations of the predictions from the actual values. Running the code below creates the line plot.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># The date from which on the date is displayed
display_start_date = &quot;2019-01-01&quot; 

# Add the difference between the valid and predicted prices
train = pd.DataFrame(data_filtered_ext['Close'][:train_data_len + 1]).rename(columns={'Close': 'y_train'})
valid = pd.DataFrame(data_filtered_ext['Close'][train_data_len:]).rename(columns={'Close': 'y_test'})
valid.insert(1, &quot;y_pred&quot;, y_pred, True)
valid.insert(1, &quot;residuals&quot;, valid[&quot;y_pred&quot;] - valid[&quot;y_test&quot;], True)
df_union = pd.concat([train, valid])

# Zoom in to a closer timeframe
df_union_zoom = df_union[df_union.index &gt; display_start_date]

# Create the lineplot
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title(&quot;y_pred vs y_test&quot;)
plt.ylabel(stockname, fontsize=18)
sns.set_palette([&quot;#090364&quot;, &quot;#1960EF&quot;, &quot;#EF5919&quot;])
sns.lineplot(data=df_union_zoom[['y_pred', 'y_train', 'y_test']], linewidth=1.0, dashes=False, ax=ax1)

# Create the bar plot with the differences
df_sub = [&quot;#2BC97A&quot; if x &gt; 0 else &quot;#C92B2B&quot; for x in df_union_zoom[&quot;residuals&quot;].dropna()]
ax1.bar(height=df_union_zoom['residuals'].dropna(), x=df_union_zoom['residuals'].dropna().index, width=3, label='residuals', color=df_sub)
plt.legend()
plt.show()</pre></div>



<figure class="wp-block-image size-full"><img decoding="async" width="964" height="492" data-attachment-id="8621" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/output-2-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/output-2.png" data-orig-size="964,492" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/output-2.png" src="https://www.relataly.com/wp-content/uploads/2022/05/output-2.png" alt="line plot that shows the stock market forecast that we have generated with the multivariate time series model, python tutorial" class="wp-image-8621" srcset="https://www.relataly.com/wp-content/uploads/2022/05/output-2.png 964w, https://www.relataly.com/wp-content/uploads/2022/05/output-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/output-2.png 768w" sizes="(max-width: 964px) 100vw, 964px" /></figure>



<p class="wp-block-paragraph">The line plot shows that the forecast is close to the actual values but partially deviates from it. The deviations between actual values and predictions are called residuals. For our mode, they seem to be most significant during periods of increased market volatility and least during periods of steady market movement, which makes sense because sudden movements are generally more difficult to predict.</p>



<h3 class="wp-block-heading" id="h-step-7-predict-the-next-day-s-price">Step #7 Predict the Next Day&#8217;s Price</h3>



<p class="wp-block-paragraph">After training the neural network, we want to forecast the stock market for the next day. For this purpose, we extract a new dataset from the Yahoo-Finance API and preprocess it as we did for model training. </p>



<p class="wp-block-paragraph">We trained our model with mini-batches of 50-steps and six features. Thus, we must also provide the model with 50-steps when making the forecast. As before, we transform the data into the shape of 1 x 50 x 6, whereby the last figure is the number of feature columns. After generating the forecast, we unscale the stock market predictions back to the original range of values. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">df_temp = df[-sequence_length:]
new_df = df_temp.filter(FEATURES)

N = sequence_length

# Get the last N day closing price values and scale the data to be values between 0 and 1
last_N_days = new_df[-sequence_length:].values
last_N_days_scaled = scaler.transform(last_N_days)

# Create an empty list and Append past N days
X_test_new = []
X_test_new.append(last_N_days_scaled)

# Convert the X_test data set to a numpy array and reshape the data
pred_price_scaled = model.predict(np.array(X_test_new))
pred_price_unscaled = scaler_pred.inverse_transform(pred_price_scaled.reshape(-1, 1))

# Print last price and predicted price for the next day
price_today = np.round(new_df['Close'][-1], 2)
predicted_price = np.round(pred_price_unscaled.ravel()[0], 2)
change_percent = np.round(100 - (price_today * 100)/predicted_price, 2)

plus = '+'; minus = ''
print(f'The close price for {stockname} at {end_date} was {price_today}')
print(f'The predicted close price is {predicted_price} ({plus if change_percent &gt; 0 else minus}{change_percent}%)')</pre></div>



<p class="wp-block-paragraph">The close price for NASDAQ on 2021-06-27 was 14360.39. The predicted closing price is 14232.8095703125 (-0.9%) </p>



<div style="height:31px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">This tutorial has shown multivariate time series modeling for stock market prediction in Python. We trained a neural network regression model for predicting the NASDAQ index. Before training our model, we performed several steps to prepare the data. The steps included splitting the data and scaling them. In addition, we created and tested various new features from the original time series data to account for the multivariate modeling approach. You now have the knowledge and code to conduct further experiments with the features of your choice.  </p>



<p class="wp-block-paragraph">Multivariate time series forecasting is a complex topic. You might want to take the time to retrace the different steps. Especially the transformation of the data can be challenging. The best way to learn is to practice. Therefore I encourage you to develop more time series models and experiment with other data sources.</p>



<p class="wp-block-paragraph">Another interesting approach to stock market prediction uses candlestick images and convolutional neural networks. If this topic interests you, check out the following article: <a href="http://deep%20reinforcement%20learning%20stock%20market%20trading%2C%20utilizing%20a%20cnn%20with%20candlestick%20imageshttps//journals.plos.org/plosone/article?id=10.1371/journal.pone.0263181" target="_blank" rel="noreferrer noopener">Deep reinforcement learning stock market trading, utilizing a CNN with candlestick images</a></p>



<p class="wp-block-paragraph">I am always trying to learn and improve. If you want to give feedback or have remarks, feel free to share them in the comments.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="512" data-attachment-id="12779" data-permalink="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png" data-orig-size="1024,1024" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="scatterplot python machine learning relataly midjourney lineplot financial data-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png" src="https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min-512x512.png" alt="Stockmarket forecasting with a neural network is about identifying meaningful patterns, but there is no guarantee that these patterns are present. Image created with Midjourney.  " class="wp-image-12779" srcset="https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png 512w, https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png 300w, https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png 140w, https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png 768w, https://www.relataly.com/wp-content/uploads/2020/06/scatterplot-python-machine-learning-relataly-midjourney-lineplot-financial-data-min.png 1024w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Stockmarket forecasting with a neural network is about identifying meaningful patterns. Be aware that there is no guarantee that these patterns are present in the data. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.  </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3MyU6Tj" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2018) Neural Networks and Deep Learning</a></li><li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/">Stock Market Prediction using Multivariate Time Series and Recurrent Neural Networks in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/stock-market-prediction-using-multivariate-time-series-in-python/1815/feed/</wfw:commentRss>
			<slash:comments>17</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1815</post-id>	</item>
	</channel>
</rss>
