<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Clustering Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/machine-learning/cluster-analysis-python/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/machine-learning/cluster-analysis-python/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Thu, 01 Jun 2023 18:39:23 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Clustering Archives - relataly.com</title>
	<link>https://www.relataly.com/category/machine-learning/cluster-analysis-python/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Building &#8220;Chat with your Data&#8221; Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore</title>
		<link>https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/</link>
					<comments>https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sat, 27 May 2023 13:25:08 +0000</pubDate>
				<category><![CDATA[ChatBots]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Healthcare]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Language Generation]]></category>
		<category><![CDATA[Logistics]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[OpenAI]]></category>
		<category><![CDATA[Prompt Engineering]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Sentiment Analysis]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Vector Databases]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=13687</guid>

					<description><![CDATA[<p>Artificial Intelligence (AI), in particular, the advent of OpenAI&#8217;s ChatGPT, has revolutionized how we interact with technology. Chatbots powered by this advanced language model can engage users in intricate, natural language conversations, marking a significant shift in AI capabilities. However, one thing that ChatGPT isn&#8217;t designed for is integrating personalized or proprietary knowledge – it&#8217;s ... <a title="Building &#8220;Chat with your Data&#8221; Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore" class="read-more" href="https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/" aria-label="Read more about Building &#8220;Chat with your Data&#8221; Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore">Read more</a></p>
<p>The post <a href="https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/">Building &#8220;Chat with your Data&#8221; Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Artificial Intelligence (AI), in particular, the advent of OpenAI&#8217;s ChatGPT, has revolutionized how we interact with technology. Chatbots powered by this advanced language model can engage users in intricate, natural language conversations, marking a significant shift in AI capabilities. However, one thing that ChatGPT isn&#8217;t designed for is integrating personalized or proprietary knowledge – it&#8217;s built to draw upon general knowledge, not specifics about you or your organization. That&#8217;s where the concept of Retrieval Augmented Generation (RAG) comes into play. This article explores the exciting prospect of building your own ChatGPT that lets users ask questions on a custom knowledge base.</p>



<p class="wp-block-paragraph">In this tutorial, we&#8217;ll unveil the mystery behind enterprise ChatGPT, guiding you through the process of creating your very own custom ChatGPT &#8211; an AI-powered chatbot based on OpenAI&#8217;s powerful Generative Pretrained Transformers (GPT) technology. We&#8217;ll use Python and delve into the world of <a href="https://www.relataly.com/vector-databases-the-rising-star-in-generative-ai-infrastructure/13599/" target="_blank" rel="noreferrer noopener">vector databases</a>, specifically, <a href="https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/introduction" target="_blank" rel="noreferrer noopener">Mongo API for Azure Cosmos DB</a>, to show you how you can make a large knowledgebase available to ChatGPT that can go way beyond the <a href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them" target="_blank" rel="noreferrer noopener">typical token limitation of GPT models</a>.</p>



<p class="wp-block-paragraph">For experts, AI fans, or tech newbies, this guide simplifies building your ChatGPT. With clear instructions, useful examples, and tips, we aim to make it informative and empowering.</p>



<p class="wp-block-paragraph">We&#8217;ll explore AI, showing you how to customize your chatbot. We&#8217;ll simplify complex concepts and show you how to start your AI adventure from home or office. Ready to start this exciting journey? Keep reading!</p>



<p class="wp-block-paragraph">Also: </p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/how-to-build-a-twitter-news-bot-with-openai-and-newsapi/13581/" target="_blank" rel="noreferrer noopener">How to Build a Twitter News Bot with OpenAI ChatGPT and NewsAPI in Python</a></li>



<li><a href="https://www.relataly.com/from-pirates-to-nobleman-simulating-conversations-between-openais-chatgpt-and-itself-using-python/13525/" target="_blank" rel="noreferrer noopener">From Pirates to Nobleman: Simulating Conversations between Various Characters using OpenAI’s ChatGPT and Python</a></li>
</ul>



<h2 class="wp-block-heading">Note on the use of Vector DBs and Costs.</h2>



<p class="wp-block-paragraph">Please note that this tutorial describes a business use case that utilizes a Cosmos DB for Mongo DB vCore hosted on the Azure cloud. </p>



<p class="wp-block-paragraph">Alternatively, you can set up an open-source vector database on your local machine, such as Milvus. Be aware that certain code adjustments will be necessary to proceed with the open-source alternative. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-why-custom-chatgpt-is-so-powerful-and-versatile">Why Custom ChatGPT is so Powerful and Versatile</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">I believe we have all tested ChatGPT, and probably like me, you have been impressed by its remarkable capabilities. However, ChatGPT has a significant limitation: it can only answer questions and perform tasks based on the public knowledge base it was trained on. </p>



<p class="wp-block-paragraph">Imagine having a chatbot based on ChatGPT that communicates effectively and truly understands the nuances of your business, sector, or even a particular topic of interest. That&#8217;s the power of a custom ChatGPT. A tailor-made chatbot allows for specialized conversations, providing the needed information and drawing from a unique database you&#8217;ve developed. </p>



<p class="wp-block-paragraph">This becomes particularly beneficial in industries with specific terminologies or when you have a large database of knowledge that you want to make easily accessible and interactive. A custom ChatGPT, with its personalized and relevant responses, ensures a better user experience, effectively saving time and increasing productivity. </p>



<p class="wp-block-paragraph">Let&#8217;s delve into how to build such a solution. Spoiler it does not work by putting all the content into the prompt. But there is a great alternative. </p>



<h2 class="wp-block-heading">Understanding the Building Blocks of Custom ChatGPT with Retrieval Augmented Generation</h2>



<p class="wp-block-paragraph">The foundational technology behind ChatGPT is OpenAI&#8217;s Generative Pre-trained Transformer models (GPT). These models understand language by predicting the next word in a sentence and are trained on a diverse range of internet text. However, the GPT models, such as the GPT-3.5, have a limitation of processing 4096 tokens at a time. A token in this context is a chunk of text which can be as small as one character or as long as one word. For example, the phrase &#8220;ChatGPT is great&#8221; is four tokens long.</p>



<p class="wp-block-paragraph">Another challenge with Foundation Models such as ChatGPT is that they are trained on large-scale datasets that were available at the time of their training. This means they are not aware of any data created after their training period. Also, because they&#8217;re trained on broad, general-domain datasets, they may be less effective for tasks requiring domain-specific knowledge.</p>



<h3 class="wp-block-heading">How Retrieval Augmented Generation (RAG) Helps </h3>



<p class="wp-block-paragraph">Retrieval-Augmented Generation (RAG) is a method that combines the strength of transformer models with external knowledge to augment their understanding and applicability. Here&#8217;s a brief explanation:</p>



<p class="wp-block-paragraph">To address this, RAG retrieves relevant information from an external data source and uses this information to augment the input to the foundation model. This can make the model&#8217;s responses more informed and relevant.</p>



<h3 class="wp-block-heading">Data Sources</h3>



<p class="wp-block-paragraph">The external data can come from various sources like databases, document repositories, or APIs. To make this data compatible with the RAG approach, both the data and user queries are converted into numerical representations (embeddings) using language models.</p>



<h3 class="wp-block-heading">Data Preparation as Embeddings</h3>



<p class="wp-block-paragraph">The embeddings, which are essentially vectors, need to be stored in a database that&#8217;s efficient at storing and searching through these high-dimensional data. This is where Azure&#8217;s Cosmos Mongo DB comes into play. It&#8217;s a vector search database specifically designed for this task.</p>



<p class="wp-block-paragraph">To circumvent the token limitation and make your extensive data available to ChatGPT, we turn the data into embeddings. These are mathematical representations of your data, converting words, sentences, or documents into vectors. The advantage of using embeddings is that they capture the semantic meaning of the text, going beyond keywords to understand the context. In essence, similar information will have similar vectors, allowing us to cluster related information together and separate them from a semantically different text.</p>



<h3 class="wp-block-heading">Storing the Data in Vector Databases</h3>



<p class="wp-block-paragraph">The embeddings, which are essentially vectors, need to be stored in a database that&#8217;s efficient at storing and searching through these high-dimensional data. This is where Azure&#8217;s Cosmos Mongo DB comes into play. It&#8217;s a vector search database specifically designed for this task.</p>



<h3 class="wp-block-heading">Matching Queries to Knowledge</h3>



<p class="wp-block-paragraph">The RAG model compares the embeddings of user queries with those in the knowledge base to identify relevant information. The user&#8217;s original query is then augmented with context from similar documents in the knowledge base.</p>



<h3 class="wp-block-heading">Input to the Foundation Model</h3>



<p class="wp-block-paragraph">This augmented input is sent to the foundation model, enhancing its understanding and response quality.</p>



<h3 class="wp-block-heading">Updates</h3>



<p class="wp-block-paragraph">Importantly, the knowledge base and associated embeddings can be updated asynchronously, ensuring that the model remains up-to-date even as new information is added to the data sources.</p>



<p class="wp-block-paragraph">In sum, RAG extends the utility of foundation models by incorporating external, up-to-date, domain-specific knowledge into their understanding and output.</p>



<p class="wp-block-paragraph">By incorporating these components, you&#8217;ll be creating a robust custom ChatGPT that not only understands the user&#8217;s queries but also has access to your own information, giving it the ability to respond with precision and relevance. </p>



<p class="wp-block-paragraph">Ready to dive into the technicalities? Stay tuned!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="512" height="277" data-attachment-id="13775" data-permalink="https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/flo7up_a_vector_database_colorful_popart_with_an_ai_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-copy-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min.png" data-orig-size="1432,776" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min.png" src="https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min-512x277.png" alt="A tailor-made chatbot allows for specialized conversations, providing the exact information needed, drawing from a unique database that you've developed. " class="wp-image-13775" srcset="https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min.png 768w, https://www.relataly.com/wp-content/uploads/2023/05/Flo7up_a_vector_database_colorful_popart_with_an_AI_robot_worki_46d21322-5bd9-49f0-b1a7-b7b1a17536d5-Copy-min.png 1432w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">A tailor-made chatbot allows for specialized conversations, providing the exact information needed, drawing from a unique database that you&#8217;ve developed. </figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<h2 class="wp-block-heading">Building the Custom &#8220;Chat with Your Data&#8221; App in Python</h2>



<p class="wp-block-paragraph">Now that we&#8217;ve discussed the theory behind building a custom ChatGPT and seen some exciting real-world applications, it&#8217;s time to put our knowledge into action! In this practical segment of our guide, we&#8217;re going to demonstrate how you can build a custom ChatGPT solution using Python.</p>



<p class="wp-block-paragraph">Our project will involve storing a sample PDF document in Cosmos Mongo DB and developing a chatbot capable of answering questions based on the content of this document. This practical exercise will guide you through the entire process, including turning your PDF content into embeddings, storing these embeddings in the Cosmos Mongo DB, and finally integrating it all with ChatGPT to build an interactive chatbot.</p>



<p class="wp-block-paragraph">If you&#8217;re new to Python, don&#8217;t worry, we&#8217;ll be breaking down the code and explaining each step in a straightforward manner. Let&#8217;s roll up our sleeves, fire up our Python environments, and get coding! Stay tuned as we embark on this exciting hands-on journey into the world of custom chatbots.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_c8e02b-b1"><a class="kb-button kt-button button kb-btn_022d60-c9 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/300%20Distributed%20Computing%20-%20Analyzing%20Zurich%20Weather%20Data%20using%20PySpark.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_8db802-ce kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">How to Set Up Vector Search in Cosmos DB</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">First, you must understand that you will need a database to store the embeddings. It does not necessarily have to be a vector database. Still, this type of database will make your solution more performant and robust, particularly when you want to store large amounts of data.</p>



<p class="wp-block-paragraph"><a href="https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/" target="_blank" rel="noreferrer noopener">Azure Cosmos DB for MongoDB vCore</a> is the first MongoDB-compatible offering to feature Vector Search. With this feature, you can store, index, and query high-dimensional vector data directly in Azure Cosmos DB for MongoDB vCore, eliminating the need for data transfer to alternative platforms for vector similarity search capabilities. Here are the steps to set it up:</p>



<ol class="wp-block-list">
<li><strong>Choose Your Azure Cosmos DB Architecture:</strong> Azure Cosmos DB for MongoDB provides two types of architectures, RU-based and vCore-based. Each has its strengths and is best suited for certain types of applications. Choose the one that best fits your needs. If you&#8217;re looking to lift and shift existing MongoDB apps and run them as-is on a fully supported managed service, the vCore-based option could be the perfect fit.</li>



<li><strong>Configure Your Vector Search:</strong> Once your database architecture is set up, you can integrate your AI-based applications, including those using OpenAI embeddings, with your data already stored in Cosmos DB.</li>



<li><strong>Build and Deploy Your AI Application:</strong> With the Vector Search set up, you can now build your AI application that takes advantage of this feature. You can create a Go app using Azure Cosmos DB for MongoDB or deploy Azure Cosmos DB for MongoDB vCore using a Bicep template as suggested next steps.</li>
</ol>



<p class="wp-block-paragraph">Azure Cosmos DB for MongoDB vCore&#8217;s Vector Search feature is a game-changer for AI application development. It enables you to unlock new insights from your data, leading to more accurate and powerful applications.</p>



<h2 class="wp-block-heading">Cosmos DB for Mongo DB Usage Models</h2>



<p class="wp-block-paragraph">Regarding <a href="https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/choose-model" target="_blank" rel="noreferrer noopener">Cosmos DB for Mongo DB</a>, there are two options to choose from: Request Unit (RU) Database Account and vCore Cluster. Each option follows a different pricing model to suit diverse needs.</p>



<p class="wp-block-paragraph">The Request Unit (RU) Database Account operates on a pay-per-use basis. With this model, you are billed based on the number of requests and the level of provisioned throughput consumed by your workload.</p>



<p class="wp-block-paragraph">As of 27th Mai 2023, <a href="https://devblogs.microsoft.com/cosmosdb/introducing-vector-search-in-azure-cosmos-db-for-mongodb-vcore/" target="_blank" rel="noreferrer noopener">the brand new vector search function is only available for the vCore Cluster option</a>, which is why we will use this setup for this tutorial. The vCore Cluster offers a reserved managed instance. Under this option, you are charged a fixed amount on a monthly basis, providing more predictable costs for your usage.</p>



<p class="wp-block-paragraph">Once you have created your vCore instance, you must collect your connection string and make it available to your Python script. You can do this either by storing it in <a href="https://azure.microsoft.com/en-us/products/key-vault/">Azure Key Vault</a> (which I would recommend) or by storing it locally on your computer or in the code (which I would not recommend for obvious security reasons).</p>



<figure class="wp-block-image size-full"><img decoding="async" width="1536" height="588" data-attachment-id="13774" data-permalink="https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/image-7-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/05/image-7.png" data-orig-size="1536,588" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/05/image-7.png" src="https://www.relataly.com/wp-content/uploads/2023/05/image-7.png" alt="" class="wp-image-13774" srcset="https://www.relataly.com/wp-content/uploads/2023/05/image-7.png 1536w, https://www.relataly.com/wp-content/uploads/2023/05/image-7.png 300w, https://www.relataly.com/wp-content/uploads/2023/05/image-7.png 512w, https://www.relataly.com/wp-content/uploads/2023/05/image-7.png 768w" sizes="(max-width: 1237px) 100vw, 1237px" /><figcaption class="wp-element-caption">When it comes to Cosmos DB for Mongo DB, there are two options to choose from: Request Unit (RU) Database Account and vCore Cluster. </figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="334" height="512" data-attachment-id="13772" data-permalink="https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/image-5-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/05/image-5.png" data-orig-size="606,929" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/05/image-5.png" src="https://www.relataly.com/wp-content/uploads/2023/05/image-5-334x512.png" alt="Azure Cosmos DB for Mongo DB is a new offering that is specifically designed for vector use cases (incl. embeddings)" class="wp-image-13772" srcset="https://www.relataly.com/wp-content/uploads/2023/05/image-5.png 334w, https://www.relataly.com/wp-content/uploads/2023/05/image-5.png 196w, https://www.relataly.com/wp-content/uploads/2023/05/image-5.png 606w" sizes="(max-width: 334px) 100vw, 334px" /><figcaption class="wp-element-caption">Azure Cosmos DB for Mongo DB is a new offering that is designed explicitly for vector use cases (incl. embeddings)</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Using other Vector Databases</h2>



<p class="wp-block-paragraph">While Cosmos DB is a popular choice for vector databases, I would like to note that other options are available in the market. You can still benefit from this tutorial if you decide to utilize a different vector database, such as Pinncecone or Chroma. However, it is necessary to make code adjustments tailored to the APIs and functionalities of the specific vector database you choose.</p>



<p class="wp-block-paragraph">Specifically, you will need to modify the &#8220;insert embedding functions&#8221; and &#8220;similarity search functions&#8221; to align with the requirements and capabilities of your chosen vector database. These functions typically have variations that are specific to each vector database.</p>



<p class="wp-block-paragraph">By customizing the code according to your selected vector database&#8217;s API, you can successfully adapt the tutorial to suit your specific database choice. This allows you to leverage the principles and concepts this tutorial covers, regardless of the vector database you opt for.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/vector-databases-the-rising-star-in-generative-ai-infrastructure/13599/" target="_blank" rel="noreferrer noopener">Vector Databases: The Rising Star in Generative AI Infrastructure</a></p>



<h3 class="wp-block-heading">Prerequisites</h3>



<p class="wp-block-paragraph">Before diving into the code, it’s essential to ensure that you have the proper setup for your Python 3 environment and have installed all the necessary packages. If you do not have a Python environment, follow the instructions in&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a>&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. This will provide you with a robust and versatile environment well-suited for machine learning and data science tasks.</p>



<p class="wp-block-paragraph">In this tutorial, we will be working with several libraries:</p>



<ul class="wp-block-list">
<li>openai</li>



<li>pymongo</li>



<li>PyPDF2</li>



<li>dotenv</li>
</ul>



<p class="wp-block-paragraph">Should you decide to use <a href="https://azure.microsoft.com/en-us/products/key-vault/" target="_blank" rel="noreferrer noopener">Azure Key Vault</a>, then you also need the following Python libraries:</p>



<ul class="wp-block-list">
<li>azure-identity</li>



<li>azure-key-vault</li>
</ul>



<p class="wp-block-paragraph">You can install the OpenAI Python library using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install&nbsp;</em>openai</li>



<li><em>conda install&nbsp;</em>openai&nbsp;(if you are using the Anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading">Step #1 Authentification and DB Setup</h3>



<p class="wp-block-paragraph">Let&#8217;s start with the authentification and setup of the API keys. After making necessary imports, the code gets things read to connect to essential services &#8211; OpenAI and Cosmos DB &#8211; and makes sure it can access these services properly.</p>



<ol class="wp-block-list">
<li><strong>Fetching Credentials:</strong> The script starts by setting up a connection to a service called Azure Key Vault to retrieve some crucial credentials securely. These are like &#8220;passwords&#8221; that the script needs to access various resources.</li>



<li><strong>Setting Up AI Services:</strong> Then, it prepares to connect to two different AI services. One is a version that&#8217;s hosted by Azure, and the other is the standard, public version.</li>



<li><strong>Establishing Database Connection:</strong> Lastly, the script sets up a connection to a database service, specifically to a certain collection within the Cosmos DB database. The script also checks if the connection to the database was successful by sending a &#8220;ping&#8221; &#8211; if it receives a response, it knows the connection is good.</li>
</ol>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from azure.identity import AzureCliCredential
from azure.keyvault.secrets import SecretClient
import openai
import logging
import tiktoken
import pandas as pd
import pymongo
from dotenv import load_dotenv
load_dotenv()
# Set up the Azure Key Vault client and retrieve the Blob Storage account credentials
keyvault_name = ''
openaiservicename = ''
client = SecretClient(f&quot;https://{keyvault_name}.vault.azure.net/&quot;, AzureCliCredential())
print('keyvault service ready')
# AzureOpenAI Service
def setup_azureopenai():
    openai.api_key = client.get_secret('openai-api-key').value
    openai.api_type = &quot;azure&quot;
    openai.api_base = f'https://{openaiservicename}.openai.azure.com'
    openai.api_version = '2023-05-15'
    print('azure openai service ready')
# public openai service
def setup_public_openai():
    openai.api_key = client.get_secret('openai-api-key-public').value
    print('public openai service ready')
DB_NAME = &quot;hephaestus&quot;
COLLECTION_NAME = 'isocodes'
def setup_cosmos_connection():
    COSMOS_CLUSTER_CONNECTION_STRING = client.get_secret('cosmos-cluster-string').value
    cosmosclient = pymongo.MongoClient(COSMOS_CLUSTER_CONNECTION_STRING)
    db = cosmosclient[DB_NAME]
    collection = cosmosclient[DB_NAME][COLLECTION_NAME]
    # Send a ping to confirm a successful connection
    try:
        cosmosclient.admin.command('ping')
        print(&quot;Pinged your deployment. You successfully connected to MongoDB!&quot;)
    except Exception as e:
        print(e)
    return collection, db
setup_public_openai()
collection, db = setup_cosmos_connection()</pre></div>



<p class="wp-block-paragraph">Now we have set things up to interact with our Cosmos DB Mong DB vCore instance.</p>



<h3 class="wp-block-heading">Step #2 Functions for Populating the Vector DB</h3>



<p class="wp-block-paragraph">Next, we prepare and insert data into the database as embeddings. First, we prepare the content. The preparation process involves turning the text content into embeddings. Each embedding is a list of flats representing the meaning of a specific part of the text in a way the AI system can understand.</p>



<p class="wp-block-paragraph">We create the embeddings by sending text (for example, a paragraph of a document) to an OpenAI embedding model that returns the embedding. There are two options for using OpenAI: You can use the Azure OpenAI engine and deploy your own Ada embedding model. Alternatively, you can use the public OpenAI Ada embedding model. </p>



<p class="wp-block-paragraph">We&#8217;ll use the public OpenAI&#8217;s <a href="https://platform.openai.com/docs/guides/embeddings" target="_blank" rel="noreferrer noopener">text-embedding-ada-002</a>. Remember that the model is designed to return embeddings, not text. Model inference may incur costs based on the data processed. Refer to <a href="https://openai.com/pricing" target="_blank" rel="noreferrer noopener">OpenAI </a>or Azure OpenAI service for pricing details. </p>



<p class="wp-block-paragraph">Finally, the code inserts the prepared requests (which now include both the original text and the corresponding embeddings) into the database. The function returns the unique IDs assigned to these newly inserted items in the database. In this way, the code processes and stores the necessary information in the database for later use.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># prepare content for insertion into cosmos db
def prepare_content(text_content):
  embeddings = create_embeddings_with_openai(text_content)
  request = [
    {
    &quot;textContent&quot;: text_content, 
    &quot;vectorContent&quot;: embeddings}
  ]
  return request
# create embeddings
def create_embeddings_with_openai(input):
    #print('Generating response from OpenAI...')
    ###### uncomment for AzureOpenAI model usage and comment code below
    # embeddings = openai.Embedding.create( 
    #     engine='&lt;name of the embedding deployment &gt;', 
    #     input=input)[&quot;data&quot;][0][&quot;embedding&quot;]
    ###### public openai model usage and comment code above
    embeddings = openai.Embedding.create(
        model='text-embedding-ada-002', 
        input=input)[&quot;data&quot;][0][&quot;embedding&quot;]
    
    # Number of embeddings    
    # print(len(embeddings))
    return embeddings
# insert the requests
def insert_requests(text_input):
    request = prepare_content(text_input)
    return collection.insert_many(request).inserted_ids
# Creates a searchable index for the vector content
def create_index():
  
  # delete and recreate the index. This might only be necessary once.
  collection.drop_indexes()
  embedding_len = 1536
  print(f'creating index with embedding length: {embedding_len}')
  db.command({
    'createIndexes': COLLECTION_NAME,
    'indexes': [
      {
        'name': 'vectorSearchIndex',
        'key': {
          &quot;vectorContent&quot;: &quot;cosmosSearch&quot;
        },
        'cosmosSearchOptions': {
          'kind': 'vector-ivf',
          'numLists': 100,
          'similarity': 'COS',
          'dimensions': embedding_len
        }
      }
    ]
  })
# Resets the DB and deletes all values from the collection to avoid dublicates
#collection.delete_many({})</pre></div>



<h3 class="wp-block-heading">Step #3 Document Cracing and Populating the DB</h3>



<p class="wp-block-paragraph">The next step is to break down the PDF document into smaller chunks of text (in this case, &#8216;records&#8217;) and then process these records for future use. You can repeat this process for any document that you want to make available to OpenAI. </p>



<p class="wp-block-paragraph">You can use any PDF that you like as long as you it contains readable text (use OCR). For demo purposes, I will use a <a href="https://www.zh.ch/content/dam/zhweb/bilder-dokumente/themen/steuern-finanzen/steuern/quellensteuer/infobl%C3%A4tter/div_q_informationsblatt_qs_2021_EN.pdf" target="_blank" rel="noreferrer noopener">tax document from Zurich</a>. Put the document in the folder data/vector_db_data/ in your root folder and provide the name to the Python script. </p>



<p class="wp-block-paragraph">Want to read in many documents at once? If you want to insert many documents, read the pdf documents from the folder and use the names to populate a list. You can then surround the insert function with a for loop that iterates through the list of document names </p>



<h4 class="wp-block-heading">#3.1 Document Slicing Considerations </h4>



<p class="wp-block-paragraph">To convert a PDF into embeddings, the first step is to divide it into smaller content slices. The slicing process plays a crucial role as it affects the information provided to the OpenAI GPT model when answering user questions. If the slices are too large, the model may encounter token limitations. Conversely, if they are too small, the model may not receive sufficient content to answer the question effectively. It is important to strike a balance between the number of slices and their length to optimize the results, considering that the search process may yield multiple outcomes.</p>



<p class="wp-block-paragraph">There are several approaches to handle the slicing process. One option is to define the slices based on a specific number of sentences or paragraphs. Alternatively, you can iteratively slice the document, allowing for some overlap between the data in the vector database. This approach has the advantage of providing more precise information to answer questions, but it also increases the data volume in the vector database, which can impact speed and cost considerations.</p>



<h4 class="wp-block-heading">#3.2 Running the code below to crack a document and insert embeddings into the vector DB</h4>



<p class="wp-block-paragraph">Running the code below will first define a function that breaks text into separate paragraphs based on line breaks. Another function slices the PDF into records. Each record contains a certain number of sentences (the maximum is defined by the &#8216;max_sentences&#8217; value). We use a Python library called PyPDF2 to extract text from each page of the PDF and Python&#8217;s built-in regular expressions to split the text into sentences and paragraphs. Note that if you want to achieve better results, you could also use a professional document content extraction tool such as Azure form recognizer.</p>



<p class="wp-block-paragraph">The code then opens a specific PDF file (&#8216;zurich_tax_info_2023.pdf&#8217;) and slices it into records, each containing no more than a certain number of sentences (as defined by&#8217;max_sentences&#8217;). After that, the function inserts these records into the vector database. Finally, we print the count of documents in the database collection. This shows how many pieces of data are already stored in this specific part of the database.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># document cracking function to insert data from the excel sheet
def split_text_into_paragraphs(text):
    paragraphs = re.split(r'\n{2,}', text)
    return paragraphs
def slice_pdf_into_records(pdf_path, max_sentences):
    records = []
    
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        
        for page in reader.pages:
            text = page.extract_text()
            paragraphs = split_text_into_paragraphs(text)
            
            current_record = ''
            sentence_count = 0
            
            for paragraph in paragraphs:
                sentences = re.split(r'(?&lt;=[.!?])\s+', paragraph)
                
                for sentence in sentences:
                    current_record += sentence
                    
                    sentence_count += 1
                    
                    if sentence_count &gt;= max_sentences:
                        records.append(current_record)
                        current_record = ''
                        sentence_count = 0
                
                if sentence_count &lt; max_sentences:
                    current_record += ' '  # Add space between paragraphs
            
            # If there is remaining text after the loop, add it as a record
            if current_record:
                records.append(current_record)
    
    return records
# get file from root/data folder
pdf_path = '../data/vector_db_data/zurich_tax_info_2023.pdf'
max_sentences = 20  # Adjust the slice size as per your requirement
result = slice_pdf_into_records(pdf_path, max_sentences)
# print the length of result
print(f'{len(result)} vectors created with maximum {max_sentences} sentences each.')
# Print the sliced records
for i, record in enumerate(result):
    insert_requests(record)
    if i &lt; 5:
        print(record[0:100])
        print('-------------------')
create_index()
print(f'number of records in the vector DB: {collection.count_documents({})}')</pre></div>



<p class="wp-block-paragraph">After slicing the document and inserting the embeddings into the vector database, we can proceed with functions for similarity search and prompting. </p>



<h3 class="wp-block-heading">Step #4 Functions for Similarity Search and Prompts to ChatGPT</h3>



<p class="wp-block-paragraph">This section of code provides a set of functions to perform a vector search in the Cosmos DB, make a request to the ChatGPT 3.5 Turbo model for generating responses, and create prompts for the OpenAI model to use in generating those responses.</p>



<h4 class="wp-block-heading">#4.1 How the Search Part Works </h4>



<p class="wp-block-paragraph"><br>Allow me to provide a concise explanation of how the search process operates. We have now reached the stage where a user poses a question, and we utilize the OpenAI model to supply an answer, drawing from our vector database. Here, it&#8217;s vital to understand that the model transforms the question into embeddings and subsequently scours the knowledge base for similar embeddings that align with the information requested in the user&#8217;s prompt. </p>



<p class="wp-block-paragraph">The vector database yields the most suitable results and inserts them into another prompt tailored for ChatGPT. This model, distinct from the embedding model, generates text. Thus, the final interaction with the ChatGPT model incorporates both the user&#8217;s question and the results from the vector database, which are the most fitting responses to the question. This combination should ideally aid the model in providing the appropriate answer. Now, let&#8217;s turn our attention to the corresponding code.</p>



<h4 class="wp-block-heading">#4.2 Setting up the Functions for Vector Search</h4>



<p class="wp-block-paragraph">The vector_search function takes as input a query vector (representing a user&#8217;s question in vector form) and an optional parameter to limit the number of results. It then conducts a search in the Cosmos DB, looking for entries whose vector content is most similar to the query vector.</p>



<p class="wp-block-paragraph">Next, the openai_request function makes a request to OpenAI&#8217;s ChatGPT 3.5 Turbo model to generate a response. This function takes a formatted conversation history (or &#8216;prompt&#8217;) and sends it to the model, which then generates a response. The content of the generated response is then returned.</p>



<p class="wp-block-paragraph">The create_tweet_prompt function constructs the conversation history for the OpenAI model. This function takes the user&#8217;s question and a JSON object containing results from a database search and constructs a list of system and user messages. This list will then serve as the prompt for the ChatGPT model, instructing it to generate a response that answers the user&#8217;s question about tax, with the added guideline that the response should be in the same language as the question. The constructed prompt is then returned by the function.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Cosmos DB Vector Search API Command
def vector_search(vector_query, max_number_of_results=2):
  results = collection.aggregate([
    {
      '$search': {
        &quot;cosmosSearch&quot;: {
          &quot;vector&quot;: vector_query,
          &quot;path&quot;: &quot;vectorContent&quot;,
          &quot;k&quot;: max_number_of_results
        },
      &quot;returnStoredSource&quot;: True
      }
    }
  ])
  return results
# openAI request - ChatGPT 3.5 Turbo Model
def openai_request(prompt, model_engine='gpt-3.5-turbo'):
    completion = openai.ChatCompletion.create(model=model_engine, messages=prompt, temperature=0.2, max_tokens=500)
    return completion.choices[0].message.content
# define OpenAI Prompt for News Tweet
def create_prompt(user_question, result_json):
    instructions = f'You are an assistant that answers questions based on sources provided. \
    If the information is not in the provided source, you answer with &quot;I don\'t know&quot;. '
    task = f&quot;{user_question} Translate the response to english /n \
    source: {result_json}&quot;
    
    prompt = [{&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: instructions }, 
              {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: task }]
    return prompt</pre></div>



<p class="wp-block-paragraph">You can easily change the voice and tone in which the ChatGPT answers questions by including the respective instructions in the create_prompt function. </p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/chatgpt-style-guide-understanding-voice-and-tone-options-for-engaging-conversations/13065/">ChatGPT Style Guide: Understanding Voice and Tone Prompt Options for Engaging Conversations</a></p>



<h3 class="wp-block-heading">Step #5 Testing the Custom ChatGPT Solution</h3>



<p class="wp-block-paragraph">This part of the code works with the previous functions to facilitate a complete question-answering cycle with Cosmos DB and OpenAI&#8217;s ChatGPT 3.5 Turbo model.</p>



<p class="wp-block-paragraph">Now comes the most exciting part. Testing the solution, you can define a question and then execute the code below to run the search process. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># define OpenAI Prompt 
users_question = &quot;When do I have to submit my tax return?&quot;
# generate embeddings for the question
user_question_embeddings = create_embeddings_with_openai(user_question)
# search for the question in the cosmos db
search_results = vector_search(user_question_embeddings, 1)
print(search_results)
# prepare the results for the openai prompt
result_json = []
# print each document in the result
# remove all empty values from the results json
search_results = [x for x in search_results if x]
for doc in search_results:
    display(doc.get('_id'), doc.get('textContent'), doc.get('vectorContent')[0:5])
    result_json.append(doc.get('textContent'))
# create the prompt
prompt = create_prompt(user_question, result_json)
display(prompt)
# generate the response
response = openai_request(prompt)
display(f'User question: {users_question}')
display(f'OpenAI response: {response}')</pre></div>



<p class="wp-block-paragraph">&#8216;User question: When do I have to submit my tax return?&#8217;</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;disableCopy&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">'OpenAI response: When do I have to submit my tax return? \n\nAll natural persons who had their residence in the canton of Zurich on December 31, 2022, or who owned properties or business premises (or business operations) in the canton of Zurich, must submit a tax return for 2022 in the calendar year 2023. Taxpayers with a residence in another canton also have to submit a tax return for 2022 in the calendar year 2023 if they ended their tax liability in the canton of Zurich by giving up a property or business premises during the calendar year 2022. If you turned 18 in the tax period 2022 (persons born in 2004), you must submit your own tax return (for the tax period 2022) for the first time in the calendar year 2023.'</pre></div>



<p class="wp-block-paragraph">As of Mai 2023, the knowledge base of ChatGPT 3.5 is limited to the timeframe before September 2021. So it&#8217;s evident that the response of our custom ChatGPT solution is based on the individual information provided in the vector database. Remember that we did not fine-tune the GPT model, so the model itself does not inherently know anything about your private data and instead uses the data that was dynamically provided to it as part of the prompt. </p>



<h2 class="wp-block-heading">Real-world Applications of Chat with your data</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Custom ChatGPT boosts efficiency, personalizes services, and improves experiences across industries. Here are some examples:</p>



<ul class="wp-block-list">
<li><strong>Customer Support:</strong> Companies can use ChatGPT for 24/7 customer service. With data from manuals, FAQs, and support docs, it delivers fast, accurate answers, enhancing customer satisfaction and lessening staff workload.</li>



<li><strong>Healthcare</strong>: ChatGPT can respond to patient questions using medical texts and care guidelines. It offers data on symptoms, treatments, side effects, and preventive care, helping both healthcare providers and patients.</li>



<li><strong>Legal Sector</strong>: Law firms can use ChatGPT with legal texts, court decisions, and case studies for answering legal questions, offering case references, or explaining legal terms.</li>



<li><strong>Financial Services:</strong> Banks can use ChatGPT to extend their customer service and give customers advice based on their individual financial situation.</li>



<li><strong>E-Learning:</strong> Schools and e-learning platforms can use ChatGPT to tutor students. Using textbooks, notes, and research papers, it helps students understand complex topics, solve problems, or guide them through a course.</li>
</ul>



<p class="wp-block-paragraph">In short, any sector needing a large information database for queries or services can use custom ChatGPT. It enhances engagement and efficiency by offering personalized experiences.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Summary</h2>



<p class="wp-block-paragraph">In this comprehensive guide, we&#8217;ve journeyed through the fascinating process of creating a customized ChatGPT that lets users chat with your business data. We started with understanding the immense value a tailored ChatGPT brings to the table and dove into its ability to produce specialized responses sourced from a custom knowledge base. This tailored approach enhances user experiences, saves time, and bolsters productivity.</p>



<p class="wp-block-paragraph">We went behind the scenes to reveal the vital elements of crafting a custom ChatGPT: OpenAI&#8217;s GPT models, data embeddings, and vector databases like Cosmos DB for Mongo DB vCore. We clarified how these components synergize to transcend the token limitations inherent to GPT models. By integrating the components in Python, we broadened ChatGPT&#8217;s ability to answer queries based on your private knowledgebase, thereby offering contextually appropriate responses.</p>



<p class="wp-block-paragraph">I hope this tutorial was able to illustrate the business value of ChatGPT and its versatile utility across a variety of sectors, including customer service, healthcare, legal services, finance, e-learning, and CRM data analytics. Each instance emphasized the transformative potential of a personalized ChatGPT in delivering efficient, targeted solutions.</p>



<p class="wp-block-paragraph">I hope you found this helpful article. If you have any questions or remarks, please drop them in the comment section.</p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<ul class="wp-block-list">
<li><a href="https://learn.microsoft.com/en-us/azure/cosmos-db/introduction" target="_blank" rel="noreferrer noopener">Azure Cosmos DB</a></li>



<li><a href="https://openai.com/pricing" target="_blank" rel="noreferrer noopener">OpenAI pricing</a></li>



<li><a href="https://learn.microsoft.com/en-us/azure/cognitive-services/openai/">Azure OpenAI</a></li>



<li><a href="https://azure.microsoft.com/en-au/products/cognitive-services/openai-service" target="_blank" rel="noreferrer noopener">Semantic search</a></li>



<li><a href="https://platform.openai.com/docs/guides/embeddings" target="_blank" rel="noreferrer noopener">What are embeddings?</a></li>



<li><a href="https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/vector-search" target="_blank" rel="noreferrer noopener">Using vector search on embeddings in Azure Cosmos DB for MongoDB vCore</a></li>



<li>OpenAI ChatGPT helped to revise this article</li>



<li>Images created with <a href="https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F" target="_blank" rel="noreferrer noopener">Midjourney</a></li>
</ul>
<p>The post <a href="https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/">Building &#8220;Chat with your Data&#8221; Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/step-by-step-guide-to-building-your-own-chatgpt-on-a-custom-knowledge-base-in-python-leveraging-mongo-db-and-embeddings/13687/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">13687</post-id>	</item>
		<item>
		<title>How to Use Hierarchical Clustering For Customer Segmentation in Python</title>
		<link>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/</link>
					<comments>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Thu, 22 Dec 2022 18:50:14 +0000</pubDate>
				<category><![CDATA[Agglomerative Clustering]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Customer Segmentation]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Exploratory Data Analysis (EDA)]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Insurance]]></category>
		<category><![CDATA[Kaggle Competitions]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Marketing Automation]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[AI in Insurance]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Classic Machine Learning]]></category>
		<category><![CDATA[Digital Transformation]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=11335</guid>

					<description><![CDATA[<p>Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and ... <a title="How to Use Hierarchical Clustering For Customer Segmentation in Python" class="read-more" href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/" aria-label="Read more about How to Use Hierarchical Clustering For Customer Segmentation in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/">How to Use Hierarchical Clustering For Customer Segmentation in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and personalize your marketing efforts to meet the specific needs of each group. This can be especially useful for businesses with large customer bases, as it allows them to target their marketing efforts to specific segments rather than trying to appeal to everyone at once. Additionally, hierarchical clustering can help businesses identify common patterns and trends among their customers, which can be useful for targeting future marketing efforts and improving the overall customer experience. In this tutorial, we will use Python and the scikit-learn library to apply hierarchical (agglomerative) clustering to a dataset of customer data. </p>



<p class="wp-block-paragraph">The rest of this tutorial proceeds in two parts. The first part will discuss hierarchical clustering and how we can use it to identify clusters in a set of customer data. The second part is a hands-on Python tutorial. We will explore customer health insurance data and apply an agglomerative clustering approach to group the customers into meaningful segments. Finally, we will use a tree-like diagram called a dendrogram, which is helpful for visualizing the structure of the data. The resulting segments could inform our marketing strategies and help us better understand our customers. So let&#8217;s get started!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="896" height="510" data-attachment-id="12402" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" data-orig-size="896,510" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png" alt="isometric view of people customer segmentation using machine learning python tutorial" class="wp-image-12402" srcset="https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 896w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/isometric-view-of-people-customer-segmentation-using-machine-learning-python-tutorial-min.png 768w" sizes="(max-width: 896px) 100vw, 896px" /><figcaption class="wp-element-caption">Customer segmentation is a typical use case for clustering. Image generated with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Hierarchical Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">So what is hierarchical clustering? Hierarchical clustering is a method of cluster analysis that aims to build a hierarchy of clusters. It creates a tree-like diagram called a dendrogram, which shows the relationships between clusters. There are two main types of hierarchical clustering: agglomerative and divisive. </p>



<ol class="wp-block-list">
<li>Agglomerative hierarchical clustering: This is a bottom-up approach in which each data point is treated as a single cluster at the outset. The algorithm iteratively merges the most similar pairs of clusters until all data points are in a single cluster.</li>



<li>Divisive hierarchical clustering: This is a top-down approach in which all data points are treated as a single cluster at the outset. The algorithm iteratively splits the cluster into smaller and smaller subclusters until each data point is in its own cluster.</li>
</ol>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading">Agglomerative Clustering</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this article, we will apply the agglomerative clustering approach, which is a bottom-up approach to clustering. The idea is to initially treat each data point in a dataset as its own cluster and then combine the points with other clusters as the algorithm progresses. The process of agglomerative clustering can be broken down into the following steps:</p>



<ol class="wp-block-list">
<li>Start with each data point in its own cluster.</li>



<li>Calculate the similarity between all pairs of clusters.</li>



<li>Merge the two most similar clusters.</li>



<li>Repeat steps 2 and 3 until all the data points are in a single cluster or until a predetermined number of clusters is reached.</li>
</ol>



<p class="wp-block-paragraph">There are several ways to calculate the similarity between clusters, including using measures such as the Euclidean distance, cosine similarity, or the Jaccard index. The specific measure used can impact the results of the clustering algorithm.</p>



<p class="wp-block-paragraph">For details on how the clustering approach works, see the&nbsp;<a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">Wikipedia page</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="430" height="512" data-attachment-id="13027" data-permalink="https://www.relataly.com/mushrooms_and_fruits_pattern-min-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" data-orig-size="506,602" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="mushrooms_and_fruits_pattern-min" data-image-description="&lt;p&gt;Hierarchical clustering is an unsupversied way to classify things. &lt;/p&gt;
" data-image-caption="&lt;p&gt;Hierarchical clustering is an unsupversied way to classify things. &lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min-430x512.png" alt="Hierarchical clustering is an unsupversied way to classify things. " class="wp-image-13027" srcset="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 430w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 252w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 506w" sizes="(max-width: 430px) 100vw, 430px" /><figcaption class="wp-element-caption">Hierarchical clustering is an unsupervised technique to classify things based on patterns in their data. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">Hierarchical Clustering vs. K-means</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In a previous article, we have already discussed the popular <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" target="_blank" rel="noreferrer noopener">clustering approach k-means</a>. So how are k-means and hierarchical clustering different? Hierarchical clustering and k-means are both clustering algorithms that can be used to group similar data points together. However, there are several key differences between these two approaches:</p>



<ol class="wp-block-list">
<li><strong>The number of clusters:</strong> In k-means, the number of clusters must be specified in advance, whereas in hierarchical clustering, the number of clusters is not specified. Instead, hierarchical clustering creates a hierarchy of clusters, starting with each data point as its own cluster and then merging the most similar clusters until all data points are in a single cluster.</li>



<li><strong>Cluster shape:</strong> K-means produces clusters that are spherical, while hierarchical clustering produces clusters that can have any shape. This means that k-means is better suited for data that is well-separated into distinct, spherical clusters, while hierarchical clustering is more flexible and can handle more complex cluster shapes.</li>



<li><strong>Distance measure:</strong> K-means uses a distance measure, such as the Euclidean distance, to calculate the similarity between data points, while hierarchical clustering can use a variety of distance measures. This means that k-means is more sensitive to the scale of the features, while hierarchical clustering is less sensitive to the feature scale.</li>



<li><strong>Computational complexity:</strong> K-means is generally faster than hierarchical clustering, especially for large datasets. This is because k-means only requires a single pass through the data to assign data points to clusters, while hierarchical clustering requires multiple passes to merge clusters.</li>



<li><strong>Visualization: </strong>Hierarchical clustering produces a tree-like diagram called a &#8220;dendrogram.&#8221; The dendrogram shows the relationships between clusters. This can be useful for visualizing the structure of the data and understanding how clusters are related.</li>
</ol>



<p class="wp-block-paragraph">Next, let&#8217;s look at how we can implement a hierarchical clustering model in Python. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<p class="wp-block-paragraph"></p>
</div>
</div>



<h2 class="wp-block-heading">Customer Segmentation using Hierarchical Clustering in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this comprehensive guide, we explore the application of hierarchical clustering for effective customer segmentation using a customer dataset. This data-driven segmentation method enables businesses to identify distinct customer clusters based on various factors, including demographics, behaviors, and preferences.</p>



<p class="wp-block-paragraph">Customer segmentation is a strategic approach that splits a customer base into smaller, more manageable groups with similar characteristics. It aims to better understand the diverse needs and wants of different customer segments to enhance marketing strategies and product development.</p>



<p class="wp-block-paragraph">Applying customer segmentation through hierarchical clustering allows businesses to personalize their marketing messages, design targeted campaigns, and tailor products to meet the unique needs of each segment. This proactive approach can stimulate increased customer loyalty and sales.</p>



<p class="wp-block-paragraph">We begin by loading the customer data and selecting the relevant features we want to use for clustering. We then standardize the data using the StandardScaler from scikit-learn. Next, we apply hierarchical clustering using the AgglomerativeClustering method, specifying the number of clusters we want to create. Finally, we add the predictions to the original data as a new column and view the resulting segments by calculating the mean of each feature for each segment.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_bada6f-73"><a class="kb-button kt-button button kb-btn_43f94b-af kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/03%20Clustering/043%20Customer%20Segmentation%20using%20Hierarchical%20Clustering%20with%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_17702b-41 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="512" height="513" data-attachment-id="12366" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/the_future_of_the_healthcare_using_blockchain-min/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" data-orig-size="512,513" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="the_future_of_the_healthcare_using_blockchain-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" src="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png" alt="In this machine learning tutorial, we will run a hierarchical clustering algorithm on health data." class="wp-image-12366" srcset="https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 512w, https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/the_future_of_the_healthcare_using_blockchain-min.png 140w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The future of healthcare will see a tight collaboration between humans and AI. Image generated using&nbsp;Midjourney</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading">About the Customer Health Insurance Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, we will work with a public dataset on health_insurance_customer_data from kaggle.com. Download the <a href="https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset" target="_blank" rel="noreferrer noopener">CSV file from Kaggle</a> and copy it into the following path, starting from the folder with your python notebook: data/customer/</p>



<p class="wp-block-paragraph">The dataset is relatively simple and contains 1338 rows of insured customers. It includes the insurance charges, as well as demographic and personal information such as Age, Sex, BMI, Number of Children, Smoker, and Region. The dataset does not have any undefined or missing values.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before we start the coding part, ensure that you have set up your Python 3 environment and the required packages. If you don’t have an environment, follow&nbsp;this tutorial&nbsp;to set up the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda environment</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li>pandas</li>



<li>NumPy</li>



<li>matplotlib</li>



<li>scikit-learn</li>
</ul>



<p class="wp-block-paragraph">You can install packages using console commands:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">pip install &lt;package name&gt; 
conda install &lt;package name&gt; (if you are using the anaconda packet manager)</pre></div>



<h3 class="wp-block-heading">Step #1 Load the Data</h3>



<p class="wp-block-paragraph">To begin, we need to load the required packages and the data we want to cluster. We will load the data by reading the CSV file via the pandas library. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># import necessary libraries
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from pandas.api.types import is_string_dtype
import pandas as pd
import math
import seaborn as sns

# load customer data
customer_df = pd.read_csv(&quot;data/customer/customer_health_insurance.csv&quot;)
customer_df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	age	sex		bmi		children	smoker	region		charges
0	19	female	27.90	0			yes		southwest	16884.9240
1	18	male	33.77	1			no		southeast	1725.5523
2	28	male	33.00	3			no		southeast	4449.4620</pre></div>



<h3 class="wp-block-heading">Step #2 Explore the Data</h3>



<p class="wp-block-paragraph">Next, it is a good idea to explore the data and get a sense of its structure and content. This can be done using a variety of methods, such as examining the shape of the dataframe, checking for missing values, and plotting some basic statistics. For example, the following plots will explore the relationships between some of the variables. We won&#8217;t go into too much detail here.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_kdeplot(df, column_name, target_name):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis=&quot;x&quot;, rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()

# make kde plot for ext_color 
make_kdeplot(customer_df, 'smoker', 'charges')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11363" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-17-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" data-orig-size="833,571" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-17" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png" alt="" class="wp-image-11363" width="567" height="389" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 833w, https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-17.png 768w" sizes="(max-width: 567px) 100vw, 567px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># make kde plot for ext_color 
make_kdeplot(customer_df, 'sex', 'charges')</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11364" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-44-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" data-orig-size="846,571" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-44" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png" alt="" class="wp-image-11364" width="572" height="386" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 846w, https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-44.png 768w" sizes="(max-width: 572px) 100vw, 572px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sns.lmplot(x=&quot;charges&quot;, y=&quot;age&quot;, hue=&quot;smoker&quot;, data=customer_df, aspect=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="11365" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-45/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png" data-orig-size="1067,489" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-45" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-45-1024x469.png" alt="" class="wp-image-11365" width="700" height="321" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-45.png 1067w" sizes="(max-width: 700px) 100vw, 700px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def make_boxplot(customer_df, x,y,h):
    fig, ax = plt.subplots(figsize=(10,4))
    box = sns.boxplot(x=x, y=y, hue=h, data=customer_df)
    box.set_xticklabels(box.get_xticklabels())
    fig.subplots_adjust(bottom=0.2)
    plt.tight_layout()

make_boxplot(customer_df, &quot;smoker&quot;, &quot;charges&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11366" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-46/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-46" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png" alt="" class="wp-image-11366" width="675" height="266" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-46.png 768w" sizes="(max-width: 675px) 100vw, 675px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_boxplot(customer_df, &quot;region&quot;, &quot;charges&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11367" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-47-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-47" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png" alt="" class="wp-image-11367" width="693" height="273" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-47.png 768w" sizes="(max-width: 693px) 100vw, 693px" /></figure>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">make_boxplot(customer_df, &quot;children&quot;, &quot;bmi&quot;, &quot;sex&quot;)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="11368" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-48-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" data-orig-size="989,390" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-48" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png" alt="" class="wp-image-11368" width="705" height="278" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 989w, https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-48.png 768w" sizes="(max-width: 705px) 100vw, 705px" /></figure>



<p class="wp-block-paragraph">Next, let&#8217;s prepare the data for model training. </p>



<h3 class="wp-block-heading" id="h-step-3-prepare-the-data">Step #3 Prepare the Data</h3>



<p class="wp-block-paragraph">Before we can train a model on the data, we must prepare it for modeling. This typically involves selecting the relevant features, handling missing values, and scaling the data. However, we are using a very simple dataset that already has good data quality. Therefore we can limit our data preparation activities to encoding the labels and scaling the data. </p>



<p class="wp-block-paragraph">To encode the categorical values, we will use label encoder from the scikit-learn library.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># encode categorical features
label_encoder = LabelEncoder()

for col_name in customer_df.columns:
    if (is_string_dtype(customer_df[col_name])):
        customer_df[col_name] = label_encoder.fit_transform(customer_df[col_name])
customer_df.head(3)</pre></div>



<p class="wp-block-paragraph">Next, we will scale the numeric variables. While scaling the data is an essential preprocessing step for many machine learning algorithms to work effectively, it is generally not necessary for hierarchical clustering. This is because hierarchical clustering is not sensitive to the scale of the features. However, when you use certain distance measures, such as Euclidean distance, scaling the data might still be useful when performing hierarchical clustering. Scaling the data can help to ensure that all of the features are given equal weight. This can be useful if you want to avoid giving more weight to features with larger scales.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># select features
X = customer_df # we will select all features

# standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">array([[-1.43876426, -1.0105187 , -0.45332   , ...,  1.34390459,
         0.2985838 ,  1.97058663],
       [-1.50996545,  0.98959079,  0.5096211 , ...,  0.43849455,
        -0.95368917, -0.5074631 ],
       [-0.79795355,  0.98959079,  0.38330685, ...,  0.43849455,
        -0.72867467, -0.5074631 ],
       ...,
       [-1.50996545, -1.0105187 ,  1.0148781 , ...,  0.43849455,
        -0.96159623, -0.5074631 ],
       [-1.29636188, -1.0105187 , -0.79781341, ...,  1.34390459,
        -0.93036151, -0.5074631 ],
       [ 1.55168573, -1.0105187 , -0.26138796, ..., -0.46691549,
         1.31105347,  1.97058663]])</pre></div>



<h3 class="wp-block-heading">Step #4 Train the Hierarchical Clustering Algorithm</h3>



<p class="wp-block-paragraph">To train a hierarchical clustering model using scikit-learn, we can use the AgglomerativeClustering or Ward class. The main parameters for these classes are:</p>



<ul class="wp-block-list">
<li><strong>n_clusters: </strong>The number of clusters to form. This parameter is required for AgglomerativeClustering but is not used for <code>Ward</code>.</li>



<li><strong>affinity: </strong>The distance measure used to calculate the similarity between pairs of samples. This can be any of the distance measures implemented in scikit-learn, such as the Euclidean distance or the cosine similarity.</li>



<li>l<strong>inkage: </strong>The method used to calculate the distance between clusters. This can be one of &#8220;ward,&#8221; &#8220;complete,&#8221; &#8220;average,&#8221; or &#8220;single.&#8221;</li>



<li><strong>distance_threshold:</strong> The maximum distance between two clusters that allows them to be merged. This parameter is only used in the AgglomerativeClustering class.</li>
</ul>



<p class="wp-block-paragraph">To train the model, we specify the desired parameters and fit the model to the data using the fit_predict method. This method will fit the model to the data and generate predictions in one step.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># apply hierarchical clustering 
model = AgglomerativeClustering(affinity='euclidean')
predicted_segments = model.fit_predict(X_scaled)</pre></div>



<p class="wp-block-paragraph">Now we have a trained clustering model also predicted the segments for our data.</p>



<h3 class="wp-block-heading">Step #5 Visualize the Results</h3>



<p class="wp-block-paragraph">After the model is trained, we can visualize the results to get a better understanding of the clusters that were formed. There is a wide range of plots and tools to visualize clusters. In this tutorial, we will use a scatterplot and a dendrogram. </p>



<h4 class="wp-block-heading">5.1 Scatterplot</h4>



<p class="wp-block-paragraph">For this, we can use the lmplot function in Seaborn. The lmplot creates a 2D scatterplot with an optional overlay of a linear regression model. The plot visualizes the relationship between two variables and fits a linear regression model to the data that can highlight differences. In the following, we use this linear regression model to highlight the differences between our two cluster segments and the age of the customers. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># add predictions to data as a new column
customer_df['segment'] = predicted_segments

# create a scatter plot of the first two features, colored by segment
sns.lmplot(x=&quot;charges&quot;, y=&quot;age&quot;, hue=&quot;segment&quot;, data=customer_df, aspect=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="470" data-attachment-id="11370" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-49-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png" data-orig-size="1065,489" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-49" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-49-1024x470.png" alt="" class="wp-image-11370" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-49.png 1065w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p class="wp-block-paragraph">We can see that our model has determined two clusters in our data. The clusters seem to correspond well with the smoker category, which indicates that this attribute is decisive in forming relevant groups.</p>



<h4 class="wp-block-heading" id="h-5-2-dendrogram">5.2 Dendrogram</h4>



<p class="wp-block-paragraph">The hierarchical clustering approach lets us visualize relationships between different groups in our dataset in a dendrogram. A dendrogram is a graphical representation of a hierarchical structure, such as the relationships between different groups of objects or organisms. It is typically used in biology to show the relationships between different species or taxonomic groups, but it can also be used in other fields to represent the hierarchical structure of any set of data. In a dendrogram, the objects or groups being studied are represented as branches on a tree-like diagram. The branches are usually labeled with the names of the objects or groups, and the lengths of the branches represent the distances or dissimilarities between the objects or groups. The branches are also arranged in a hierarchical manner, with the most closely related objects or groups being placed closer together and the more distantly related ones being placed farther apart.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualize data similarity in a dendogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx &lt; n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, orientation='right',**kwargs)


plt.title(&quot;Hierarchical Clustering Dendrogram&quot;)
# plot the top three levels of the dendrogram
plot_dendrogram(cluster_model, truncate_mode=&quot;level&quot;, p=4)
plt.xlabel(&quot;Euclidean Distance&quot;)
plt.ylabel(&quot;Number of points in node (or index of point if no parenthesis).&quot;)
plt.show()</pre></div>



<p class="wp-block-paragraph">Source: This code block is based on code <a href="https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html" target="_blank" rel="noreferrer noopener">from the scikit-learn page</a></p>



<figure class="wp-block-image size-full"><img decoding="async" width="575" height="453" data-attachment-id="11396" data-permalink="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/image-53-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" data-orig-size="575,453" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-53" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png" alt="" class="wp-image-11396" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-53.png 575w, https://www.relataly.com/wp-content/uploads/2022/12/image-53.png 300w" sizes="(max-width: 575px) 100vw, 575px" /></figure>



<h2 class="wp-block-heading">Summary</h2>



<p class="wp-block-paragraph">In conclusion, hierarchical clustering is a powerful tool for customer segmentation that can help businesses better understand their customer base and target their marketing efforts more effectively. By grouping customers into clusters based on their characteristics and behaviors, companies can create targeted campaigns and personalize their marketing efforts to better meet the needs of each group. Using Python and the scikit-learn library, we were able to apply an agglomerative clustering approach to a dataset of customer data and identify two distinct segments. We can then use these segments to inform our marketing strategies and get a better understanding of our customers.</p>



<p class="wp-block-paragraph">By the way, customer segmentation is an area where real-world data can be prone to bias and unfairness. If you&#8217;re concerned about this, check out our latest article on <a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">addressing fairness in machine learning with fairlearn</a>.</p>



<p class="wp-block-paragraph">I hope this article was useful. If you have any feedback, please write your thoughts in the comments. </p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Articles</p>



<ul class="wp-block-list">
<li><a href="https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html" target="_blank" rel="noreferrer noopener">https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html</a></li>



<li>Images generated with OpenAI Dall-E and Midjourney.</li>
</ul>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Clustering</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3Gb5kfj" target="_blank" rel="noreferrer noopener">&#8220;Data Clustering: Algorithms and Applications&#8221; by Charu C. Aggarwal</a>: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.</li>



<li><a href="https://amzn.to/3WmhGXB" target="_blank" rel="noreferrer noopener">&#8220;Data Mining: Practical Machine Learning Tools and Techniques&#8221; by Ian H. Witten and Eibe Frank</a>: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.</li>
</ul>



<div style="display: inline-block;">
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=0128042915&amp;asins=0128042915&amp;linkId=1e9fe160a76f7255e3eea8e0119ca74f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=B00EYROAQU&amp;asins=B00EYROAQU&amp;linkId=ba1fcb8a59417e729afe33f6eceb2a9f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Machine Learning</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning</a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ul>



<div style="display: inline-block;">

  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph"><strong>Relataly articles on clustering and machine learning</strong></p>



<ul class="wp-block-list">
<li><a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" target="_blank" rel="noreferrer noopener">Simple Clustering using K-means in Python</a>: This article gives an overview of cluster analysis with k-means.</li>



<li><a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">Clustering crypto markets using affinity propagation in Python</a>: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.</li>



<li><a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">Addressing fairness in machine learning with the fairlearn library</a></li>
</ul>



<p class="wp-block-paragraph"></p>
<p>The post <a href="https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/">How to Use Hierarchical Clustering For Customer Segmentation in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/customer-segmentation-using-hierarchical-clustering-in-python/11335/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">11335</post-id>	</item>
		<item>
		<title>Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</title>
		<link>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/</link>
					<comments>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 02 May 2022 18:34:02 +0000</pubDate>
				<category><![CDATA[Affinity Propagation (Clustering)]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Coinmarketcap API]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Covariance]]></category>
		<category><![CDATA[Crypto Exchange APIs]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Dimensionality Reduction]]></category>
		<category><![CDATA[Finance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Stock Market Forecasting]]></category>
		<category><![CDATA[Time Series Forecasting]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[AI in Finance]]></category>
		<category><![CDATA[Cryptocurrencies]]></category>
		<category><![CDATA[Financial Analysis]]></category>
		<category><![CDATA[Stock Market Cluster Analysis]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=8114</guid>

					<description><![CDATA[<p>Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos. By analyzing historical price data, affinity propagation groups coins into clusters based on their past ... <a title="Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python" class="read-more" href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" aria-label="Read more about Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/">Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Affinity propagation is a powerful unsupervised clustering technique that can identify hidden patterns in large datasets. In the cryptocurrency world, where new coins are constantly emerging and prices can be highly volatile, affinity propagation can help investors simplify the chaos.</p>



<p class="wp-block-paragraph">By analyzing historical price data, affinity propagation groups coins into clusters based on their past price fluctuations. Such a cluster analysis enables crypto investors to identify promising entry and exit points, ultimately helping them make smarter investment decisions.</p>



<p class="wp-block-paragraph">To use this technique effectively, it&#8217;s important to understand essential concepts such as covariance, lasso regression, and affinity propagation. Once you understand these concepts, you can apply them to analyze price time series data and identify hidden patterns.</p>



<p class="wp-block-paragraph">Finally, visualizing the results in two and three dimensions can better understand the relationships between coins and their respective clusters. The resulting crypto market map can be a powerful tool for investors to gain insight into the market&#8217;s structure and make informed investment decisions.</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<div class="wp-block-kadence-infobox kt-info-box_317393-a1"><span class="kt-blocks-info-box-link-wrap info-box-link kt-blocks-info-box-media-align-top kt-info-halign-left"><div class="kt-infobox-textcontent"><h2 class="kt-blocks-info-box-title">Disclaimer</h2><p class="kt-blocks-info-box-text">This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only serve the purpose of illustrating machine learning use cases.</p></div></span></div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">What is Stock Market Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Clustering stock markets refers to grouping stocks based on their similarities or common characteristics. This can be done using various clustering algorithms, which analyze the data and assign each stock market to a cluster based on its similarity to other stock markets in the same cluster. In this article, we will run a cluster analysis on historical time series data. This approach involves grouping stocks into clusters based on their historical performance over a certain period of time. </p>



<p class="wp-block-paragraph">Clustering stock market data can be useful for a variety of purposes, such as identifying patterns or trends in the data, comparing the performance of different stocks or sectors, or generating investment recommendations. However, it&#8217;s important to keep in mind that clustering is just one tool among many for analyzing stock market data, and it&#8217;s important to consider a range of factors when making investment decisions. It can also be used to compare the performance of different stock markets and identify potential risks or correlations between them.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/cryptocurrency-price-charts-with-color-overlay-python/2820/" target="_blank" rel="noreferrer noopener">Color-Coded Cryptocurrency Price Charts in Python</a></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="509" height="506" data-attachment-id="12694" data-permalink="https://www.relataly.com/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" data-orig-size="509,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="neural network machine learning python affinity propagation midjourney relataly crypto-min" data-image-description="&lt;p&gt;neural network machine learning python affinity propagation midjourney relataly crypto-min&lt;/p&gt;
" data-image-caption="&lt;p&gt;neural network machine learning python affinity propagation midjourney relataly crypto-min&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png" alt="neural network machine learning python midjourney relataly crypto market map" class="wp-image-12694" srcset="https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 509w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/neural-network-machine-learning-python-affinity-propagation-midjourney-relataly-crypto-min.png 140w" sizes="(max-width: 509px) 100vw, 509px" /><figcaption class="wp-element-caption">We can use a crypto market map to illustrate the price correlation between cryptocurrencies.</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What&#8217;s the Problem with Prototype-based Clustering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Clustering is an unsupervised learning technique that groups similar objects into clusters and separates them from different ones. One of the most popular clustering techniques is <a href="https://www.relataly.com/category/machine-learning-algorithms/k-means/" target="_blank" rel="noreferrer noopener">k-means</a>. K-means belongs to the so-called prototype-based clustering techniques, which divide data points into a predefined number of groups (in the case of k-means, the groups are of equal variance). </p>



<p class="wp-block-paragraph">The prototype-based clustering approach works great if the number of clusters in a dataset is known and the clusters have similar despair. However, when we deal with real-world problems, we often encounter more complex data for which the optimal number of clusters is unknown and difficult or even impossible to guess. In such a case, affinity propagation has a significant advantage because it can automatically estimate the number of clusters. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Affinity Propagation: What it is and How it Works </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The idea of affinity propagation is to identify clusters by measuring the similarity of data points relative to one another. The algorithm chooses data points as cluster centers that best represent other data points near them. </p>



<p class="wp-block-paragraph">We can imagine the process of identifying these representative data points as an election. Each data point (i) is a voter who casts votes and a candidate (k) who can receive votes from other voters. Votes are a measure of the similarity of data points. A voter who gives many votes to a candidate expresses that this data point is similar to him and therefore is suitable for representing him as a cluster center. The voting process continues until the algorithm reaches a consensus and selects a set number of cluster candidates.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="383" data-attachment-id="8208" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png" data-orig-size="1584,592" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-5" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-5-1024x383.png" alt="Affinity Propagation: Data points cast votes for candidates and receive votes from other data points " class="wp-image-8208" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1536w, https://www.relataly.com/wp-content/uploads/2022/05/image-5.png 1584w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Affinity Propagation: Data points cast votes for candidates and receive votes from other data points </figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">The clustering process involves many separate steps (<a href="https://towardsdatascience.com/unsupervised-machine-learning-affinity-propagation-algorithm-explained-d1fef85f22c8" target="_blank" rel="noreferrer noopener">This article</a> provides a detailed description of the steps involved) and works with several matrices: </p>



<ul class="wp-block-list">
<li>The similarity matrix assesses the suitability of data points (candidates) to act as cluster centers.</li>



<li>The availability matrix (or responsibility matrix) collects the support of the data points for the candidates (potential cluster centers) and their suitability to represent them.</li>



<li>The criterion matrix sums up the results and defines the clusters. Data points with equal scores in the criterion matrix are considered part of the same cluster.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="691" height="334" data-attachment-id="8274" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-9/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" data-orig-size="691,334" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-9" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" src="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png" alt="Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster" class="wp-image-8274" srcset="https://www.relataly.com/wp-content/uploads/2022/05/image-9.png 691w, https://www.relataly.com/wp-content/uploads/2022/05/image-9.png 300w" sizes="(max-width: 691px) 100vw, 691px" /><figcaption class="wp-element-caption">Criterion Matrix: Data Points (Cryptos) with equal numbers are part of the same cluster</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading" id="h-time-series-clustering-using-affinity-propagation-visualizing-cryptocurrency-market-structures-in-python">Time Series Clustering using Affinity Propagation &#8211; Visualizing Cryptocurrency Market Structures in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Ready to implement affinity propagation in Python to analyze the crypto market structure and create a visual representation of price similarity? Let&#8217;s dive in!</p>



<p class="wp-block-paragraph">First, we define a portfolio of cryptocurrencies and download their historical price quotes from coinmarketcap. We then visualize the time series on separate line charts to ensure that the data has been loaded successfully. After preparing and cleaning the data, we can move on to clustering the cryptocurrencies into groups with similar price movements using Affinity Propagation.</p>



<p class="wp-block-paragraph">Unlike other clustering algorithms, we don&#8217;t set the number of clusters in advance. Instead, we let affinity propagation determine the optimal number of clusters for our portfolio. Finally, we calculate the covariance matrix between clusters and arrange the cryptocurrencies on a 2D map into clusters. We create a network overlay based on covariance to better understand the relationships between different clusters.</p>



<p class="wp-block-paragraph">With affinity propagation, we can identify hidden patterns in the crypto market and group coins into clusters based on their past price fluctuations. This process allows us to identify promising entry and exit points, ultimately helping us make smarter investment decisions. Plus, the 2D map and network overlay help us visualize the relationships between different clusters and coins.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="456" height="509" data-attachment-id="10401" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/image-33-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" data-orig-size="456,509" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-33" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png" alt="exemplary price correlation map created with the help of the affinity propagation clustering algorithm, python scikit-learn" class="wp-image-10401" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-33.png 456w, https://www.relataly.com/wp-content/uploads/2022/12/image-33.png 269w" sizes="(max-width: 456px) 100vw, 456px" /><figcaption class="wp-element-caption">We can use affinity propagation to cluster financial assets and visualize them on a map.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The Python code for this tutorial is available in the relataly repository on GitHub.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_03b447-31"><a class="kb-button kt-button button kb-btn_03610b-ed kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/00%20Data%20Visualization/042%20Vizualizing%20Stock%20Market%20Structures%20using%20Cluster%20Analysis%20in%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_81f956-88 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Before beginning the coding part, ensure that you have set up your Python 3 environment and required packages. Consider <a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda </a>if you don&#8217;t have a Python environment set up yet. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></li>



<li><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></li>



<li><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></li>



<li><a href="https://seaborn.pydata.org/api.html" target="_blank" rel="noreferrer noopener">Seaborn</a></li>
</ul>



<p class="wp-block-paragraph">Please also make sure you have the <a href="https://pypi.org/project/cryptocmd/" target="_blank" rel="noreferrer noopener">Cmcscaper</a> package installed. We will be using it to download past crypto prices from coinmarketcap.</p>



<p class="wp-block-paragraph">You can install these packages using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-stock-market-data">Step #1: Load the Stock Market Data</h3>



<p class="wp-block-paragraph">We start by loading historical crypto price data from Coinmarketcap. To download the data, we use Cmcscraper, a Python library that allows us to collect Coinmarketcap data without signing up for the official API.</p>



<p class="wp-block-paragraph">The download returns a dataframe with daily price quotes (Close, Open, Avg) for cryptocurrencies between 2016 and today. You can use the dictionary (&#8220;symbol_dict&#8221;) to control which cryptos you want to include in the data. We limit the data we use in our cluster analysis to the last 50 days. In this way, we let the correlation consider earlier price developments. But it&#8217;s up to you to specify a different period. In addition, instead of using absolute price values, we will use daily percentage fluctuations.</p>



<p class="wp-block-paragraph">Also: <a href="https://www.relataly.com/streaming-crypto-prices-via-the-gate-io-api-with-python/3982/" target="_blank" rel="noreferrer noopener">Requesting Crypto Price Data from the Gate.io REST API in Python</a></p>



<p class="wp-block-paragraph">Loading the data can take several minutes, depending on how many cryptocurrencies we include in the request. So it makes sense not to load the data every time you run the code. Therefore, the code below stores the historical prices in a CSV file. </p>



<p class="wp-block-paragraph">The script will check if the data already exists if you run the code below. If it does, it will use the data from the CSV file. Otherwise, it will load a fresh copy of the data from coinmarketcap.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com
# Tested with Python 3.8.8, Matplotlib 3.5, Scikit-learn 0.24.1, Seaborn 0.11.1, numpy 1.19.5

from cryptocmd import CmcScraper
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn import cluster, covariance, manifold
import requests
import json


#get a dictionary of the top 100 coin symbols and names from an API
def get_symbol_dict():
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&amp;limit=50&amp;sortBy=market_cap&amp;sortType=desc&amp;convert=USD&amp;cryptoType=all&amp;tagType=all&amp;audited=false'
    response = requests.get(url)
    data = json.loads(response.text)
    df = pd.DataFrame(data['data']['cryptoCurrencyList'])

    # exclude stable coins
    df = df[~df['symbol'].isin(['USDT', 'USDC', 'BUSD', 'DAI', 'TUSD', 'PAX', 'GUSD', 'HUSD', 'USDK', 'USDS', 'USDP', 'USDN', 'USDSB', 'USDX', 'USD++', 'BIDR', 'IDRT', 'VAI', 'BGBP'])]
    df = df[['symbol', 'name']]
    df = df.set_index('symbol')
    df = df.to_dict()
    df = df['name']
    return df

symbol_dict = get_symbol_dict()


# Download historic crypto prices via CmcScraper
def load_fresh_data_and_save_to_disc(symbol_dict, save_path):
    # Extract symbols and names from the symbol_dict
    symbols, names = np.array(sorted(symbol_dict.items())).T
    
    # Initialize an empty DataFrame for storing the prices
    df_crypto = pd.DataFrame()

    # Download and process the price data for each symbol
    for symbol in symbols:
        print(f&quot;Fetching prices for {symbol}...&quot;)
        
        # Download the price data using CmcScraper
        scraper = CmcScraper(symbol)
        df_coin_prices = scraper.get_dataframe()

        # Process the price data and add it to df_crypto
        df = pd.DataFrame({
            f&quot;{symbol}_Open&quot;: df_coin_prices[&quot;Open&quot;],
            f&quot;{symbol}_Close&quot;: df_coin_prices[&quot;Close&quot;],
            f&quot;{symbol}_Avg&quot;: (df_coin_prices[&quot;Close&quot;] + df_coin_prices[&quot;Open&quot;]) / 2,
            f&quot;{symbol}_p&quot;: (df_coin_prices[&quot;Open&quot;] - df_coin_prices[&quot;Close&quot;]) / df_coin_prices[&quot;Open&quot;]
        })
        df_crypto = pd.concat([df_crypto, df], axis=1)

    # Save the price data to a CSV file
    X_df_filtered = df_crypto.filter(like=&quot;_p&quot;)
    X_df_filtered.to_csv(save_path + &quot;historical_crypto_prices.csv&quot;)

    return names, symbols, X_df_filtered
        

# If set to False the data will only be downloaded when you execute the code
# Set to True, if you want a fresh copy of the data.  
fetch_new_data = True 
save_path = '' # path where the price data will be stored in a csv file

# Fetch fresh data via the scraping package, or use data from the csv file on disk
if fetch_new_data == False:
    try:
        print('loading from disk')
        X_df_filtered = pd.read_csv(save_path + 'historical_crypto_prices.csv')
        if 'Unnamed: 0' in X_df_filtered.columns: 
            X_df_filtered = X_df_filtered.drop(['Unnamed: 0'], axis=1)
            symbols, names = np.array(sorted(symbol_dict.items())).T
        print(list(X_df_filtered.columns))
    except:
        print('no existing price data found - loading fresh data from coinmarketcap and saving them to disk')
        names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
        print(list(symbols))
else:
       print('loading fresh data from coinmarketcap and saving them to disk')
       names, symbols, X_df_filtered = load_fresh_data_and_save_to_disc(symbol_dict, save_path)
       print(list(symbols))

# Limit the price data to the last t days
t= 14 # in days
X_df_filtered = X_df_filtered[:t]
X_df_filtered.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	ACM_p		ADA_p		ARK_p		ATM_p		ATOM_p		AVAX_p		BAT_p		BCH_p		BLZ_p		BNB_p		...	THETA_p		UNI_p		USDT_p		VET_p		WAVES_p		XLM_p		XMR_p		XRP_p		ZIL_p		ZRX_p
0	0.031987	-0.037645	-0.005702	0.030928	-0.005897	-0.012404	-0.012262	-0.022529	0.008072	-0.007111	...	-0.021994	-0.023758	-0.000103	-0.021024	-0.015416	-0.004096	-0.022988	-0.027397	-0.016659	-0.012255
1	0.028192	0.065034	0.122306	0.010310	0.093558	0.106811	0.082863	0.075567	0.062105	0.054733	...	0.067264	0.081040	0.000136	0.077203	0.092987	0.078562	0.111519	0.071696	0.076484	0.085094
2	0.040771	0.016097	-0.133345	0.018963	0.011304	-0.033328	-0.007616	0.011458	-0.019993	0.005134	...	-0.005104	-0.024190	0.000077	0.002218	0.008920	0.004139	-0.031822	-0.012107	-0.003906	-0.021170
3	-0.027698	0.005129	-0.031516	-0.002639	0.022235	-0.008117	0.003969	0.019119	0.015403	0.005920	...	0.007992	0.027203	0.000003	0.000701	0.010739	0.005324	-0.007914	0.007168	0.004556	-0.003786
4	-0.021129	-0.019053	0.003273	-0.008121	0.002883	-0.004927	0.002548	-0.000599	0.028492	-0.012181	...	0.000198	-0.025817	-0.000047	-0.002800	-0.051515	-0.004861	0.015134	-0.000596	-0.010343	0.004530</pre></div>



<p class="wp-block-paragraph">The data looks good, so let&#8217;s continue.</p>



<h3 class="wp-block-heading">Step #2 Plotting Crypto Price Charts</h3>



<p class="wp-block-paragraph">Now that the data is available, we can visualize it in various line graphs. The visualization helps us better understand what kind of data we are dealing with and check if the download was successful.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create Prices Charts for all Cryptocurrencies
list_length = X_df_filtered.shape[1]
ncols = 10
nrows = int(round(list_length / ncols, 0))
height = list_length/3 if list_length &gt; 30 else 4
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, sharex=True, sharey=True, figsize=(20, height))
for i, ax in enumerate(fig.axes):
        if i &lt; list_length:
            sns.lineplot(data=X_df_filtered, x=X_df_filtered.index, y=X_df_filtered.iloc[:, i], ax=ax)
            ax.set_title(X_df_filtered.columns[i])
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8232" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/price-charts-stock-price-prediction/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png" data-orig-size="1176,790" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="price-charts-stock-price-prediction" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png" src="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction-1024x688.png" alt="Clustering Crypto Market Structures with Affinity Propagation: Daily Price Quotes for different Cryptocurrencies" class="wp-image-8232" width="937" height="629" srcset="https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/price-charts-stock-price-prediction.png 1176w" sizes="(max-width: 937px) 100vw, 937px" /></figure>



<p class="wp-block-paragraph">We can see the lineplots for all cryptocurrencies and everything looks as expected.</p>



<h3 class="wp-block-heading" id="h-step-3-clustering-cryptocurrencies-using-affinity-propagation">Step #3 Clustering Cryptocurrencies using Affinity Propagation</h3>



<p class="wp-block-paragraph">Next, we must prepare the data and run the affinity propagation algorithm. For some cryptocurrencies, we may encounter data that contains NaN values. Because clustering is sensitive to missing values, we must ensure good data quality. In addition, the Python code below will convert the DataFrame into a NumPy array and transpose it into a form where we have crypto assets as records and the days as columns.</p>



<p class="wp-block-paragraph">Running the code below returns a dictionary of clusters with the cryptocurrencies assigned to them by the affinity propagation algorithm.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop NaN values
X_df = pd.DataFrame(np.array(X_df_filtered)).dropna()
# Transpose the data to structure prices along columns
X = X_df.copy()
X /= X.std(axis=0)
X = np.array(X)
# Define an edge model based on covariance
edge_model = covariance.GraphicalLassoCV()
# Standardize the time series
edge_model.fit(X)
# Group cryptos to clusters using affinity propagation
# The number of clusters will be determined by the algorithm
cluster_centers_indices , labels = cluster.affinity_propagation(edge_model.covariance_, random_state=1)
cluster_dict = {}
n_labels = labels.max()
print(f&quot;{n_labels} Clusters&quot;)
for i in range(n_labels + 1):
    clusters = ', '.join(names[labels == i])
    print('Cluster %i: %s' % ((i + 1), clusters))
    cluster_dict[i] = (clusters)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">9 Clusters
Cluster 1: Binance Coin, Cake Defi
Cluster 2: Bitcoin Cash, Bitcoin, BitTorrent, Decred, EOS, Ethereum Classic, Ethereum, Ampleforth, Komodo, Solana, Sys Coin, DOT
Cluster 3: Celsius
Cluster 4: Doge Coin
Cluster 5: Cardano, ATOM, Avalance, Enjin, Internet Computer, Link, Loopring, Polygon, IOTA, NEO, Synthetix, Theta, Vechain
Cluster 6: Litecoin
Cluster 7: ACM Token, Atletico Madrid Token, Chilliz, Juventus Turin Token, PSG Token
Cluster 8: LRC
Cluster 9: Tether
Cluster 10: ARK, Battoken, BLZ, Digibyte, AS Rom Token, WAVES, Stellar Lumen, Monero, Ripple, Zilliqa, Zer0</pre></div>



<p class="wp-block-paragraph">We can see that the algorithm has identified 13 different clusters in the data and a couple of clusters with only a single member. You will most likely encounter different results depending on when you run it. </p>



<h3 class="wp-block-heading">Step #4 Create a 2D Positioning Model based on the Graph Structure</h3>



<p class="wp-block-paragraph">In addition to clusters, we want to show the covariance between cryptocurrencies in our Crypto Market map. We need a graph-like structure that contains the covariance and position data of the cryptocurrencies for each crypto pair.</p>



<p class="wp-block-paragraph">In addition, we use a node position model that calculates their relative position on a 2D plane from the covariance of the cryptocurrencies. However, the positions are only relative, so the absolute axes have no meaning.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a node_position_model that find the best position of the cryptos on a 2D plane
# The number of components defines the dimensions in which the nodes will be positioned
node_position_model = manifold.LocallyLinearEmbedding(n_components=2, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Create an edge_model that represents the partial correlations between the nodes
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
# Only consider partial correlations above a specific threshold (0.02)
non_zero = (np.abs(np.triu(partial_correlations, k=1)) &gt; 0.02)
# Convert the Positioning Model into a DataFrame
data = pd.DataFrame.from_dict({&quot;embedding_x&quot;:embedding[0],&quot;embedding_y&quot;:embedding[1]})
# Add the labels to the 2D positioning model
data[&quot;labels&quot;] = labels
print(data.shape)
data.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(48, 3)
	embedding_x	embedding_y	labels
0	0.400590	-0.136473	6
1	-0.081908	-0.086039	4
2	-0.033982	-0.038526	9
3	0.416745	0.076849	6
4	-0.041938	0.031966	4</pre></div>



<p class="wp-block-paragraph">The next step is to create a graph of the partial correlations. </p>



<h3 class="wp-block-heading">Step #5 Visualize the Crypto Market Structure</h3>



<p class="wp-block-paragraph">Our goal is to visualize differences in the covariance between crypto pairs by varying the connection strengths. We calculate the line strength by normalizing the covariance of the crypto pairs. In addition, we visualize the distribution of the covariance. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create an array with the segments for connecting the data points
start_idx, end_idx = np.where(non_zero) 
segments = [[np.array([embedding[:, start], embedding[:, stop]]).T, start, stop] for start, stop in zip(start_idx, end_idx)]
# Create a normalized representation of partial correlation between crypto currencies
# We can later use covariance to vizualize the strength of the connections
pc = np.abs(partial_correlations[non_zero])
normalized = (pc-min(pc))/(max(pc)-min(pc))
# plot the distribution of covariance between the cryptocurrencies
sns.histplot(pc)</pre></div>



<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="8251" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/covariance-histogram/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" data-orig-size="382,248" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="covariance-histogram" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" src="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png" alt="cryptocurrency market structure visualization with affinity propagation, histogram of covariance between historical crypto prices" class="wp-image-8251" width="540" height="351" srcset="https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png 382w, https://www.relataly.com/wp-content/uploads/2022/05/covariance-histogram.png 300w" sizes="(max-width: 540px) 100vw, 540px" /></figure>



<p class="wp-block-paragraph">The hist plot shows that the covariance between the crypto pairs is mostly below 0.005.</p>



<p class="wp-block-paragraph">Finally, it is time to map cryptocurrencies on a 2D plane. To do this, we first define the cryptocurrencies using their relative position data with a scatterplot. We set the color of the points based on their clusters so that points in the same cluster are colored the same. Subsequently, we connect the points to the data from the edge model. The covariance between the crypto pairs determines the strength of their connections.</p>



<p class="wp-block-paragraph">We also define the color of the connections as follows. </p>



<ul class="wp-block-list">
<li>The map only shows connections with a covariance greater than 0.002.</li>



<li>Connections with a covariance greater than 0.05 are colored red. </li>



<li>Otherwise, connections between points within a cluster are shown in the cluster&#8217;s color. </li>



<li>We color connections in grey that are between points of different clusters.</li>
</ul>



<p class="wp-block-paragraph">Last but not least, we add the labels of the cryptocurrencies.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Visualization
plt.figure(1, facecolor='w', figsize=(20, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])

# Plot the nodes using the coordinates of our embedding
sc = sns.scatterplot(
    data=data,
    x=&quot;embedding_x&quot;,
    y=&quot;embedding_y&quot;,
    zorder=1,
    s=350 * d ** 2,
    c=labels,
    cmap=plt.cm.nipy_spectral,
    alpha=.9,
    #palette=&quot;muted&quot;,
)

# Plot the covariance edges between the nodes (scatter points)
line_strength = 3.2
    
for index, ((x, y), start, stop) in enumerate(segments):     
    norm_partial_correlation = normalized[index]
    if list(data.iloc[[start]]['labels'])[0] == list(data.iloc[[stop]]['labels'])[0]:
        if norm_partial_correlation &gt; 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = plt.cm.nipy_spectral(list(data.iloc[[start]]['labels'])[0] / float(n_labels)); linestyle='solid'
    else:
        if norm_partial_correlation &gt; 0.5:
            color = 'red'; linestyle='solid'
        else:
            color = 'grey'; linestyle='dashed'
    # Plot the edges
    # if x and y larger than 0
    if x[0] &gt; 0 and y[0] &gt; 0:
        plt.plot(x, y, alpha=.4, zorder=0, linewidth=normalized[index]*line_strength, color=color, linestyle=linestyle)

    
# Labels the nodes and position the labels to avoid overlap with other labels
for name, label, (x, y) in zip(names, labels, embedding.T):
    color = plt.cm.nipy_spectral(label / float(n_labels))
    ax.annotate(
        name,
        xy=(x, y),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
        fontsize=10,
        color='black',
        bbox=dict(facecolor='w', edgecolor=&quot;w&quot;, alpha=.0),
     )</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8229" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/cryptocurrency-map/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png" data-orig-size="1473,590" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="cryptocurrency-map" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png" src="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map-1024x410.png" alt="Visualizing the Crypto Market Structure - Clusters of Cryptocurrencies determined by Affinity Propagation, Connections between cryptocurrencies defined by partial covariance in daily price fluctuations" class="wp-image-8229" width="1177" height="471" srcset="https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/cryptocurrency-map.png 1473w" sizes="(max-width: 1177px) 100vw, 1177px" /></figure>



<p class="wp-block-paragraph">Note that you will likely see a different map when you run the code on your machine. Differences result from changes in market prices and covariance that lead to other graph structures. </p>



<p class="wp-block-paragraph">Let&#8217;s see what the crypto market map tells us.</p>



<h4 class="wp-block-heading">Interpreting the Cryptomarket Map</h4>



<p class="wp-block-paragraph">The 2D crypto market map tells us several things:</p>



<ul class="wp-block-list">
<li>Most cryptos fall into the light green and dark green clusters corresponding to different types of crypto (Decentralized Finance Coins, NFT/Metaverse Coins).</li>



<li>There is a significant covariance between large-cap players in the crypto space, such as Cardano and Loopring and Ethereum and Bitcoin, which is plausible considering recent price movements. Some results are surprising, for example, the partial correlation between NEO and Ethereum Classic. </li>



<li>Some clusters are isolated and contain only a single member, for example, Tether, Komodo, AC Milan token, Wave token, and Dogecoin). The reason is that the prices of these coins/tokens have developed independently of the market.<ul><li> Tether is a stablecoin that does not change in price. It, therefore, strongly differs from the other cryptocurrencies on our map. </li></ul>
<ul class="wp-block-list">
<li>Komodo has been trading sideways without following the general market trend. </li>



<li>And the MCM token is a soccer token that has recently outperformed the market.</li>
</ul>
</li>



<li>Soccer tokens are colored in dark blue. These tokens&#8217; prices correlate with how the soccer clubs performed during the current season. It, therefore, makes perfect sense that these tokens are grouped into a cluster. An exception is the AC Milan token, which recently performed better than the other soccer tokens.</li>
</ul>



<h3 class="wp-block-heading">Step #6 Creating a 3D Representation</h3>



<p class="wp-block-paragraph">Instead of a 2D representation of the data points, we can also use a 3D node positioning model. For this purpose, the node positioning model distributes the affinity values over three dimensions.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Find the best position of the cryptos on a 3D plane
node_position_model = manifold.LocallyLinearEmbedding(n_components=3, eigen_solver='dense', n_neighbors=20)
embedding = node_position_model.fit_transform(X.T).T
# The result are x and y coordindates for all cryptocurrencies
pd.DataFrame(embedding)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) &gt; 0.02)
data = pd.DataFrame.from_dict({&quot;embedding_x&quot;:embedding[0],&quot;embedding_y&quot;:embedding[1],&quot;embedding_z&quot;:embedding[1]})
data[&quot;labels&quot;] = labels
data[&quot;names&quot;] = names
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(projection='3d')
xs = data[&quot;embedding_x&quot;]
ys = data[&quot;embedding_y&quot;]
zs = data[&quot;embedding_z&quot;]
sc = ax.scatter(xs, ys, zs, c=labels, s=100)
    
for i in range(len(data)):
    x = xs[i]
    y = ys[i]
    z = zs[i]
    label = data[&quot;names&quot;][i]
    ax.text(x, y, z, label)
    
plt.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 1), loc=2)
plt.show()</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8243" data-permalink="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/3d-cluster-representation-affinity-propagation/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png" data-orig-size="1210,1101" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="3d-cluster-representation-affinity-propagation" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png" src="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation-1024x932.png" alt="3d representation of the cryptocurrency market structure, affinity propagation relataly" class="wp-image-8243" width="891" height="811" srcset="https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 1024w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 300w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 768w, https://www.relataly.com/wp-content/uploads/2022/05/3d-cluster-representation-affinity-propagation.png 1210w" sizes="(max-width: 891px) 100vw, 891px" /></figure>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Affinity propagation is a powerful technique for clustering items when the optimal number of clusters is unknown. In this article, we&#8217;ve demonstrated how to apply affinity propagation to analyze the cryptocurrency market and identify groups of assets based on similar price fluctuations.</p>



<p class="wp-block-paragraph">In our example, we identified 13 groups of cryptocurrencies without specifying the number of clusters in advance. We also visualized the market structure on a 2D and 3D map using a node distribution technique. This approach can be extended to analyze and cluster stock markets, highlighting complex price patterns among multiple financial assets.</p>



<p class="wp-block-paragraph">Once you&#8217;ve identified clusters, you can dive deeper into individual groups. Sometimes, outliers that temporarily break out of their usual pattern indicate interesting investment opportunities. These outliers can eventually return to the price pattern of their group, or they may represent forerunners of their group, indicating broader market movements.</p>



<p class="wp-block-paragraph">By using affinity propagation, we can visualize financial assets in a new and exciting way. If you have any questions or comments about this approach, please let me know.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">This article modifies some of the code from <a href="https://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py" target="_blank" rel="noreferrer noopener">Scikit-learn and adapts it from the stock market</a> to cryptocurrencies.</p>



<ol class="wp-block-list">
<li><a href="https://amzn.to/3yIQdWi" target="_blank" rel="noreferrer noopener">Jansen (2020) Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python</a></li>



<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>



<li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li>



<li>Images are created using Midjourney, an AI that creates images from text.</li>
</ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/">Unveiling Hidden Patterns in the Cryptocurrency Market with Affinity Propagation and Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">8114</post-id>	</item>
		<item>
		<title>Cluster Analysis with k-Means in Python</title>
		<link>https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/</link>
					<comments>https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 27 Jun 2021 18:21:01 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Covariance]]></category>
		<category><![CDATA[K-Means]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Synthetic Data]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[Beginner Tutorials]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=5070</guid>

					<description><![CDATA[<p>Embark on a journey into the world of unsupervised machine learning with this beginner-friendly Python tutorial focusing on K-Means clustering, a powerful technique used to group similar data points into distinct clusters. This invaluable tool helps us make sense of complex datasets, finding hidden patterns and associations without the need for a predetermined target variable. ... <a title="Cluster Analysis with k-Means in Python" class="read-more" href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/" aria-label="Read more about Cluster Analysis with k-Means in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/">Cluster Analysis with k-Means in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Embark on a journey into the world of unsupervised machine learning with this beginner-friendly Python tutorial focusing on K-Means clustering, a powerful technique used to group similar data points into distinct clusters. This invaluable tool helps us make sense of complex datasets, finding hidden patterns and associations without the need for a predetermined target variable. This comes in handy, especially when dealing with data whose similarities and differences aren&#8217;t immediately apparent.</p>



<p class="wp-block-paragraph">Divided into two insightful sections, this blog post first delves into the theoretical foundation of the K-Means clustering algorithm. We&#8217;ll explore its real-world applications, its strengths, and its weaknesses, providing a comprehensive overview of what makes K-Means an essential tool in any data scientist&#8217;s toolkit.</p>



<p class="wp-block-paragraph">In the second part of the blog, we switch gears and dive into a hands-on Python tutorial. Here, we&#8217;ll walk you through a practical example of K-Means clustering, using it to uncover three distinct spherical clusters within a synthetic dataset. To round up, we&#8217;ll put our model to the test, using it to predict clusters within a test dataset, and then we&#8217;ll visualize the results for easy understanding.</p>



<p class="wp-block-paragraph">Whether you&#8217;re a seasoned data scientist or a beginner looking to dip your toes into the field, this blog post offers a simple and accessible introduction to cluster engineering with K-Means in Python. Follow along, and uncover the hidden potential of your data today!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="508" height="512" data-attachment-id="12672" data-permalink="https://www.relataly.com/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png" data-orig-size="508,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="rubber ducks machine learning python tutorial relataly midjourney-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png" alt="Patterns are everywhere, and machine learning can help to expose and understand them better. Image created with Midjourney. " class="wp-image-12672" srcset="https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png 298w, https://www.relataly.com/wp-content/uploads/2023/03/rubber-ducks-machine-learning-python-tutorial-relataly-midjourney-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Patterns are everywhere, and machine learning can help to expose and understand them better. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>. </figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-findings-clusters-in-multivariate-data-with-k-means">Findings Clusters in Multivariate Data with k-Means</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">K-Means clustering is a prominent unsupervised machine learning algorithm utilized to partition datasets into a predefined number (&#8216;k&#8217;) of distinct clusters, each brimming with similar data points. To begin, the algorithm identifies &#8216;k&#8217; random data points which serve as the initial centroids &#8211; the central points representing each cluster. It then proceeds in an iterative fashion, continuously reassigning each data point to the centroid nearest to it, followed by updating each centroid to the average of the data points assigned to its cluster. This process repeats until the centroids no longer change positions or until a set maximum number of iterations is reached.</p>



<p class="wp-block-paragraph">The underlying objective of the K-Means algorithm is to minimize the within-cluster sum of squares (WCSS), a metric indicating the cumulative distance between each data point within a cluster and its corresponding centroid. By reducing the WCSS, K-Means aims to form the most compact and clearly separable clusters.</p>



<p class="wp-block-paragraph">The K-Means algorithm has garnered substantial popularity in the field of data science for its speed and simplicity in implementation. However, it isn&#8217;t without its limitations. It requires the number of clusters to be specified beforehand and operates under the assumption that all clusters are spherical and uniformly sized. In the following sections, we&#8217;ll delve into the intricacies of this powerful yet straightforward algorithm and discuss its use in real-world applications.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5079" data-permalink="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/image-83-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png" data-orig-size="567,452" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-83" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png" alt="Three clusters in a dataset that were separated by the k-Means algorithms" class="wp-image-5079" width="355" height="283" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-83.png 567w, https://www.relataly.com/wp-content/uploads/2021/06/image-83.png 300w" sizes="(max-width: 355px) 100vw, 355px" /><figcaption class="wp-element-caption">Three clusters in a dataset that were separated by the k-Means algorithms</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading" id="h-understanding-the-mechanics-of-k-means-clustering-algorithm">Understanding the Mechanics of K-Means Clustering Algorithm</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">At its core, K-Means clustering is a dynamic algorithm designed to simplify complex data patterns. This machine learning model is a champion of unsupervised learning, adept at grouping similar data points into a predetermined number (&#8216;k&#8217;) of unique clusters.</p>



<p class="wp-block-paragraph">Starting the process, the K-Means algorithm selects &#8216;k&#8217; random data points, which act as the initial centroids or the central figures of the clusters. The algorithm then goes through a series of iterative steps, continually allocating each data point to its closest centroid, and recalibrating the centroid to be the mean of all data points that are assigned to its cluster. This cycle repeats until we achieve stable centroids, i.e., they stop moving, or until a pre-set maximum number of iterations is completed.</p>



<p class="wp-block-paragraph">The fundamental goal of K-Means lies in minimizing the within-cluster sum of squares (WCSS). This term refers to the sum of the distances of each data point in a cluster to its centroid. By lowering WCSS, K-Means strives to construct compact clusters that are distinctly separate from each other.</p>



<p class="wp-block-paragraph">In the illustration below, we assume a cluster with three cluster centers. K-Means carries out several steps to partition the data.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<ol class="wp-block-list">
<li>Initially, the algorithm k chooses random starting positions for the centroids. Alternatively, we can also position the centroids manually.</li>



<li>Then the algorithm calculates the distance between the data points and the three centroids. The data points are assigned to the closest centroid/cluster where the cluster variance increases the least.</li>



<li>Next, the algorithm calculates the Euclidean distance between the centroids and their assigned data points. The result is linear decision boundaries that separate the clusters but are not yet optimal.</li>



<li>From then on, the algorithm optimizes the centroids&#8217; positions to lower the resulting clusters&#8217; variance. Then, the previous steps are repeated: averaging, assigning the data points to clusters, and shifting the centroids.</li>
</ol>



<p class="wp-block-paragraph">The process ends when the positions of the centroids do not change anymore.</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5099" data-permalink="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/image-88-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-88.png" data-orig-size="1183,1093" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-88" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-88.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-88-1024x946.png" alt="K-means partitions a set of data by iteratively optimizing the centroids and the decision boundaries that result from them. " class="wp-image-5099" width="787" height="726" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-88.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-88.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-88.png 1183w" sizes="(max-width: 787px) 100vw, 787px" /><figcaption class="wp-element-caption">k-Means iteratively optimizes the decision boundaries between clusters</figcaption></figure>
</div>
</div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-how-many-clusters">How many Clusters?</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">A particular challenge is that k-means requires estimating the number of clusters k. When tackling new clustering problems, we usually don&#8217;t know the optimal number of clusters. Unless the data is not too complex, we can often estimate the number of centers by looking at one or more scatter plots. However, this approach only works when the data has a few dimensions. For complex data with many dimensions, it is common to experiment with varying numbers of k to find an appropriate size for the problem. We can automate this process using hyperparameter tuning techniques such as <a href="https://www.relataly.com/hyperparameter-tuning-with-grid-search/2261/" target="_blank" rel="noreferrer noopener">grid search</a> or <a href="https://www.relataly.com/using-random-search-to-tune-the-hyperparameters-of-a-random-decision-forest-with-python/6875/" target="_blank" rel="noreferrer noopener">random search</a>. The idea is to try out different cluster sizes and identify the size that best differentiates between clusters.</p>



<h3 class="wp-block-heading" id="h-pros-and-cons-of-k-means-clustering">Pros and Cons of k-Means Clustering</h3>



<p class="wp-block-paragraph">Although K-Means is esteemed for its swift operation and uncomplicated implementation, it comes with a set of constraints. We need to know the strengths and weaknesses of clustering techniques such as k-Means. In general, clustering can reveal structures and relationships in data supervised machine learning methods like classification likely would not uncover. In particular, when we suspect different subgroups in the data that differ in their behavior, clustering can help discover what makes these groups unique.</p>



<p class="wp-block-paragraph">K-Means, in particular, possesses a unique ability to detect and segregate spherical-shaped clusters efficiently. However, its performance may falter when faced with clusters embodying more intricate structures like half-moons or circles, often struggling to differentiate them accurately.</p>



<p class="wp-block-paragraph">Another potential limitation of K-Means lies in its prerequisite of specifying the number of clusters in advance. This requirement can prove challenging when the true number of clusters within a dataset is unknown or ambiguous. In such scenarios, alternative clustering techniques, such as <a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">affinity propagation</a> or <a href="https://wp.me/pbUnJO-2WP">hi</a><a href="https://wp.me/pbUnJO-2WP" target="_blank" rel="noreferrer noopener">erarchical clustering</a>, might be more suitable, as they possess the ability to automatically determine the optimal number of clusters.</p>



<h3 class="wp-block-heading" id="h-applications-of-clustering">Applications of Clustering</h3>



<p class="wp-block-paragraph">The k-means algorithm is used in a variety of applications, including the following:</p>



<ul class="wp-block-list">
<li>A typical use case for clustering is in <strong>marketing and market segmentation</strong>. Here clustering is used to identify meaningful segments of similar customers. The similarity can be based on demographic data (age, gender, etc.) or customer behavior (for example, the time and amount of a purchase).</li>



<li><strong>Medical research</strong> uses clustering to divide patient groups into different subgroups, for example, to assess stroke risk. After clustering, the next step is to develop separate prediction models for the subgroups to estimate the risk more accurately.</li>



<li>An application in the financial sector is outlier detection in <strong>fraud detection</strong>. Banks and credit card companies use clustering to detect unusual transactions and flag them for verification. </li>



<li><strong>Spam filtering</strong>: The input data are attributes of emails (text length, words contained, etc.) and help separate spam from non-spam emails.</li>
</ul>



<h2 class="wp-block-heading" id="h-implementing-a-k-means-clustering-model-in-python">Implementing a K-Means Clustering Model in Python</h2>



<p class="wp-block-paragraph">In the following, we run a cluster analysis on <a href="https://www.relataly.com/category/apis/synthetic-data/" target="_blank" rel="noreferrer noopener">synthetic data</a> using Python and scikit-learn. We aim to train a K-Means cluster model in Python that distinguishes three clusters in the data. Given that the data is synthetic, we&#8217;re already privy to which cluster each data point pertains to. This foreknowledge enables us to evaluate the performance of our model post-training, gauging how effectively it can differentiate between the three predefined clusters.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_3c55f5-f1"><a class="kb-button kt-button button kb-btn_79ce02-f1 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/03%20Clustering/041%20Simple%20Clustering%20using%20K-means%20with%20Python.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_172639-a2 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. To set it up, you can follow the steps in&nbsp;<a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a>.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">This article uses the k-means clustering algorithm from the Python library <a rel="noreferrer noopener" href="https://scikit-learn.org/stable/" target="_blank">Scikit-learn</a>. We also use <a data-type="URL" data-id="https://seaborn.pydata.org/" href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn </a>for visualization.</p>



<p class="wp-block-paragraph">You can install these libraries using console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-step-1-generate-synthetic-data">Step #1: Generate Synthetic Data</h3>



<p class="wp-block-paragraph">We start with the generation of synthetic data. For this purpose, we use the make_blobs function from the scikit-learn library. The function generates random clusters in two dimensions spherically arranged around a center. In addition, the data contains the respective cluster to which the data points belong. We use a scatterplot to visualize the data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split as train_test_split
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
features, labels = make_blobs(
    n_samples=400,
    centers=3,
    cluster_std=2.75,
    random_state=42
)

# Visualize the data in scatterplots
def scatter_plots(df, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    

    ax1 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'labels', palette=palette)
    plt.title('Clusters')
    
    ax2 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y')
    plt.title('How the model sees the data during training')

palette = {1:&quot;tab:cyan&quot;,0:&quot;tab:orange&quot;, 2:&quot;tab:purple&quot;}
df = pd.DataFrame(features, columns=['x', 'y'])
df['labels'] = labels
scatter_plots(df, palette)</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="433" data-attachment-id="5061" data-permalink="https://www.relataly.com/image-77-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-77.png" data-orig-size="1172,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-77" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-77.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-77-1024x433.png" alt="Clusters vs how the model sees the data during training" class="wp-image-5061" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-77.png 1172w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="h-step-2-preprocessing">Step #2: Preprocessing</h3>



<p class="wp-block-paragraph">There are some general things to keep in mind when preparing data for use with K-Means clustering:</p>



<ul class="wp-block-list">
<li><strong>Missing data and outliers</strong>: if we have missing entries in our data, we need to handle these, for example, by removing the records or filling the missing values with a mean or median. In addition, K-means is sensitive to outliers. Therefore, make sure that you eliminate outliers from the training data. </li>



<li><strong>Normalization</strong>: K-Means can only deal with integer values. So, either we map the categorical variables to integer values or use one-hot-encoding to create separate binary variables. </li>



<li><strong>Dimensionality reduction</strong>: In general, having too many variables in a dataset can negatively affect the performance of clustering algorithms. A good practice is to keep the number of variables below 30, for example, by using techniques for dimensionality reduction such as Principal-Component-Analysis.</li>



<li><strong>Scaling</strong>: Important to note is also that K-means require scaling the data.</li>
</ul>



<p class="wp-block-paragraph">Our synthetic dataset is free of outliers or missing values. Therefore, we only need to scale the data. In addition, we separate the class labels of the clusters from the training set and split the data into a train- and a test dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

X = scaled_features #Training data
y = labels #Prediction label

# Split the data into x_train and y_train data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)</pre></div>



<h3 class="wp-block-heading" id="h-step-3-train-a-k-means-clustering-model">Step #3: Train a k-Means Clustering Model</h3>



<p class="wp-block-paragraph">Once we have prepared the data, we can begin the cluster analysis by training a K-means model. Our model uses the k-means algorithm from Python <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html?highlight=kmeans#sklearn.cluster.KMeans" target="_blank" rel="noreferrer noopener">scikit-learn library</a>. We have various options to configure the clustering process: </p>



<ul class="wp-block-list">
<li><strong>n_clusters</strong><em>: </em>The number of clusters we expect in the data. </li>



<li><strong>n_init</strong><em>: </em>The number of iterations k-means will run with different initial centroids. The algorithm returns the best model. </li>



<li><strong>max_iter</strong><em>: The max number of iterations for a single run</em></li>
</ul>



<p class="wp-block-paragraph">We expect three clusters and configure the algorithm to run ten iterations.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a k-means model with n clusters
kmeans = KMeans(
    init=&quot;random&quot;,
    n_clusters=3,
    n_init=10,
    max_iter=300,
    random_state=42
)

# fit the model
kmeans.fit(X_train)
print(f'Converged after {kmeans.n_iter_} iterations')</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Converged after 4 iterations</pre></div>



<p class="wp-block-paragraph">Our model has already converged after four iterations. Next, we will look at the results.</p>



<h3 class="wp-block-heading" id="h-step-4-make-and-visualize-predictions">Step #4: Make and Visualize Predictions</h3>



<p class="wp-block-paragraph">Next, we&#8217;ll be delving into the practical aspect of our Python tutorial by analyzing the performance of our trained K-Means model using synthetic data.</p>



<p class="wp-block-paragraph">In the following Python code, we:</p>



<ul class="wp-block-list">
<li>Extract the cluster centers from the trained K-Means model and unscale them.</li>



<li>Add the predicted and actual labels to our unscaled training data.</li>



<li>Define a function, <code>scatter_plots()</code>, that creates two scatter plots: one for the predicted labels and another for the actual labels.</li>



<li>Call this function, passing the training data, cluster centers, and a color palette as arguments.</li>
</ul>



<p class="wp-block-paragraph">Please note:</p>



<ul class="wp-block-list">
<li>We use a dictionary as a color palette to differentiate between clusters.</li>



<li>The colors between the two plots may not match due to K-Means assigning numbers to clusters without prior knowledge of initial labels.</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Get the cluster centers from the trained K-means model
cluster_center = scaler.inverse_transform(kmeans.cluster_centers_)
df_cluster_centers = pd.DataFrame(cluster_center, columns=['x', 'y'])

# Unscale the predictions
X_train_unscaled = scaler.inverse_transform(X_train)
df_train = pd.DataFrame(X_train_unscaled, columns=['x', 'y'])
df_train['pred_label'] = kmeans.labels_
df_train['true_label'] = y_train

def scatter_plots(df, cc, palette):
    fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(20, 8))
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    
    # Print the predictions    
    ax2 = plt.subplot(1,2,1)
    sns.scatterplot(ax = ax2, data=df, x='x', y='y', hue='pred_label', palette=palette)
    sns.scatterplot(ax = ax2, data=cc, x='x', y='y', color='r', marker=&quot;X&quot;)
    plt.title('Predicted Labels')

    # Print the actual values
    ax1 = plt.subplot(1,2,2)
    sns.scatterplot(ax = ax1, data=df, x='x', y='y', hue= 'true_label', palette=palette)
    sns.scatterplot(ax = ax1, data=cc, x='x', y='y', color='r', marker=&quot;X&quot;)
    plt.title('Actual Labels')


# The colors between the two plots may not match.
# This is because K-means does not know the initial labels and assigns numbers to clusters 
palette = {1:&quot;tab:cyan&quot;,0:&quot;tab:orange&quot;, 2:&quot;tab:purple&quot;}
scatter_plots(df_train, df_cluster_centers, palette)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5062" data-permalink="https://www.relataly.com/image-78-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-78.png" data-orig-size="1172,496" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-78" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-78.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-78-1024x433.png" alt="actual clusters vs predicted cluster of the K-means model" class="wp-image-5062" width="1106" height="468" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-78.png 1172w" sizes="(max-width: 1106px) 100vw, 1106px" /></figure>



<p class="wp-block-paragraph">The scatterplot above shows that K-Means found the three clusters. As a side note, the colors between the two plots do not match because K-means does not know the initial labels and assigns numbers to clusters. </p>



<h3 class="wp-block-heading" id="h-step-5-measuring-model-performance">Step #5: Measuring Model Performance</h3>



<p class="wp-block-paragraph">Next, we measure the performance of our clustering model. K-means is an unsupervised machine learning algorithm, which means that it is used to cluster data points into groups based on their similarity, without the use of labeled training data. However, we can compare the cluster assignments to the labels attached to the data and see if the model can predict them correctly. In this way, we can use traditional classification metrics such as accuracy and f1_score to measure the performance of our clustering model.</p>



<p class="wp-block-paragraph">First, we unify the cluster labels as A, B, and C.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># The predictive model has the labels 0 and 1 reversed. We will correct that first. 
#df_train['pred_test'] = df_train['pred_labels'].map({2:2, 1:3, 0:1})
df_eval = df_train.copy()
df_eval['true_label'] = df_eval['true_label'].map({0:'A', 1:'B', 2:'C'})
df_eval['pred_label'] = df_eval['pred_label'].map({0:'B', 1:'A', 2:'C'})
df_eval.head(10)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	x			y			pred_label	true_label
0	-9.007547	-10.302910	C			C
1	1.009238	7.009681	A			B
2	-6.565501	-6.466780	C			C
3	2.389772	7.727235	B			B
4	-5.422666	-2.915796	C			C
5	-12.024305	-7.846772	C			C
6	-4.006250	9.319323	A			A
7	-6.297788	6.435267	A			A
8	2.169238	3.325947	B			B
9	-5.140506	-4.205585	C			C</pre></div>



<p class="wp-block-paragraph">It is a common practice to create scatterplots on the predictions to visually verify the quality of the clusters and their decision boundaries. The following scatter plot shows the correctly assigned values and where our model was wrong.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Scatterplot on correctly and falsely classified values
df_eval.loc[df_eval['pred_label'] == df_eval['true_label'], 'Correctly classified?'] = 'True' 
df_eval.loc[df_eval['pred_label'] != df_eval['true_label'], 'Correctly classified?'] = 'False' 

plt.rcParams[&quot;figure.figsize&quot;] = (10,8)
sns.scatterplot(data=df_eval, x='x', y='y', color='r', hue='Correctly classified?')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5065" data-permalink="https://www.relataly.com/image-81-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png" data-orig-size="614,480" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-81" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png" alt="A 2-d scatterplot that shows the correctly assigned values and where our  K-means model was wrong." class="wp-image-5065" width="711" height="556" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-81.png 614w, https://www.relataly.com/wp-content/uploads/2021/06/image-81.png 300w" sizes="(max-width: 711px) 100vw, 711px" /></figure>



<p class="wp-block-paragraph">The K-Means model correctly assigned most data points (blue) to their actual cluster. The few misclassified points are located at a decision boundary between two clusters (marked in orange).</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Create a confusion matrix
def evaluate_results(model, y_true, y_pred, class_names):
    tick_marks = [0.5, 1.5, 2.5]
    
    # Print the Confusion Matrix
    fig, ax = plt.subplots(figsize=(10, 6))
    results_log = classification_report(y_true, y_pred, target_names=class_names, output_dict=True)
    results_df_log = pd.DataFrame(results_log).transpose()
    matrix = confusion_matrix(y_true,  y_pred)
    model_score = score(y_pred, y_true, average='macro')
    
    sns.heatmap(pd.DataFrame(matrix), annot=True, fmt=&quot;d&quot;, linewidths=.5, cmap=&quot;YlGnBu&quot;)
    plt.xlabel('Predicted label'); plt.ylabel('Actual label')
    plt.title('Confusion Matrix on the Test Data')
    plt.yticks(tick_marks, class_names); plt.xticks(tick_marks, class_names)
    
    print(results_df_log)


y_true = df_eval['true_label']
y_pred = df_eval['pred_label']
class_names = ['A', 'B', 'C']
evaluate_results(kmeans, y_true, y_pred, class_names)</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="5064" data-permalink="https://www.relataly.com/image-80-2/" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png" data-orig-size="682,588" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-80" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png" alt="confusion matrix on the predictions generated by our k-means model" class="wp-image-5064" width="658" height="568" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-80.png 682w, https://www.relataly.com/wp-content/uploads/2021/06/image-80.png 300w" sizes="(max-width: 658px) 100vw, 658px" /></figure>



<p class="wp-block-paragraph">As we can see, the model has done a good job of grouping the labels into the attached classes.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">algorithm, capable of parsing intricate datasets into discrete, non-overlapping clusters. With this guide, you should have a clear understanding of how this algorithm operates, as well as its unique strengths and potential limitations. Remember, k-Means excels when dealing with spherical clusters but may struggle with clusters of more complex shapes.</p>



<p class="wp-block-paragraph">We walked through how to implement the k-Means algorithm in Python, using it to identify spherical clusters within synthetic data. Moreover, we explored different methods to evaluate and visualize a clustering model&#8217;s performance, providing you with valuable tools to effectively analyze and group your data.</p>



<p class="wp-block-paragraph">Whether it&#8217;s anomaly detection in time-series data or customer segmentation, k-Means can be a powerful tool in your data science arsenal. If you&#8217;re interested in anomaly detection, feel free to explore my recent article on the subject, written with Python enthusiasts in mind. Armed with this knowledge, you&#8217;re ready to embark on your own data exploration journey using k-Means clustering.</p>



<p class="wp-block-paragraph">If you are interested in this topic, check out my recent article on <a href="https://www.relataly.com/multivariate-outlier-detection-using-isolation-forests-in-python-detecting-credit-card-fraud/4233/" target="_blank" rel="noreferrer noopener">anomaly detection with Python</a>. And if you have any questions, please ask them in the comments.</p>



<p class="wp-block-paragraph">Did you know that customer segmentation is an area where real-world data can be prone to bias and unfairness? If you&#8217;re concerned that your models may reflect the same bias, check out our latest article on <a href="https://www.relataly.com/building-fair-machine-machine-learning-models-with-fairlearn/12804/" target="_blank" rel="noreferrer noopener">addressing fairness in machine learning with fairlearn</a>.</p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Clustering</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3Gb5kfj" target="_blank" rel="noreferrer noopener">&#8220;Data Clustering: Algorithms and Applications&#8221; by Charu C. Aggarwal</a>: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.</li>



<li><a href="https://amzn.to/3WmhGXB" target="_blank" rel="noreferrer noopener">&#8220;Data Mining: Practical Machine Learning Tools and Techniques&#8221; by Ian H. Witten and Eibe Frank</a>: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.</li>
</ul>



<div style="display: inline-block;">
<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=0128042915&amp;asins=0128042915&amp;linkId=1e9fe160a76f7255e3eea8e0119ca74f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=B00EYROAQU&amp;asins=B00EYROAQU&amp;linkId=ba1fcb8a59417e729afe33f6eceb2a9f&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<h4 class="wp-block-heading"><strong>Books on Machine Learning</strong></h4>



<ul class="wp-block-list">
<li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning</a></li>



<li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li>
</ul>



<div style="display: inline-block;">

  <iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=3030181162&amp;asins=3030181162&amp;linkId=669e46025028259138fbb5ccec12dfbe&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>

<iframe sandbox="allow-popups allow-scripts allow-modals allow-forms allow-same-origin" style="width:120px;height:240px;" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" src="//ws-eu.amazon-adsystem.com/widgets/q?ServiceVersion=20070822&amp;OneJS=1&amp;Operation=GetAdHtml&amp;MarketPlace=DE&amp;source=ss&amp;ref=as_ss_li_til&amp;ad_type=product_link&amp;tracking_id=flo7up-21&amp;language=de_DE&amp;marketplace=amazon&amp;region=DE&amp;placement=1492032646&amp;asins=1492032646&amp;linkId=2214804dd039e7103577abd08722abac&amp;show_border=true&amp;link_opens_in_new_window=true"></iframe>
</div>
</div></div>
</div>
</div>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<h4 class="wp-block-heading">Related Articles on Clustering</h4>



<ol class="wp-block-list">
<li><a href="https://wp.me/pbUnJO-2WP" target="_blank" rel="noreferrer noopener">Hierarchical clustering for customer segmentation in Python</a>: We present an alternative approach to k-means that can determine the number of the clusters themselves and works well if the distributions of the groups in the data is more complex.</li>



<li><a href="https://www.relataly.com/crypto-market-cluster-analysis-using-affinity-propagation-python/8114/" target="_blank" rel="noreferrer noopener">Clustering crypto markets using affinity propagation in Python</a>: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.</li>
</ol>
<p>The post <a href="https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/">Cluster Analysis with k-Means in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/simple-cluster-analysis-with-k-means-with-python/5070/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5070</post-id>	</item>
	</channel>
</rss>
