<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Recommender Systems Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/category/machine-learning/recommender-systems/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/category/machine-learning/recommender-systems/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:18:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Recommender Systems Archives - relataly.com</title>
	<link>https://www.relataly.com/category/machine-learning/recommender-systems/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</title>
		<link>https://www.relataly.com/content-based-movie-recommender-using-python/4294/</link>
					<comments>https://www.relataly.com/content-based-movie-recommender-using-python/4294/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 25 Jul 2022 11:29:00 +0000</pubDate>
				<category><![CDATA[Content-based Filtering]]></category>
		<category><![CDATA[Correlation]]></category>
		<category><![CDATA[Cosine Similarity]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Recommender Systems]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Scikit-Learn]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4294</guid>

					<description><![CDATA[<p>Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or ... <a title="Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python" class="read-more" href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/" aria-label="Read more about Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/">Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or rated. The recommendations are based on the similarity of the items, represented by similarity scores in a vector matrix. The attributes used to describe an item are called &#8220;content.&#8221; For example, in the case of movie recommendations, content could be the genre, actors, director, year of release, etc. A well-designed content-based recommendation service will suggest movies of the same genre, actors, or keywords. This tutorial will implement a content-based recommendation service for movies using Python and Scikit-learn. </p>



<p class="wp-block-paragraph">The rest of this tutorial proceeds as follows: After a brief introduction to content-based recommenders, we will work with a database that contains several thousands of <a href="https://www.imdb.com/" target="_blank" rel="noreferrer noopener">IMDB </a>movie titles and create a feature model that uses actors, release year, and a short description for each movie. In this tutorial, you will also learn how to deal with some challenges of building a content-based recommender. For example, we will look at how we can engineer features for content-based model words and reduce the dimensionality of our model. Finally, we use our model to generate some sample predictions.</p>



<p class="wp-block-paragraph">Note: Another popular type of recommender system that I have covered in a previous article is <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" target="_blank" rel="noreferrer noopener">collaborative filtering</a>.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img fetchpriority="high" decoding="async" width="508" height="508" data-attachment-id="12673" data-permalink="https://www.relataly.com/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" data-orig-size="508,508" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="signpost decisions recommender system machine learning python tutorial midjourney (2)-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png" alt="Recommendation systems can ease decision-making. Image created with Midjourney." class="wp-image-12673" srcset="https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 508w, https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/signpost-decisions-recommender-system-machine-learning-python-tutorial-midjourney-2-min.png 140w" sizes="(max-width: 508px) 100vw, 508px" /><figcaption class="wp-element-caption">Recommendation systems can ease decision-making. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-what-is-content-based-filtering">What is Content-Based Filtering?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The idea behind content-based recommenders is to generate recommendations based on user&#8217;s preferences and tastes. These preferences revolve around past user choices, for example, the number of times a user has watched a movie, purchased an item, or clicked on a link. </p>



<p class="wp-block-paragraph">Content-based filtering uses domain-specific item features to measure the similarity between items. Given the user preferences, the algorithm will recommend items similar to what the user has consumed or liked before. For movie recommendations, this content can be the genre, actors, release year, director, film length, or keywords used to describe the movies. This approach works particularly well for domains with a lot of textual metadata, such as movies and videos, books, or products.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-kadence-image kb-image_efc597-b6 size-large"><img decoding="async" width="512" height="500" data-attachment-id="12668" data-permalink="https://www.relataly.com/polaroid-film-movie-recommender-machine-learning-python-tutorial-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png" data-orig-size="518,506" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="polaroid film movie recommender machine learning python tutorial-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min-512x500.png" alt="" class="kb-img wp-image-12668" srcset="https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/polaroid-film-movie-recommender-machine-learning-python-tutorial-min.png 518w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption>Content-based movie recommendations will suggest more of the same, for example, actors, genres, stories, and directors.<br></figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading">Basic Steps to Building a Content-based Recommender System</h3>



<p class="wp-block-paragraph">The approach to building a content-based recommender involves four essential steps:</p>



<ol class="wp-block-list">
<li>The first step is to create a so-called &#8216;bag of words&#8217; model from the input data, which is a list of words used to characterize the items. This step involves selecting useful content for describing and differentiating the items. The more precise the information, the better the recommendations will be. </li>



<li>The next step is to turn the bag (of words) into a feature vector. Different algorithms can be used for this step, for example, the Tfdif vectorizer or the count vectorizer. The result is a vector matrix with items as records and features as columns. This step often also includes applying techniques for dimensionality reduction. </li>



<li>The idea of content-based recommendations is based on measuring item similarity. Similarity scores are assigned through pairwise comparison. Here again, we can choose between different measures, e.g., the dot product or cosine similarity. </li>



<li>Once you have the similarity scores, you can return the most similar items by sorting the data by similarity scores. Given user preferences (single or multiple items a user consumed or liked), the algorithm will then recommend the most similar items. </li>
</ol>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8937" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-11-5/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png" data-orig-size="1911,1227" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-11" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png" src="https://www.relataly.com/wp-content/uploads/2022/07/image-11-1024x657.png" alt="" class="wp-image-8937" width="1169" height="751" srcset="https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1536w, https://www.relataly.com/wp-content/uploads/2022/07/image-11.png 1911w" sizes="(max-width: 1169px) 100vw, 1169px" /><figcaption class="wp-element-caption">Approach to Building a Content-based Recommender System</figcaption></figure>



<h3 class="wp-block-heading">Similarity Scoring</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The quality of the content-based recommendations is significantly influenced by how well the algorithm succeeds in measuring the similarity of the items. There are different techniques to calculate similarity, including Cosine Similarity, Pearson Similarity, Dot Product, and Euclidian Distance. They have in common that they use numerical characteristics of the text to calculate the distance between text vectors in an n-dimensional vector space.</p>



<p class="wp-block-paragraph">It is worth denoting that these techniques can only measure word-level similarity. This means the algorithms compare the word of the item for word without considering the semantic meaning of the sentences. In some instances, this can lead to errors. For example, how similar are <em>&#8220;now that they were sitting on a bank, he noticed she stole his heart, and he was in love&#8221;</em> and <em>&#8220;They are gangsters who love to steal from a large bank&#8221;</em>? By just looking at the words, one may appear similar because the words have a good overlap.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading">Pros and Cons of Content-based Filtering</h3>



<p class="wp-block-paragraph">Like most machine learning algorithms, content-based recommenders have their strength and weaknesses.</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph"><strong>Advantages</strong></p>



<ul class="wp-block-list">
<li>Content-based filtering is good at capturing a user&#8217;s specific interests and will recommend more of the same (for example, genre, actors, directors, etc.). It will also recommend niche items if they match the user preferences, even if these items draw little attention.</li>



<li>Another advantage is that the model can generate recommendations for a specific user without the knowledge of other users. This is particularly helpful if you want to generate predictions for many users.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph"><strong>Disadvantages</strong></p>



<ul class="wp-block-list">
<li>On the other hand, there are also a couple of downsides. The feature representation of the items has to be done manually to a certain extent, and the prediction quality strongly depends on whether items are described in detail. Therefore, content-based filtering requires a lot of expertise.</li>



<li>Since recommendations are based on the user&#8217;s previous interests. However, the recommendations are unlikely to go beyond that and expand to areas (e.g., genres) that are still unknown to the user.  Content-based models thus tend to develop some tunnel vision, so that the model recommends more and more of the same.</li>
</ul>
</div>
</div>



<div style="height:54px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-implementing-a-content-based-movie-recommender-in-python">Implementing a Content-based Movie Recommender in Python</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In the following, we will implement a content-based movie recommender using Python and Scikit-learn. We will carry out all steps necessary to create a content-based recommender. The data comes from an IMDB dataset containing more than 40k films between 1996 and 2018. Based on the data, we define the features we want to use for recommending the movies. These features include the genre, director, main actors, plot keywords, or other metadata associated with the movies. Then we preprocess the data to extract these features and create a feature matrix. The feature matrix becomes the foundation for a similarity matrix that measures the similarity between the items based on their feature vectors. Finally, we use the similarity matrix to generate recommendations for a given item. </p>



<p class="wp-block-paragraph">By the end of this Python tutorial, you will have learned how to implement a content-based recommendation system for movies using Python and Scikit-learn. This knowledge can be applied to other types of recommendations, such as articles, products, or songs.</p>



<p class="wp-block-paragraph">The code is available on the GitHub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_c5cd86-dc"><a class="kb-button kt-button button kb-btn_a1ac8b-37 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/05%20Recommender%20Systems/032%20Movie%20Recommender%20using%20Content-based%20Filtering.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_f7ca5b-85 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2023/01/DALL·E-2023-01-12-19.26.12-A-plush-robot-watching-a-movie-on-tv-min.png" alt="This Robot doesn't know what to watch on tv. Let's build a recommender system for him! Image generated using DALL-E 2 by OpenAI." class="wp-image-11993"/><figcaption class="wp-element-caption">This Robot doesn&#8217;t know what to watch on tv. Let&#8217;s build a recommender system for him! Image generated using <a href="https://openai.com/dall-e-2/" target="_blank" rel="noreferrer noopener">DALL-E 2 by OpenAI</a>.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before you start with the coding part, ensure you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. Follow <a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a> to set it up.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn </a>for visualization and the natural language processing library <a href="https://www.nltk.org/" target="_blank" rel="noreferrer noopener">nltk</a>. </p>



<p class="wp-block-paragraph">You can install these packages by using one of the following commands: </p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-about-the-imdb-movies-dataset">About the IMDB Movies Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">We will train our movie recommender on a popular Movies Dataset (you can download it from <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">grouplens.org</a>). The MovieLens recommendation service collected the Dataset from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.</p>



<p class="wp-block-paragraph">The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. The Dataset contains the following files (Source of the data description: <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle.com</a>):</p>



<ul class="wp-block-list">
<li><strong>movies_metadata.csv:</strong>&nbsp;The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.
<ul class="wp-block-list">
<li></li>
</ul>
</li>



<li><strong>ratings_small.csv:</strong>&nbsp;The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a 5-star movie rating with half-star increments (0.5 &#8211; 5.0 stars).</li>



<li><strong>keywords.csv:</strong>&nbsp;Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.</li>



<li><strong>credits.csv:</strong>&nbsp;Consists of Cast and Crew Information for all our films. Available in the form of a stringified JSON Object.</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7128" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/mdb-movie-database/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" data-orig-size="1024,537" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="recomender systems collaborative filtering imdb movies" data-image-description="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-image-caption="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" src="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" alt="MDB Movie Database Recommender Systems Collaborative Filtering " class="wp-image-7128" width="366" height="192" srcset="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 768w" sizes="(max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption">IMDB Movie Database </figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph">Several other files are included that we won&#8217;t use, incl. ratings_small, links_small, and links.</p>



<p class="wp-block-paragraph">You can download it <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">here</a> or from <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle</a>.</p>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p class="wp-block-paragraph">Our goal is to create a content-based recommender system for movie recommendations. In this case, the content will be meta information on movies, such as genre, actors, the description.</p>



<p class="wp-block-paragraph">We begin by making imports and loading the data from three files:</p>



<ul class="wp-block-list">
<li>movies_metadata.csv</li>



<li>credits.csv</li>



<li>keywords.csv</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords

# the IMDB movies data is available on Kaggle.com
# https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

# in case you have placed the files outside of your working directory, you need to specify the path
path = 'data/movie_recommendations/' 

# load the movie metadata
df_meta=pd.read_csv(path + 'movies_metadata.csv', low_memory=False, encoding='UTF-8') 

# some records have invalid ids, which is why we remove them
df_meta = df_meta.drop([19730, 29503, 35587])

# convert the id to type int and set id as index
df_meta = df_meta.set_index(df_meta['id'].str.strip().replace(',','').astype(int))
pd.set_option('display.max_colwidth', 20)
df_meta.head(2)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		adult	belongs_to_collection			budget		genres				homepage			id		imdb_id		original_language	original_title	overview			...	release_date	revenue		runtime	spoken_languages	status		tagline	title	video	vote_average	vote_count
id																					
862		False	{'id': 10194, 'n...				30000000	[{'id': 16, 'nam...	http://toystory....	862		tt0114709	en					Toy Story		Led by Woody, An...	...	1995-10-30		373554033.0	81.0	[{'iso_639_1': '...	Released	NaN	Toy Story	False	7.7	5415.0
8844	False	NaN								65000000	[{'id': 12, 'nam...	NaN					8844	tt0113497	en				Jumanji				When siblings Ju...	...	1995-12-15		262797249.0	104.0	[{'iso_639_1': '...	Released	Roll the dice an...	Jumanji	False	6.9	2413.0</pre></div>



<p class="wp-block-paragraph">After we have loaded credits and keywords, we will combine the data into a single dataframe. Now we have various input fields available. However, we will only use keywords, cast, year of release, genres, and overview. If you like, you can enhance the data with additional inputs, for example, budget, running time, or film language. </p>



<p class="wp-block-paragraph">Once we have gathered our data in a single dataframe, we print out the first rows to gain an overview of the data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># load the movie credits
df_credits = pd.read_csv(path + 'credits.csv', encoding='UTF-8')
df_credits = df_credits.set_index('id')

# load the movie keywords
df_keywords=pd.read_csv(path + 'keywords.csv', low_memory=False, encoding='UTF-8') 
df_keywords = df_keywords.set_index('id')

# merge everything into a single dataframe 
df_k_c = df_keywords.merge(df_credits, left_index=True, right_on='id')
df = df_k_c.merge(df_meta[['release_date','genres','overview','title']], left_index=True, right_on='id')
df.head(3)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		keywords			cast				crew				release_date	genres				overview			title
id							
862		[{'id': 931, 'na...	[{'cast_id': 14,...	[{'credit_id': '...	1995-10-30		[{'id': 16, 'nam...	Led by Woody, An...	Toy Story
8844	[{'id': 10090, '...	[{'cast_id': 1, ...	[{'credit_id': '...	1995-12-15		[{'id': 12, 'nam...	When siblings Ju...	Jumanji
15602	[{'id': 1495, 'n...	[{'cast_id': 2, ...	[{'credit_id': '...	1995-12-22		[{'id': 10749, '...	A family wedding...	Grumpier Old Men</pre></div>



<div style="height:15px" aria-hidden="true" class="wp-block-spacer"></div>



<p class="wp-block-paragraph">We can see cast, crew, and genres have a dictionary-like structure. To create a cosine similarity matrix, we need to extract the keywords from these columns and gather them in a single column. This is what we will do in the next step. </p>



<h3 class="wp-block-heading" id="h-step-2-feature-engineering-and-data-cleaning">Step #2: Feature Engineering and Data Cleaning</h3>



<p class="wp-block-paragraph">A problem with modeling text is that machine learning algorithms have difficulty processing text directly. An essential step in creating content-based recommenders is bringing the text into a machine-readable form. This is what we call feature engineering. </p>



<h4 class="wp-block-heading">2.1 Creating a Bag-of-Words Model</h4>



<p class="wp-block-paragraph">We begin with feature engineering and creating the bag of words. As mentioned, a bag of words is a list of words relevant to describe items in a dataset, such as films, and differentiate them. Creating a bag of words removes stopwords but preserves multiplicity so that words can occur multiple times in the concatenated text. Later, each word can be used as a feature in calculating cosine similarities. </p>



<p class="wp-block-paragraph">The input for a bag of words does not necessarily come from a single input column. We will use keywords, genres, cast, and overview and merge them into a new single column that we call tags. Make sure to capture the text field&#8217;s nature. We will keep names and surnames together and not split them, as we will do with the words from the overview column. The result of this process is our bag.</p>



<p class="wp-block-paragraph">In addition, we add the movie title and a new index (id), which will later ease working with the similarity matrix. Finally, we print the first rows of our feature dataframe. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create an empty DataFrame
df_movies = pd.DataFrame()

# extract the keywords
df_movies['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['keywords'] = df_movies['keywords'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# extract the overview
df_movies['overview'] = df['overview'].fillna('')

# extract the release year 
df_movies['release_date'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

# extract the actors
df_movies['cast'] = df['cast'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['cast'] = df_movies['cast'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# extract genres
df_movies['genres'] = df['genres'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['genres'] = df_movies['genres'].apply(lambda x: ' '.join([i.replace(&quot; &quot;, &quot;&quot;) for i in x]))

# add the title
df_movies['title'] = df['title']

# merge fields into a tag field
df_movies['tags'] = df_movies['keywords'] + df_movies['cast']+' '+df_movies['genres']+' '+df_movies['release_date']

# drop records with empty tags and dublicates
df_movies.drop(df_movies[df_movies['tags']==''].index, inplace=True)
df_movies.drop_duplicates(inplace=True)

# add a fresh index to the dataframe, which we will later use when refering to items in a vector matrix
df_movies['new_id'] = range(0, len(df_movies))

# Reduce the data to relevant columns
df_movies = df_movies[['new_id', 'title', 'tags']]

# display the data
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', False)
print(df_movies.shape)
df_movies.head(5)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		new_id	title							tags
id			
862		0		Toy Story						jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolifeTomHanks TimAllen DonRickles JimVarney WallaceShawn JohnRatzenberger AnniePotts JohnMorris ErikvonDetten LaurieMetcalf R.LeeErmey SarahFreeman PennJillette Animation Comedy Family 1995
8844	1		Jumanji							boardgame disappearance basedonchildren'sbook newhome recluse giantinsectRobinWilliams JonathanHyde KirstenDunst BradleyPierce BonnieHunt BebeNeuwirth DavidAlanGrier PatriciaClarkson AdamHann-Byrd LauraBellBundy JamesHandy GillianBarber BrandonObray CyrusThiedeke GaryJosephThorup LeonardZola LloydBerry MalcolmStewart AnnabelKershaw DarrylHenriques RobynDriscoll PeterBryant SarahGilson FloricaVlad JuneLion BrendaLockmuller Adventure Fantasy Family 1995
15602	2		Grumpier Old Men				fishing bestfriend duringcreditsstinger oldmenWalterMatthau JackLemmon Ann-Margret SophiaLoren DarylHannah BurgessMeredith KevinPollak Romance Comedy 1995
31357	3		Waiting to Exhale				basedonnovel interracialrelationship singlemother divorce chickflickWhitneyHouston AngelaBassett LorettaDevine LelaRochon GregoryHines DennisHaysbert MichaelBeach MykeltiWilliamson LamontJohnson WesleySnipes Comedy Drama Romance 1995
11862	4		Father of the Bride Part II		baby midlifecrisis confidence aging daughter motherdaughterrelationship pregnancy contraception gynecologistSteveMartin DianeKeaton MartinShort KimberlyWilliams-Paisley GeorgeNewbern KieranCulkin BDWong PeterMichaelGoetz KateMcGregor-Stewart JaneAdams EugeneLevy LoriAlan Comedy 1995</pre></div>



<h4 class="wp-block-heading" id="h-2-2-visualizing-text-length">2.2 Visualizing Text Length</h4>



<p class="wp-block-paragraph">We can use a bar chart to illustrate each movie&#8217;s word bag length. This gives us an idea of how detailed the movie descriptions are. Items with short descriptions have, in principle, a lower probability of being recommended later. Recommenders produce better results if the length of the descriptions is somewhat balanced.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># add the tag length to the movies df
df_movies['tag_len'] = df_movies['tags'].apply(lambda x: len(x))

# illustrate the tag text length
sns.displot(data=df_movies.dropna(), bins=list(range(0, 2000, 25)), height=5, x='tag_len', aspect=3, kde=True)
plt.title('Distribution of tag text length')
plt.xlim([0, 2500])</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="346" data-attachment-id="8866" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/tags-text-length-visualization-python-content-based-recommender-system-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png" data-orig-size="1083,366" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="tags-text-length-visualization-python-content-based-recommender-system-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png" src="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2-1024x346.png" alt="content-based recommender system - illustration of the distribution of word length in our bag-of-word model" class="wp-image-8866" srcset="https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/tags-text-length-visualization-python-content-based-recommender-system-2.png 1083w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3 class="wp-block-heading" id="h-step-3-vectorization-using-tfidfvectorizer">Step #3: Vectorization using TfidfVectorizer</h3>



<p class="wp-block-paragraph">The next step is to create a vector matrix from the Bag of Words model. Each column from the matrix represents a word feature. This step is the basis for determining the similarity of the movies afterward. Before the vectorization, we will remove stop words from the text (e.g., and, it, that, or, why, where, etc.). In addition, I limited the number of features in the matrix to 5000 to reduce training time.</p>



<p class="wp-block-paragraph">A simple vectorization approach is to determine the word frequency for each movie using a count vectorizer. However, a frequently mentioned disadvantage of this approach is that it does not consider how often a word occurs. For example, some words may appear in almost all items. On the other hand, some words may be prevalent in a few items but are rare in general.  So we can argue that observing rare words in an item is more informative than observing common words. Instead of a count vectorizer, we will use a more practical approach called TfidfVectorizer from the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html" target="_blank" rel="noreferrer noopener">scikit-learn package</a>. </p>



<p class="wp-block-paragraph">Tfidf stands for term frequency-inverse document frequency. Compared to a count vectorizer, the tf-idf vectorizer considers the overall word frequencies and weights the general importance of the words when spanning the vectors. This way, tf-idf can determine which words are more important than others, reducing the model&#8217;s complexity and improving performance. This <a href="https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a" target="_blank" rel="noreferrer noopener">medium article</a> explains the math behind tf-idf vectorization in more detail.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set a custom stop list from nltk
stop = list(stopwords.words('english'))

# create the tfid vectorizer, alternatively you can also use countVectorizer
tfidf =  TfidfVectorizer(max_features=5000, analyzer = 'word', stop_words=set(stop))
vectorized_data = tfidf.fit_transform(df_movies['tags'])
count_matrix = pd.DataFrame(vectorized_data.toarray(), index=df_movies['tags'].index.tolist())
print(count_matrix)

</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">			0     1     2     3     4     5     6     7     8     9     ...  4990  4991  4992  4993  4994  4995  4996  4997  4998  4999
862      	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
8844     	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
15602    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
31357    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
11862    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...      	...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
439050   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
111109   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
67758    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
227506   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
461257   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

[45432 rows x 5000 columns]</pre></div>



<p class="wp-block-paragraph">The vectorization process results in a feature matrix in which each feature is a word from the text bag of words. </p>



<p class="wp-block-paragraph">We can display features with the get_feature_names_out function from the tfidf vectorizer.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># print feature names
print(tfidf.get_feature_names_out()[940:990])</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">['climbing' 'clinteastwood' 'clinthoward' 'clive' 'cliveowen'
 'cliverevill' 'cliverussell' 'clone' 'clorisleachman' 'cloviscornillac'
 'clown' 'clugulager' 'clydekusatsu' 'co' 'coach' 'cobb' 'cocaine' 'code'
 'coffin' 'cohen' 'coldwar' 'cole' 'colehauser' 'coleman' 'colinfarrell'
 'colinfirth' 'colinhanks' 'colinkenny' 'colinsalmon' 'colleencamp'
 'college' 'colmfeore' 'colmmeaney' 'coma' 'combat' 'comedian' 'comedy'
 'comicbook' 'comingofage' 'comingout' 'common' 'communism' 'communist'
 'company' 'competition' 'composer' 'computer' 'con' 'concentrationcamp'
 'concert']</pre></div>



<p class="wp-block-paragraph">As you can see, features are specific words,</p>



<h3 class="wp-block-heading" id="h-step-4-dimensionality-reduction-and-calculate-consine-similarities">Step #4 Dimensionality Reduction and Calculate Consine Similarities</h3>



<p class="wp-block-paragraph">In the previous section, we created a vector matrix that contains movies and features. This matrix is the foundation for calculating similarity scores for all movies. Before we assign feature scores, we will apply dimensionality reduction.</p>



<h4 class="wp-block-heading">4.1 Dimensionality Reduction using SVD</h4>



<p class="wp-block-paragraph">The matrix spans a high-dimensional vector space with more than 5000 feature columns. Do we need all of these features? The answer is most likely not. There are likely a lot of words in the matrix that only occur once or twice. On the other hand, words may occur in almost all movies. How can we deal with this issue?</p>



<p class="wp-block-paragraph">The reason for this is that we have a very dimensional vector space. By reducing this space to fewer, more essential features, we can save some time training our recommender model. We will use TruncatedSVD from the scikit-learn package, a popular algorithm for dimensionality reduction. The algorithm smoothens the matrix and approximates it to a lower dimensional space, thereby reducing noise and model complexity.</p>



<p class="wp-block-paragraph">This way, we will reduce the vector space from 5000 to 3000 features. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># reduce dimensionality for improved performance
svd = TruncatedSVD(n_components=3000)
reduced_data = svd.fit_transform(count_matrix)</pre></div>



<h4 class="wp-block-heading">4.2 Calculate Text Similarity Scores for all Movies</h4>



<p class="wp-block-paragraph">Now that we have reduced the complexity of our vector matrix, we can calculate the similarity scores for all movies. In this process, we assign a similarity score to all item pairs that measure content closeness according to the position of the items in the vector space. </p>



<p class="wp-block-paragraph">We use the cosine function to calculate the similarity value of the movies. The cosine similarity is a mathematical calculation to determine the mathematical similarity of two vectors. In our case, the vectors are the movie descriptions. The cosine similarity function uses these feature vectors to compare each movie to every other and assigns them a similarity value. </p>



<ul class="wp-block-list">
<li>A similarity value of -1 means that two feature vectors are correlated, and the movies are entirely different. </li>



<li>A value of 1 means that the two movies are identical. </li>



<li>A value of 0 is between and means f an average match of the feature vectors.</li>
</ul>



<p class="wp-block-paragraph">The cosine similarity function will calculate pairwise similarities for all movies in our vector matrix. We can determine the number of pairwise comparisons with the formula k²/2, whereby k is the number of items in the vector matrix. In our case, we have a k of 45000 movies. This means the cosine similarity function must calculate about 1 billion similarity scores. So don&#8217;t worry if the process takes some time to complete. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># compute the cosine similarity matrix
similarity = cosine_similarity(reduced_data)
similarity</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">array([[ 1.00000000e+00,  9.75542082e-02,  6.00755620e-02, ...,
        -3.03965235e-04,  0.00000000e+00,  5.81243547e-05],
       [ 9.75542082e-02,  1.00000000e+00,  5.92929339e-02, ...,
        -2.97565163e-03,  0.00000000e+00,  4.57945869e-05],
       [ 6.00755620e-02,  5.92929339e-02,  1.00000000e+00, ...,
         9.40459504e-03,  0.00000000e+00, -2.22415551e-04],
       ...,
       [-3.03965235e-04, -2.97565163e-03,  9.40459504e-03, ...,
         1.00000000e+00,  0.00000000e+00, -2.60823346e-04],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 5.81243547e-05,  4.57945869e-05, -2.22415551e-04, ...,
        -2.60823346e-04,  0.00000000e+00,  1.00000000e+00]])</pre></div>



<div style="height:38px" aria-hidden="true" class="wp-block-spacer"></div>



<h3 class="wp-block-heading" id="h-step-5-generate-content-based-movie-recommendations">Step #5: Generate Content-based Movie Recommendations </h3>



<p class="wp-block-paragraph">Once you have created the similarity matrix, it&#8217;s time to generate some recommendations. We begin by generating recommendations based on a single movie. In the cosine similarity matrix, the most similar movies have the highest similarity scores. Once we have the film with the highest scores, we can visualize the results in a bar chart that shows the cosine similarity scores. </p>



<p class="wp-block-paragraph">The example below displays the results of the movie &#8220;The Matrix.&#8221; Oh, how I love this movie 🙂</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create a function that takes in movie title as input and returns a list of the most similar movies
def get_recommendations(title, n, cosine_sim=similarity):
    
    # get the index of the movie that matches the title
    movie_index = df_movies[df_movies.title==title].new_id.values[0]
    print(movie_index, title)
    
    # get the pairwsie similarity scores of all movies with that movie and sort the movies based on the similarity scores
    sim_scores_all = sorted(list(enumerate(cosine_sim[movie_index])), key=lambda x: x[1], reverse=True)
    
    # checks if recommendations are limited
    if n &gt; 0:
        sim_scores_all = sim_scores_all[1:n+1]
        
    # get the movie indices of the top similar movies
    movie_indices = [i[0] for i in sim_scores_all]
    scores = [i[1] for i in sim_scores_all]
    
    # return the top n most similar movies from the movies df
    top_titles_df = pd.DataFrame(df_movies.iloc[movie_indices]['title'])
    top_titles_df['sim_scores'] = scores
    top_titles_df['ranking'] = range(1, len(top_titles_df) + 1)
    
    return top_titles_df, sim_scores_all

# generate a list of recommendations for a specific movie title
movie_name = 'The Matrix'
number_of_recommendations = 15
top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
 
# visualize the results
def show_results(movie_name, top_titles_df):
    fix, ax = plt.subplots(figsize=(11, 5))
    sns.barplot(data=top_titles_df, y='title', x= 'sim_scores', color='blue')
    plt.xlim((0,1))
    plt.title(f'Top 15 recommendations for {movie_name}')
    pct_values = ['{:.2}'.format(elm) for elm in list(top_titles_df['sim_scores'])]
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

show_results(movie_name, top_titles_df)</pre></div>



<p class="wp-block-paragraph">Example for the movies &#8220;Spectre&#8221; and &#8220;The Lion King&#8221;</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-full"><img decoding="async" width="830" height="329" data-attachment-id="9748" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-2-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" data-orig-size="830,329" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png" alt="" class="wp-image-9748" srcset="https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 830w, https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/10/image-2.png 768w" sizes="(max-width: 830px) 100vw, 830px" /></figure>
</div>
</div>



<h3 class="wp-block-heading">Step #6: Generate Content-based Movie Recommendations</h3>



<p class="wp-block-paragraph">But what if you want to generate recommendations for specific users that have seen several movies? For this, we can aggregate the similarity scores for all films the user has seen. This way, we create a new dataframe that sums up similarity scores. To return the top-recommended movies, we can sort this dataframe by similarity scores and replace the top elements. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># list of movies a user has seen
movie_list = ['The Lion King', 'Seven', 'RoboCop 3', 'Blade Runner', 'Quantum of Solace', 'Casino Royale', 'Skyfall']

# create a copy of the movie dataframe and add a column in which we aggregated the scores
user_scores = pd.DataFrame(df_movies['title'])
user_scores['sim_scores'] = 0.0

# top number of scores to be considered for each movie
number_of_recommendations = 10000
for movie_name in movie_list:
    top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
    # aggregate the scores
    user_scores = pd.concat([user_scores, top_titles_df[['title', 'sim_scores']]]).groupby(['title'], as_index=False).sum({'sim_scores'})
# sort and print the aggregated scores
user_scores.sort_values(by='sim_scores', ascending=False)[1:20]</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="8888" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-8-12/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png" data-orig-size="1375,583" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-8" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png" src="https://www.relataly.com/wp-content/uploads/2022/07/image-8-1024x434.png" alt="Content-based movie recommdations for a user that has previously watched The Lion King, Seven, RoboCop 3, Blade Runner, Quantum of Solace, Casino Royale, Skyfall" class="wp-image-8888" width="805" height="341" srcset="https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 1024w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 300w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 768w, https://www.relataly.com/wp-content/uploads/2022/07/image-8.png 1375w" sizes="(max-width: 805px) 100vw, 805px" /></figure>



<div style="height:65px" aria-hidden="true" class="wp-block-spacer"></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">In this tutorial, you have learned to implement a simple content-based recommender system for movie recommendations in Python.  We have used several movie-specific details to calculate a similarity matrix for all movies in our dataset. Finally, we have used this model to generate recommendations for two cases:</p>



<ul class="wp-block-list">
<li>Films that are similar to a specific movie</li>



<li>Films that are recommended based on the watchlist of a particular user.</li>
</ul>



<p class="wp-block-paragraph">A downside of content-based recommenders is that you cannot test their performance unless you know how users perceived the recommendations. This is because content-based recommenders can only determine which items in a dataset are similar. To understand how well the suggestions work, you must include additional data about actual user preferences. </p>



<p class="wp-block-paragraph">More advanced recommenders will combine content-based recommendations with user-item interactions (<a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" target="_blank" rel="noreferrer noopener">e.g., collaborative filtering</a>). Such models are called hybrid recommenders, but this is something for another article.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-large"><img decoding="async" width="512" height="472" data-attachment-id="12683" data-permalink="https://www.relataly.com/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png" data-orig-size="532,490" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="dog with popcorn machine learning movie recommender python tutorial relataly midjourney ai-min" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min-512x472.png" alt="" class="wp-image-12683" srcset="https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 512w, https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 300w, https://www.relataly.com/wp-content/uploads/2023/03/dog-with-popcorn-machine-learning-movie-recommender-python-tutorial-relataly-midjourney-ai-min.png 532w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a></figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Below are some resources for further reading on recommender systems and content-based models.</p>



<p class="wp-block-paragraph"><strong>Books</strong></p>



<ol class="wp-block-list"><li><a href="https://amzn.to/3T3Sl2V" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2016) Recommender Systems: The Textbook</a></li><li><a href="https://amzn.to/3D0P8eX" target="_blank" rel="noreferrer noopener">Kin Falk (2019) Practical Recommender Systems</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph"><strong>Articles</strong></p>



<ol class="wp-block-list"><li><a href="https://surprise.readthedocs.io/en/stable/getting_started.html" target="_blank" rel="noreferrer noopener">Getting started with Suprise</a></li><li><a href="https://onespire.hu/sap-news-en/history-of-recommender-systems/" target="_blank" rel="noreferrer noopener">About the history of Recommender Systems</a></li><li><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener"></a><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener">Singular value decomposition vs. matrix factorization</a></li><li><a href="https://proceedings.neurips.cc/paper/2007/file/d7322ed717dedf1eb4e6e52a37ea7bcd-Paper.pdf" target="_blank" rel="noreferrer noopener">Probabilistic Matrix Factorization</a></li></ol>
<p>The post <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/">Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/content-based-movie-recommender-using-python/4294/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4294</post-id>	</item>
		<item>
		<title>Build a High-Performing Movie Recommender System using Collaborative Filtering in Python</title>
		<link>https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/</link>
					<comments>https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/#comments</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Mon, 31 May 2021 10:29:35 +0000</pubDate>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Collaborative Filtering]]></category>
		<category><![CDATA[Recommender Systems]]></category>
		<category><![CDATA[Retail]]></category>
		<category><![CDATA[Surprise Lib]]></category>
		<category><![CDATA[Use Cases]]></category>
		<category><![CDATA[AI in E-Commerce]]></category>
		<category><![CDATA[Collaborative Filtering Recommender System]]></category>
		<category><![CDATA[Intermediate Tutorials]]></category>
		<category><![CDATA[SVD]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=4376</guid>

					<description><![CDATA[<p>The digital age presents us with an unmanageable number of decisions and even more options. Which series to watch today? What song to listen to next? Nowadays, the internet and its vast content offer too many choices. But there is hope &#8211; recommender systems are here to solve this problem and support our decision-making. They ... <a title="Build a High-Performing Movie Recommender System using Collaborative Filtering in Python" class="read-more" href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/" aria-label="Read more about Build a High-Performing Movie Recommender System using Collaborative Filtering in Python">Read more</a></p>
<p>The post <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/">Build a High-Performing Movie Recommender System using Collaborative Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The digital age presents us with an unmanageable number of decisions and even more options. Which series to watch today? What song to listen to next? Nowadays, the internet and its vast content offer too many choices. But there is hope &#8211; recommender systems are here to solve this problem and support our decision-making. They rank among the fascinating use cases for machine learning and intelligently filter information to present us with a smaller set of options that most likely fit our tastes and needs. This tutorial will explore recommender systems and implement a movie recommender in Python that uses &#8220;Collaborative Filtering.&#8221;</p>



<p class="wp-block-paragraph">This article is structured as follows: We begin by briefly going through the basics of different types of recommender systems. Then we will look at the most common recommender algorithms and go into more detail on Collaborative Filtering. Once equipped with this conceptual understanding, we will develop our recommender system using the popular 100k Movies Dataset. We will train and test a recommender model to predict movie ratings. The library used is the Scikit-Surprise Python library. The recommendation approach combines Collaborative Filtering and Singular Value Decomposition (SVD).</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>
</div>
</div>



<h2 class="wp-block-heading" id="h-an-overview-of-recommender-techniques">An Overview of Recommender Techniques</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">The first attempts with recommendation systems reach back to the 1970s. The approach was relatively simple and categorized users into groups to suggest the same content to all users in the same group. However, it was a breakthrough because, for the first time, a program could make personalized recommendations. As the importance of recommendation engines increased, so did the interest in improving their predictions.</p>



<p class="wp-block-paragraph">With the rise of the internet and the rapidly growing amount of data available on the web, filtering relevant content has become increasingly important. Large tech companies such as Netflix (TV shows and movies), Amazon (products), or Facebook (user content and profiles) understood early on that they could use recommendation systems to personalize the selection of user content shown to their customers. They all face the same challenge of having massive content and only limited space to display it to their users. Therefore, it has become crucial for them to select and display only the content that matches the individual interests of their users.</p>



<p class="wp-block-paragraph">Internet companies are predestined for using recommender systems, as they have massive amounts of user data available. The user data is the foundation for generating personalized recommendations on a large scale by analyzing behavior patterns among larger groups of users to tailor suggestions to the taste of individuals.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-full"><img decoding="async" width="512" height="512" data-attachment-id="11938" data-permalink="https://www.relataly.com/reinforcement-learning-agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform/" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png" data-orig-size="512,512" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Reinforcement Learning &amp;#8211; Agents learn from their activities if our reward function tells them how they perform" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png" src="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png" alt="" class="wp-image-11938" srcset="https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png 512w, https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png 300w, https://www.relataly.com/wp-content/uploads/2023/01/Reinforcement-Learning-Agents-learn-from-their-activities-if-our-reward-function-tells-them-how-they-perform.png 140w" sizes="(max-width: 512px) 100vw, 512px" /><figcaption class="wp-element-caption">The more choices we have, the more difficult it gets to decide </figcaption></figure>
</div></div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading" id="h-three-common-approaches-to-recommender-systems">Three Common Approaches to Recommender Systems</h3>



<p class="wp-block-paragraph">Nowadays, many different approaches can generate recommendations. However, most recommender systems use one of the following three techniques:</p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4348" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-29-6/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-29.png" data-orig-size="1834,751" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-29" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-29.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-29-1024x419.png" alt="machine learning for recommender systems -  comparison of three different techniques used to build recommender systems - content based filtering, collaborative filtering, hybrid approach" class="wp-image-4348" width="1056" height="430" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-29.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-29.png 768w" sizes="(max-width: 1056px) 100vw, 1056px" /><figcaption class="wp-element-caption">Popular techniques used to build recommender systems</figcaption></figure>



<h4 class="wp-block-heading" id="h-content-based-filtering">Content-based Filtering&nbsp;</h4>



<p class="wp-block-paragraph" id="h-content-based-filtering-is-a-technique-that-recommends-similar-items-based-on-item-content-naturally-this-approach-is-based-on-metadata-to-find-out-which-items-are-similar-for-example-in-the-case-of-movie-recommendations-the-algorithm-would-look-at-the-genre-cast-or-director-of-a-movie-models-may-also-consider-metadata-on-users-such-as-age-gender-and-so-on-to-suggest-similar-content-to-similar-users-the-similarity-can-be-calculated-using-different-methods-for-example-cosine-similarity-or-minkowski-distance">Content-based filtering&nbsp;is a technique that recommends similar items based on item content. Naturally, this approach is based on metadata to determine which items are similar. For example, in the case of movie recommendations, the algorithm would look at the genre, cast, or director of a movie. Models may also consider metadata on users, such as age, gender, etc., to suggest similar content to similar users. There are different methods to calculate the similarity, for example, Cosine Similarity or Minkowski Distance.</p>



<p class="wp-block-paragraph">A significant challenge in content-based Filtering is the transferability of user preference insights from one item type to another. Often, content-based recommenders struggle to transfer user actions on one item (e.g., book) to other content types (e.g., clothing). In addition, content-based systems tend to develop tunnel vision, leading the engine to recommend more and more of the same.</p>



<p class="wp-block-paragraph">A separate article covers content-based filtering in more detail and shows <a href="https://www.relataly.com/content-based-movie-recommender-using-python/4294/" target="_blank" rel="noreferrer noopener">how to implement this approach for movie recommendations. </a></p>



<h4 class="wp-block-heading" id="h-collaborative-filtering">Collaborative Filtering</h4>



<p class="wp-block-paragraph">Collaborative Filtering&nbsp;is a well-established approach used to build recommendation systems. The recommendations generated through Collaborative Filtering are based on past interactions between a user and a set of items (movies, products, etc.) that are matched against past item-user interactions within a larger group of people. The main idea is to use the interactions between a group of users and a group of items to guess how users rate items they have not yet rated before.</p>



<p class="wp-block-paragraph">A challenge of collaborative filters is known as the cold start problem, which refers to the entry of new users into the system without any ratings. As a result, the engine does not know its interests and cannot make meaningful recommendations. The same applies to new items entering the system (e.g., products) that have not yet received any ratings. As a result, recommendations can become self-reinforcing. Popular content that many users have rated is recommended to almost all other users, making this content even more popular. On the other hand, the engine hardly suggests content with few or no ratings, so no one will rate this content.</p>



<h4 class="wp-block-heading" id="h-hybrid-approach">Hybrid Approach</h4>



<p class="wp-block-paragraph">Combining the previous two techniques in a hybrid approach is also possible. We can implement hybrid approaches by generating content-based and collaborative-based predictions separately and then combining them. The result is a model that considers the interactions between users and items and context information. Hybrid recommender systems often achieve better results than recommendation approaches that use a single of the underlying techniques.</p>



<p class="wp-block-paragraph">Netflix is known to use a hybrid recommendation system. Its engine recommends content to its users based on similar users&#8217; viewing and search habits (Collaborative Filtering). At the same time, it also recommends series and movies whose characteristics match the content users rated highly (content-based Filtering).</p>



<h3 class="wp-block-heading" id="h-how-model-based-collaborative-filtering-works">How Model-based Collaborative Filtering Works</h3>



<p class="wp-block-paragraph">We can further differentiate between memory-based and model-based Collaborative Filtering. This tutorial focuses on model-based Collaborative Filtering, which is more commonly used. </p>



<h3 class="wp-block-heading" id="h-behavioral-patterns-dependencies-among-users-and-items">Behavioral Patterns: Dependencies among Users and Items</h3>



<p class="wp-block-paragraph"></p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p class="wp-block-paragraph">Collaborative filtering searches for behavioral patterns in interactions between a group of users and a group of items to infer the interests of individuals. The input data for collaborative Filtering is typically in the form of a user/item matrix filled with ratings (as shown below).</p>



<p class="wp-block-paragraph">Patterns can exist in the user/item matrix in dependencies between users and items. Some dependencies are easy to grasp. Similar dependencies exist between items in that some movies receive high ratings from the same users. For example, assume two users, Ron and Joe, have rated movies. Ron enjoyed Batman, Indiana Jones, Star Wars, and Godzilla. Joe enjoyed the same movies as Ron, except Godzilla, which he has not yet rated. Based on the similarity between Joe and Ron, we assume that Joe would also enjoy Godzilla. </p>



<p class="wp-block-paragraph">Things get more complex as latent dependencies are present in the data. Imagine Bob gave a three-star rating to five different movies. Another user, Jenny, rated the same movies as Bob but always gave four stars, which is an example of latent dependency. There is some form of dependence between the two users, and although it is not as significant as in the first example, considering latent dependencies will improve predictions.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4441" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/image-3-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png" data-orig-size="402,303" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png" alt="Dependencies among users and items for movie recommendations" class="wp-image-4441" width="327" height="247" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-3.png 402w, https://www.relataly.com/wp-content/uploads/2021/06/image-3.png 300w" sizes="(max-width: 327px) 100vw, 327px" /><figcaption class="wp-element-caption">user/movies matrix</figcaption></figure>
</div>
</div>



<p class="wp-block-paragraph"></p>



<h3 class="wp-block-heading" id="h-machine-learning-and-dimensionality-reduction">Machine Learning and Dimensionality Reduction</h3>



<p class="wp-block-paragraph">Model-based collaborative filtering techniques estimate the parameters of statistical models to predict how individual users would rate an unrated item. A widely used approach formulates this problem as a classification task that considers items over users as features and ratings as prediction labels (as shown in the matrix). We can use various algorithms to solve such an optimization problem, including gradient-based techniques or alternating least squares.</p>



<p class="wp-block-paragraph">However, user/item matrices can become very large, making searching for patterns computationally expensive. Also, users will typically rate only a tiny fraction of the items in the matrix, so algorithms must deal with an abundant number of missing values (sparse matrix). Therefore, combining machine learning or deep learning with techniques for dimensionality reduction has become state-of-the-art.</p>



<p class="wp-block-paragraph">One of the most widely used techniques for dimensionality reduction is matrix factorization. This approach compresses the initial sparse user/item matrix and presents it as separate matrices that present items and users as unknown feature vectors (as shown below). Such a matrix is densely populated and thus easier to handle, but it also enables the model to uncover latent dependencies among items and users, which increases model accuracy. </p>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="276" data-attachment-id="4462" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/image-7-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/06/image-7.png" data-orig-size="1299,350" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-7" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/06/image-7.png" src="https://www.relataly.com/wp-content/uploads/2021/06/image-7-1024x276.png" alt="Matrix Factorization applied to the sparse Items / User Matrix" class="wp-image-4462" srcset="https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 1024w, https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 300w, https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 768w, https://www.relataly.com/wp-content/uploads/2021/06/image-7.png 1299w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Matrix Factorization is applied to the sparse Items / User Matrix.</figcaption></figure>



<h3 class="wp-block-heading" id="h-python-libraries-for-collaborative-filtering">Python Libraries for Collaborative Filtering</h3>



<p class="wp-block-paragraph">So far, only a few Python libraries support model-based collaborative Filtering out of the box. The most well-known libraries for recommender systems are probably <a href="https://github.com/NicolasHug/Surprise" target="_blank" rel="noreferrer noopener">Scikit-Suprise</a> and <a href="https://www.fast.ai/" target="_blank" rel="noreferrer noopener">Fast.ai for Pytorch</a>. </p>



<p class="wp-block-paragraph">Below you find an overview of the different algorithms that these libraries support. </p>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4303" data-permalink="https://www.relataly.com/content-based-movie-recommender-using-python/4294/image-18-8/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png" data-orig-size="989,218" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-18" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png" alt="different algorithms used to train recommender systems based on collaborative filtering " class="wp-image-4303" width="592" height="130" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-18.png 989w, https://www.relataly.com/wp-content/uploads/2021/05/image-18.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-18.png 768w" sizes="(max-width: 592px) 100vw, 592px" /></figure>



<h2 class="wp-block-heading" id="h-implementing-a-movie-recommender-in-python-using-collaborative-filtering">Implementing a Movie Recommender in Python using Collaborative Filtering</h2>



<p class="wp-block-paragraph">Now it&#8217;s time to get our hands dirty and begin with implementing our movie recommender. As always, you find the code in the relataly git-hub repository. </p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_6a7ed1-b5"><a class="kb-button kt-button button kb-btn_7902e3-f8 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/05%20Recommender%20Systems/031Movie%20Recommender%20using%20Collaborative%20Filtering.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_26bd47-d7 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit kt-btn-has-text-true kt-btn-has-svg-true wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p class="wp-block-paragraph">Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don&#8217;t have an environment, consider the&nbsp;<a href="https://www.anaconda.com/products/individual" target="_blank" rel="noreferrer noopener">Anaconda Python environment</a>. Follow <a href="https://www.relataly.com/category/data-science/setup-anaconda-environment/" target="_blank" rel="noreferrer noopener">this tutorial</a> to set it up.</p>



<p class="wp-block-paragraph">Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:&nbsp;</p>



<ul class="wp-block-list">
<li><em><a href="https://pandas.pydata.org/" target="_blank" rel="noreferrer noopener">pandas</a></em></li>



<li><em><a href="https://numpy.org/" target="_blank" rel="noreferrer noopener">NumPy</a></em></li>



<li><em><a href="https://matplotlib.org/" target="_blank" rel="noreferrer noopener">matplotlib</a></em></li>
</ul>



<p class="wp-block-paragraph">In addition, we will be using <a href="https://seaborn.pydata.org/" target="_blank" rel="noreferrer noopener">Seaborn </a>for visualization and the recommender systems library <a href="https://www.fast.ai/" target="_blank" rel="noreferrer noopener"></a><em><a href="https://github.com/NicolasHug/Surprise" target="_blank" rel="noreferrer noopener">Scikit-Suprise</a></em>. You can install the surprise package by forging it with the following command: </p>



<ul class="wp-block-list">
<li><em>conda install -c conda-forge scikit-surprise</em></li>
</ul>



<p class="wp-block-paragraph">You can install the other packages using standard console commands:</p>



<ul class="wp-block-list">
<li><em>pip install &lt;package name&gt;</em></li>



<li><em>conda install &lt;package name&gt;</em>&nbsp;(if you are using the anaconda packet manager)</li>
</ul>



<h3 class="wp-block-heading" id="h-about-the-imdb-movies-dataset">About the IMDB Movies Dataset</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-8f761849 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p class="wp-block-paragraph">We will train our movie recommendation model on a popular Movies Dataset (you can download it from <a href="https://grouplens.org/datasets/movielens/latest/" target="_blank" rel="noreferrer noopener">grouplens.org</a>). The MovieLens recommendation service collected the Dataset from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.</p>



<p class="wp-block-paragraph">The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. However, we will be working with a subset of the data &#8220;ratings_small.csv,&#8221; which contains 100,836 ratings from 700 users on 9742 movies.</p>



<p class="wp-block-paragraph">The Dataset contains the following files, from which we will only use the first two (Source of the data description: <a href="https://www.kaggle.com/rounakbanik/the-movies-dataset" target="_blank" rel="noreferrer noopener">Kaggle.com</a>):</p>



<ul class="wp-block-list">
<li><strong>movies_metadata.csv:</strong>&nbsp;The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.</li>



<li><strong>ratings_small.csv:</strong>&nbsp;The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a single 5-star movie rating with half-star increments (0.5 &#8211; 5.0 stars).</li>
</ul>



<p class="wp-block-paragraph">There are several other files included that we won&#8217;t use. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="7128" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/mdb-movie-database/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" data-orig-size="1024,537" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="recomender systems collaborative filtering imdb movies" data-image-description="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-image-caption="&lt;p&gt;recomender systems collaborative filtering imdb movies&lt;/p&gt;
" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" src="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png" alt="MDB Movie Database Recommender Systems Collaborative Filtering " class="wp-image-7128" width="366" height="192" srcset="https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/MDB-Movie-Database.png 768w" sizes="(max-width: 366px) 100vw, 366px" /><figcaption class="wp-element-caption">IMDB Movie Database </figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-step-1-load-the-data">Step #1: Load the Data</h3>



<p class="wp-block-paragraph">Ensure you have downloaded and unpacked the data and the required packages.</p>



<p class="wp-block-paragraph">You can then load the movie data into our Python project using the code snippet below. We do not need all of the files in the movie dataset and only work with the following two.</p>



<ul class="wp-block-list">
<li>movies_metadata.csv</li>



<li>ratings_small.csv</li>
</ul>



<h4 class="wp-block-heading" id="h-1-1-load-the-movies-data">1.1 Load the Movies Data</h4>



<p class="wp-block-paragraph">First, we will load the movies_metadata, which contains a list of all movies and meta information such as the release year, a short description, etc. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split, cross_validate
from ast import literal_eval

# in case you have placed the files outside of your working directory, you need to specify a path
path = '' # for example: 'data/movie_recommendations/'  

# load the movie metadata
df_moviesmetadata=pd.read_csv(path + 'movies_metadata.csv', low_memory=False) 
print(df_moviesmetadata.shape)
print(df_moviesmetadata.columns)
df_moviesmetadata.head(1)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(45466, 24)
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
	adult	belongs_to_collection								budget		genres												homepage								id	imdb_id		original_language	original_title	overview	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en					Toy Story		Led by Woody, Andy's toys live happily in his ...	...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1 rows × 24 columns</pre></div>



<h4 class="wp-block-heading" id="h-1-2-load-the-ratings-data">1.2 Load the Ratings Data</h4>



<p class="wp-block-paragraph">We proceed by loading the rating file. This file contains the movie ratings for each user, the movieId, and a timestamp.</p>



<p class="wp-block-paragraph">In addition, we print the value counts for rankings in our Dataset. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># load the movie ratings
df_ratings=pd.read_csv(path + 'ratings_small.csv', low_memory=False) 

print(df_ratings.shape)
print(df_ratings.columns)
df_ratings.head(3)

rankings_count = df_ratings.rating.value_counts().sort_values()
sns.barplot(x=rankings_count.index.sort_values(), y=rankings_count, color=&quot;b&quot;)
sns.set_theme(style=&quot;whitegrid&quot;)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">(100004, 4)
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
	userId	movieId	rating	timestamp
0	1		31		2.5		1260759144
1	1		1029	3.0		1260759179
2	1		1061	3.0		1260759182</pre></div>



<p class="wp-block-paragraph">As we can see, most of the ratings in our Dataset are positive.</p>



<h3 class="wp-block-heading" id="h-step-2-preprocessing-and-cleaning-the-data">Step #2 Preprocessing and Cleaning the Data</h3>



<p class="wp-block-paragraph">We continue with the preprocessing of the data. The recommendations of a User-based Collaborative Filtering Approach rely solely on the interactions between users and items. This means training a prediction model does not require the meta-information of the movies. Nevertheless, we will load the metadata because it is just nicer to display the recommendations, movie title, release year, and so on, instead of just ids.</p>



<h4 class="wp-block-heading" id="h-2-1-clean-the-movie-data">2.1 Clean the Movie Data</h4>



<p class="wp-block-paragraph">Unfortunately, the data quality of the movies&#8217; metadata is not excellent, so we need to fix a few things. The following operations will change some data types to integers, extract the release year and genres, and remove some records with incorrect data.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove invalid records with invalid ids
df_mmeta = df_moviesmetadata.drop([19730, 29503, 35587])

df_movies = pd.DataFrame()

# extract the release year 
df_movies['year'] = pd.to_datetime(df_mmeta['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

# extract genres
df_movies['genres'] = df_mmeta['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# change the index to movie_id
df_movies['movieId'] = pd.to_numeric(df_mmeta['id'])
df_movies = df_movies.set_index('movieId')

# add vote count
df_movies['vote_count'] = df_movies['vote_count'].astype('int')
df_movies</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">title									vote_count	vote_average	year	genres
		movieId					
862		Toy Story						5415			7.7	1995	[Animation, Comedy, Family]
8844	Jumanji							2413			6.9	1995	[Adventure, Fantasy, Family]
15602	Grumpier Old Men				92				6.5	1995	[Romance, Comedy]
31357	Waiting to Exhale				34				6.1	1995	[Comedy, Drama, Romance]
11862	Father of the Bride Part II		173				5.7	1995	[Comedy]
...	...	...	...	...	...
49279	The Man with the Rubber Head	29				7.6	1901	[Comedy, Fantasy, Science Fiction]
49271	The Devilish Tenant				12				6.7	1909	[Fantasy, Comedy]
49280	The One-Man Band				22				6.5	1900	[Fantasy, Action, Thriller]
404604	Mom								14				6.6	2017	[Crime, Drama, Thriller]
30840	Robin Hood						26				5.7	1991	[Drama, Action, Romance]
22931 rows × 5 columns</pre></div>



<h4 class="wp-block-heading" id="h-2-2-clean-the-ratings-data">2.2 Clean the Ratings Data</h4>



<p class="wp-block-paragraph">Compared to the movie metadata, not much more needs to be done to the rating data. Here we just put the timestamp into a readable format.</p>



<p class="wp-block-paragraph">One of the following steps is to use the Reader class from the Surprise library to parse the ratings and put them into a format compatible with standard recommendation algorithms from the Surprise library. The Reader needs the data in the form where each line contains only one rating and respects the following structure:</p>



<p class="wp-block-paragraph">user; item ; rating ; timestamp</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># drop na values
df_ratings_temp = df_ratings.dropna()

# convert datetime
df_ratings_temp['timestamp'] = pd. to_datetime(df_ratings_temp['timestamp'], unit='s')

print(f'unique users: {len(df_ratings_temp.userId.unique())}, ratings: {len(df_ratings_temp)}')
df_ratings_temp.head()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">unique users: 671, ratings: 100004
	userId		movieId	rating	timestamp
0	1			31		2.5		2009-12-14 02:52:24
1	1			1029	3.0		2009-12-14 02:52:59
2	1			1061	3.0		2009-12-14 02:53:02
3	1			1129	2.0		2009-12-14 02:53:05
4	1			1172	4.0		2009-12-14 02:53:25</pre></div>



<h3 class="wp-block-heading" id="h-step-3-split-the-data-in-train-and-test">Step #3: Split the Data in Train and Test</h3>



<p class="wp-block-paragraph">Next, we will split the data into train and test sets. In this way, we ensure that we can later evaluate the performance of our recommender model on data that the model has not yet seen. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># The Reader class is used to parse a file containing ratings.
# The file is assumed to specify only one rating per line, such as in the df_ratings_temp file above.
reader = Reader()
ratings_by_users = Dataset.load_from_df(df_ratings_temp[['userId', 'movieId', 'rating']], reader)

# Split the Data into train and test
train_df, test_df = train_test_split(ratings_by_users, test_size=.2)</pre></div>



<p class="wp-block-paragraph">Once we have split the data into train and test, we can train the recommender model.</p>



<h3 class="wp-block-heading" id="h-step-4-train-a-movie-recommender-using-collaborative-filtering">Step #4: Train a Movie Recommender using Collaborative Filtering</h3>



<p class="wp-block-paragraph">Training the SVD model requires only lines of code. The first line creates an untrained model that uses Probabilistic Matrix Factorization for dimensionality reduction. The second line will fit this model to the training data. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># train an SVD model
svd_model = SVD()
svd_model_trained = svd_model.fit(train_df)</pre></div>



<h3 class="wp-block-heading" id="h-step-5-evaluate-prediction-performance-using-cross-validation">Step #5: Evaluate Prediction Performance using Cross-Validation</h3>



<p class="wp-block-paragraph">Next, it is time to validate the performance of our movie recommendation program. For this, we use k-fold cross-validation. As a reminder, cross-validation involves splitting the Dataset into different folds and then measuring the prediction performance based on each fold.</p>



<p class="wp-block-paragraph">We can measure model performance using indicators such as mean absolute error (MAE) or mean squared error (MSE). The MAE is the average difference between predicting a movie and the actual ratings. We chose this measure because it is easy to understand.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># 10-fold cross validation 
cross_val_results = cross_validate(svd_model_trained, ratings_by_users, measures=['RMSE', 'MAE', 'MSE'], cv=10, verbose=False)
test_mae = cross_val_results['test_mae']

# mean squared errors per fold
df_test_mae = pd.DataFrame(test_mae, columns=['Mean Absolute Error'])
df_test_mae.index = np.arange(1, len(df_test_mae) + 1)
df_test_mae.sort_values(by='Mean Absolute Error', ascending=False).head(15)

# plot an overview of the performance per fold
plt.figure(figsize=(6,4))
sns.set_theme(style=&quot;whitegrid&quot;)
sns.barplot(y='Mean Absolute Error', x=df_test_mae.index, data=df_test_mae, color=&quot;b&quot;)
# plt.title('Mean Absolute Error')</pre></div>



<figure class="wp-block-image size-large is-resized"><img decoding="async" data-attachment-id="4385" data-permalink="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/image-41-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png" data-orig-size="598,366" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-41" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png" src="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png" alt="movie recommender performance" class="wp-image-4385" width="651" height="399" srcset="https://www.relataly.com/wp-content/uploads/2021/05/image-41.png 598w, https://www.relataly.com/wp-content/uploads/2021/05/image-41.png 300w, https://www.relataly.com/wp-content/uploads/2021/05/image-41.png 80w" sizes="(max-width: 651px) 100vw, 651px" /></figure>



<p class="wp-block-paragraph">The chart above shows that the mean deviation of our predictions from the actual rating is a little below 0.7. The result is not terrific but ok for a first model. In addition, there are no significant differences between the performance in the different folds. Let&#8217;s keep in mind that the MAE says little about possible outliers in the predictions. However, since we are dealing with ordinal predictions (1-5), the influence of outliers is naturally limited.</p>



<h3 class="wp-block-heading" id="h-step-6-generate-predictions">Step #6: Generate Predictions </h3>



<p class="wp-block-paragraph">Finally, we will use our movie recommender to generate a list of suggested movies for a specific test user. The predictions will be based on the user&#8217;s previous movie ratings.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># predict ratings for a single user_id and for all movies
user_id = 400 # some test user from the ratings file

# create the predictions
pred_series= []
df_ratings_filtered = df_ratings[df_ratings['userId'] == user_id]

print(f'number of ratings: {df_ratings_filtered.shape[0]}')
for movie_id, name in zip(df_movies.index, df_movies['title']):
    # check if the user has already rated a specific movie from the list
    rating_real = df_ratings.query(f'movieId == {movie_id}')['rating'].values[0] if movie_id in df_ratings_filtered['movieId'].values else 0
    # generate the prediction
    rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=False)
    # add the prediction to the list of predictions
    pred_series.append([movie_id, name, rating_pred.est, rating_real])

# print the results
df_recommendations = pd.DataFrame(pred_series, columns=['movieId', 'title', 'predicted_rating', 'actual_rating'])
df_recommendations.sort_values(by='predicted_rating', ascending=False).head(15)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">		movieId	title								predicted_rating	actual_rating
4234	4993	5 Card Stud							4.721481			0.0
3194	318		The Million Dollar Hotel			4.648623			5.0
442		858		Sleepless in Seattle				4.506962			0.0
236		527		Once Were Warriors					4.484758			4.0
2426	926		Galaxy Quest						4.465653			0.0
6532	905		Pandora's Box						4.452688			0.0
710		260		The 39 Steps						4.390318			0.0
8787	3683	Flags of Our Fathers				4.386821			0.0
5400	899		Broken Blossoms						4.384220			0.0
5068	296		Terminator 3: Rise of the Machines	4.383057			4.0
372		2019	Hard Target							4.365834			0.0
7254	919		Blood: The Last Vampire				4.356012			0.0
3295	4973	Under the Sand						4.355750			0.0
3869	194		Amélie								4.353614			0.0
8631	1948	Crank								4.344286			0.0</pre></div>



<p class="wp-block-paragraph">Alternatively, we can predict how well a specific user will rate a movie by handing the user_id and the movie_id to the model. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># predict ratings for the combination of user_id and movie_id
user_id = 217 # some test user from the ratings file
movie_id = 4002
rating_real = df_ratings.query(f'movieId == {movie_id} &amp; userId == {user_id}')['rating'].values[0]
movie_title = df_movies[df_movies.index == 862]['title'].values[0]

print(f'Movie title: {movie_title}')
print(f'Actual rating: {rating_real}')

# predict and show the result
rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=True)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Movie title: Toy Story
Actual rating: 4.5
user: 217        item: 4002       r_ui = 4.50   est = 3.98   {'was_impossible': False}</pre></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p class="wp-block-paragraph">Congratulations on learning how to develop a movie recommendation system in Python! In this article, you learned about the SVD model, which uses matrix factorization and collaborative filtering to predict movie ratings for a given user. We also demonstrated how to perform cross-validation on a movie dataset and use the model to generate movie recommendations.</p>



<p class="wp-block-paragraph">Several other approaches can be used to develop a movie recommendation system, <a href="https://wp.me/pbUnJO-17g" target="_blank" rel="noreferrer noopener">including content-based filtering</a>, which uses features of the movies themselves (such as genre, director, and cast) to make recommendations, and hybrid systems, which combine the strengths of multiple approaches.</p>



<p class="wp-block-paragraph">Regardless of the approach used, building a movie recommendation system can be a useful tool for recommending movies to users based on their past preferences and can help increase engagement and satisfaction with a movie streaming service or website.</p>



<p class="wp-block-paragraph">If you like the post, please let me know in the comments, and don&#8217;t forget to subscribe to our <a href="https://twitter.com/home" target="_blank" rel="noreferrer noopener">Twitter account</a> to stay up to date on upcoming articles.</p>



<h2 class="wp-block-heading">Sources and Further Reading</h2>



<p class="wp-block-paragraph">Below are some resources for further reading on recommender systems and content-based models.</p>



<p class="wp-block-paragraph"><strong>Books</strong></p>



<ol class="wp-block-list"><li><a href="https://amzn.to/3T3Sl2V" target="_blank" rel="noreferrer noopener">Charu C. Aggarwal (2016) Recommender Systems: The Textbook</a></li><li><a href="https://amzn.to/3D0P8eX" target="_blank" rel="noreferrer noopener">Kin Falk (2019) Practical Recommender Systems</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3D0gB0e" target="_blank" rel="noreferrer noopener">Oliver Theobald (2020) Machine Learning For Absolute Beginners: A Plain English Introduction</a></li><li><a href="https://amzn.to/3S9Nfkl" target="_blank" rel="noreferrer noopener">Aurélien Géron (2019) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems </a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background wp-block-paragraph"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p class="wp-block-paragraph"><strong>Articles</strong></p>



<ol class="wp-block-list"><li><a href="https://surprise.readthedocs.io/en/stable/getting_started.html" target="_blank" rel="noreferrer noopener">Getting started with Suprise</a></li><li><a href="https://onespire.hu/sap-news-en/history-of-recommender-systems/" target="_blank" rel="noreferrer noopener">About the history of Recommender Systems</a></li><li><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener"></a><a href="https://www.freecodecamp.org/news/singular-value-decomposition-vs-matrix-factorization-in-recommender-systems-b1e99bc73599/" target="_blank" rel="noreferrer noopener">Singular value decomposition vs. matrix factorization</a></li><li><a href="https://proceedings.neurips.cc/paper/2007/file/d7322ed717dedf1eb4e6e52a37ea7bcd-Paper.pdf" target="_blank" rel="noreferrer noopener">Probabilistic Matrix Factorization</a></li></ol>
<p>The post <a href="https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/">Build a High-Performing Movie Recommender System using Collaborative Filtering in Python</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/building-a-movie-recommender-using-collaborative-filtering/4376/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4376</post-id>	</item>
	</channel>
</rss>
