Building a Movie Recommender using Collaborative Filtering in Python

Intelligent recommender systems rank among the most fascinating use cases for machine learning. Many of the major tech companies use such systems extensively to personalize the content shown to their customers. In this article, we will look at collaborative filtering as a common technique for building recommender systems and show how you can implement this approach to generate movie recommendations in Python.

Large tech companies such as Netflix (TV shows and movies), Amazon (products), or Facebook (user content and profiles) all face the same basic challenge of having a huge amount of content and only limited space available to display it to their users.

IMDB Movie Recommendations

Therefore, it has become crucial for them to select and display only the content that matches the individual interests of their users. This is where intelligent recommendation systems come into play. They can generate personalized recommendations on a large scale by analyzing behavior patterns among larger groups of users to tailor suggestions to the taste of individuals.

The rest of this article is structured as follows: We begin by briefly going through the basics of different types of recommender systems. Then we will look at the most common recommender algorithms and go into more detail on collaborative filtering. Once we have equipped ourselves with this conceptual understanding, we will develop our recommender system using the popular 100k Movies Dataset. We will train and test a recommender engine that uses Singular Value Decomposition (SVD) to predict movie ratings.

An Overview of Recommender Techniques

The first attempts with recommendation systems reach back to the 1970s. The approach was relatively simple and categorized users into groups to suggest the same content to all users in the same group. However, it was a breakthrough because, for the first time, a program could make personalized recommendations.

With the rise of the Internet and the rapidly growing amount of data available on the web, filtering relevant content has become increasingly important. Large tech companies in particular, such as Amazon or Netflix, understood early on that they could use recommendation systems to address the individual needs of their customers. As the importance of recommendation engines increased, so did the interest in improving their predictions.

Three Common Approaches to Recommender Systems

Today, most recommender systems use one of the following three techniques:

Three common approaches to recommender systems

Content-based Filtering 

Content-based Filtering is a technique that recommends similar items based on item content. Naturally, this approach is based on metadata to find out which items are similar. For example, in the case of movie recommendations, the algorithm would look at the genre, cast, or director of a movie. Models may also consider metadata on users such as age, gender, etc., to suggest similar content to similar users. The similarity can be calculated using different methods, for example, Cosine Similarity or Minkowski distance.
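
To make the similarity idea more concrete, here is a minimal sketch of cosine similarity between movies. The genre vectors and the helper names (cosine_similarity, most_similar) are made up for illustration and are not part of the dataset or code used later in this article.

```python
import numpy as np

# Hypothetical one-hot genre vectors: [action, adventure, sci-fi, romance]
movies = {
    "Star Wars":     np.array([1, 1, 1, 0]),
    "Indiana Jones": np.array([1, 1, 0, 0]),
    "Notting Hill":  np.array([0, 0, 0, 1]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(title, catalog):
    """Rank all other titles in the catalog by their similarity to `title`."""
    scores = {other: cosine_similarity(catalog[title], vec)
              for other, vec in catalog.items() if other != title}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = most_similar("Star Wars", movies)
print(ranking)
```

A content-based recommender would suggest the titles at the top of this ranking to users who liked the reference movie. In practice, the vectors would be built from richer metadata such as cast, director, or keywords.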

A major challenge in content-based filtering is the transferability of user preference insights from one item type to another. Often, content-based recommenders struggle to transfer user actions on one item (e.g., book) to other content types (e.g., clothing). In addition, content-based systems tend to develop some tunnel vision, which leads the engine to recommend more and more of the same.

Collaborative Filtering

Collaborative Filtering is a well-established approach used to build recommendation systems. The recommendations generated through collaborative filtering are based on past interactions between a user and a set of items (movies, products, etc.) that are matched against past item-user interactions within a larger group of people. The main idea is to use the interactions between a group of users and a group of items to guess how users rate items that they have not yet rated before.

A particular challenge of collaborative filters is known as the cold start problem: This problem refers to the entry of new users into the system without any ratings. As a result, the engine does not know their interests and cannot make any meaningful recommendations. The same applies to new items entering the system (e.g., products) that have not yet received any ratings. This can lead to a second problem where recommendations become self-reinforcing. Popular content that has been rated by many users is also recommended to almost all other users, which makes this content even more popular. On the other hand, the engine hardly recommends content with few or no ratings, so that no one will rate this content.

Hybrid Approach

It is also possible to combine the previous two techniques in a hybrid approach. The result is a model that considers both the interactions between users and items and context information. Hybrid approaches can be implemented by generating content-based and collaborative predictions separately and then combining them. Hybrid recommender systems often achieve better results than approaches that rely purely on one of the underlying techniques.
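
One simple way to combine separately generated predictions is a weighted average. The sketch below only illustrates that idea. The function name hybrid_score, the weight parameter, and the example scores are assumptions, and real systems often use more elaborate blending schemes.

```python
def hybrid_score(content_score, collaborative_score, weight=0.5):
    """Blend two predicted ratings; `weight` is the share of the collaborative score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * collaborative_score + (1.0 - weight) * content_score

# Hypothetical predicted ratings for one user and one movie
blended = hybrid_score(content_score=3.0, collaborative_score=4.0, weight=0.75)
print(blended)
```

A lower weight could be used for new users, for whom the collaborative prediction is less reliable because few or no ratings exist yet.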

Netflix is known to use a hybrid recommendation system. Its engine recommends content to its users based on the viewing and search habits of similar users (collaborative filtering). At the same time, it also recommends series and movies whose characteristics match the content that users have rated highly (content-based filtering).

How Model-based Collaborative Filtering Works

We can further differentiate between memory-based and model-based collaborative filtering. This article focuses on model-based collaborative filtering, which is more commonly used. Let’s take a closer look at how model-based collaborative filtering works.

Behavioral Patterns: Dependencies among Users and Items

Collaborative filtering searches for behavioral patterns in interactions between a group of users and a group of items to infer the interests of individuals. The input data for collaborative filtering is typically in the form of a user/item matrix filled with ratings (as shown below).

In the user/item matrix, patterns can exist in dependencies between users and items. Some dependencies are easy to grasp. For example, assume two users, Ron and Joe, have rated movies. Ron enjoyed Batman, Indiana Jones, Star Wars, and Godzilla. Joe enjoyed the same movies as Ron, except Godzilla, which he has not yet rated. Based on the similarity between Joe and Ron, we would assume that Joe would also enjoy Godzilla. Similar dependencies exist between items in that some movies receive high ratings from the same users.

user/movies matrix
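
The Ron and Joe example can be sketched with a small user/item matrix in pandas. Note that this is a simple neighborhood-style (memory-based) illustration of the intuition, not the model-based approach we implement later. The ratings, the third user Amy, and the helper functions are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical star ratings; NaN marks a movie the user has not rated yet
ratings = pd.DataFrame(
    {
        "Batman":        [5.0, 5.0, 2.0],
        "Indiana Jones": [4.0, 4.0, 5.0],
        "Star Wars":     [5.0, 5.0, 1.0],
        "Godzilla":      [4.0, np.nan, 2.0],
    },
    index=["Ron", "Joe", "Amy"],
)

def similarity(u, v):
    """Pearson correlation on the movies that both users have rated."""
    shared = u.notna() & v.notna()
    return float(np.corrcoef(u[shared], v[shared])[0, 1])

def nearest_neighbor_prediction(user, movie, matrix):
    """Use the rating of the most similar user who has rated `movie`."""
    candidates = [o for o in matrix.index
                  if o != user and pd.notna(matrix.loc[o, movie])]
    best = max(candidates, key=lambda o: similarity(matrix.loc[user], matrix.loc[o]))
    return best, float(matrix.loc[best, movie])

neighbor, joe_godzilla = nearest_neighbor_prediction("Joe", "Godzilla", ratings)
print(neighbor, joe_godzilla)
```

Because Joe's ratings match Ron's on every movie they have both rated, Ron is selected as the nearest neighbor and his Godzilla rating is used as the estimate for Joe.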

Things get more complex, as there are also latent dependencies present in the data. Imagine Bob gave a three-star rating to five different movies. Another user, Jenny, rated the same movies as Bob but always gave four stars. This is an example of latent dependency. There is some form of dependency between the two users, and although it is not as significant as in the first example, considering latent dependencies will improve predictions.

Machine Learning and Dimensionality Reduction

Model-based collaborative filtering techniques estimate the parameters of statistical models to predict how individual users would rate an unrated item. A widely used approach formulates this problem as a supervised learning task in which the known ratings in the user/item matrix serve as labels and the model learns to predict the missing entries (as shown in the matrix). Such an optimization task can be solved by various algorithms, including gradient-based techniques or techniques such as alternating least squares.

However, user/item matrices can become very large, which makes searching for patterns computationally expensive. Also, users will typically rate only a tiny fraction of the items in the matrix, so that algorithms need to deal with an abundant number of missing values (sparse matrix). Therefore, it has become state of the art to combine machine learning or deep learning with techniques for dimensionality reduction.
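
To get a feeling for how sparse such a matrix is, we can pivot a list of ratings into a user/item matrix and count the empty cells. The following sketch uses a tiny, made-up ratings table. Only the column names follow the ratings file that we load later.

```python
import pandas as pd

# Hypothetical ratings in "long" format: one row per (user, movie, rating)
ratings_long = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 40, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Pivot into a user/item matrix; unrated combinations become NaN
user_item = ratings_long.pivot(index="userId", columns="movieId", values="rating")

n_cells = user_item.shape[0] * user_item.shape[1]
n_rated = int(user_item.notna().sum().sum())
sparsity = 1 - n_rated / n_cells  # share of cells without a rating
print(user_item.shape, sparsity)
```

Even in this toy example, most cells are empty. With thousands of users and movies, the share of missing values is usually far higher.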

One of the most widely used techniques for dimensionality reduction is matrix factorization. The idea behind this approach is to approximate the initial sparse user/item matrix by the product of two much smaller matrices that represent users and items as vectors of latent features (as shown below). These matrices are densely populated and thus easier to handle, and they also enable the model to uncover latent dependencies among items and users, which increases model accuracy.

Matrix Factorization applied to the sparse Items / User Matrix.
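
The following sketch shows the basic mechanics of matrix factorization with plain NumPy and stochastic gradient descent. It is a simplified illustration under our own assumptions (toy ratings, two latent factors, hand-picked learning rate and regularization) and not the implementation used by the Surprise library.

```python
import numpy as np

def factorize(R, n_factors=2, n_epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Approximate R by P @ Q.T, fitting only the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(n_epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Hypothetical user/item matrix; NaN marks the rating we want to estimate
R = np.array([
    [5.0, 4.0, 5.0, 4.0],
    [5.0, 4.0, 5.0, np.nan],
    [1.0, 2.0, 1.0, 2.0],
])
P, Q = factorize(R)
R_hat = P @ Q.T  # dense matrix of estimated ratings

mask = ~np.isnan(R)
train_mae = np.abs(R_hat[mask] - R[mask]).mean()
print(np.round(R_hat, 2), train_mae)
```

The product of the two factor matrices contains an estimate for every cell, including the one that was missing in the input. This is how a factorization model fills the gaps in the user/item matrix.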

Python Libraries for Collaborative Filtering

So far, only a few Python libraries support model-based collaborative filtering out of the box. The most well-known libraries for recommender systems are probably Scikit-Surprise and Fast.ai for PyTorch.

Below you find an overview of the different algorithms that these libraries support.

different algorithms used to train recommender systems based on collaborative filtering

Implementing a Movie Recommender in Python using Collaborative Filtering

Now it’s time to get our hands dirty and begin with the implementation of our movie recommender. As always, you find the code in the relataly GitHub repository.

Prerequisites

Before we begin with the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment set up yet, consider the Anaconda Python environment. Follow this tutorial to set it up.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, and matplotlib.

In addition, we will be using seaborn for visualization and the recommender systems library Scikit-Surprise. You can install the Surprise package from conda-forge with the following command:

  • conda install -c conda-forge scikit-surprise

The other packages can be installed using standard console commands:

  • pip install <package name>
  • conda install <package name> (if you are using the Anaconda package manager)
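
If you are unsure whether everything is installed, a quick check like the following can help. This is just a convenience sketch. The helper name missing_packages is our own, and the list contains the import names used in this tutorial (scikit-surprise is imported as surprise).

```python
from importlib.util import find_spec

# Import names of the packages used in this tutorial
# (scikit-surprise is imported as "surprise")
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "surprise"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("missing packages:", missing or "none")
```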

The Movies Dataset

We will train our movie recommendation model on a popular Movies Dataset (you can download it here). The ratings in this dataset were collected by the MovieLens recommendation service. Unpack the data into the working folder of your project.

The full dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. However, we will be working with the subset “ratings_small.csv,” which, according to the dataset description, contains roughly 100,000 ratings from about 700 users on about 9,000 movies. The code in the following steps prints the exact number of ratings and users.

The dataset contains the following files, from which we will only use the first two:

  • movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
  • ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a single 5-star movie rating with half-star increments (0.5 stars – 5.0 stars).

Files that are included but that we won’t use:

  • keywords.csv
  • credits.csv
  • links.csv
  • links_small.csv

Source of the data description: Kaggle.com

Step #1: Load the Data

Make sure you have downloaded and unpacked the data and installed the required packages.

You can then load the movie data into our Python project using the code snippet below. We do not need all of the files in the movie dataset and only work with the following two.

  • movies_metadata.csv
  • ratings_small.csv

1.1 Load the Movies Data

First, we will load the movies_metadata, which contains a list of all movies and meta information such as the release year, a short description, etc.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split, cross_validate
from ast import literal_eval

# in case you have placed the files outside of your working directory, you need to specify a path
path = '' # for example: 'data/movie_recommendations/'  

# load the movie metadata
df_moviesmetadata=pd.read_csv(path + 'movies_metadata.csv', low_memory=False) 
print(df_moviesmetadata.shape)
print(df_moviesmetadata.columns)
df_moviesmetadata.head(1)

1.2 Load the Ratings Data

We proceed by loading the rating file. This file contains the movie ratings for each user, along with the movieId and a timestamp.

In addition, we plot the value counts for the ratings in our dataset.

# load the movie ratings
df_ratings=pd.read_csv(path + 'ratings_small.csv', low_memory=False) 

print(df_ratings.shape)
print(df_ratings.columns)
df_ratings.head(3)

# count how often each rating value occurs, ordered by rating value
sns.set_theme(style="whitegrid")
ratings_count = df_ratings['rating'].value_counts().sort_index()
sns.barplot(x=ratings_count.index, y=ratings_count.values, color="b")

As we can see, the majority of the ratings in our dataset are positive.

Step #2: Preprocessing and Cleaning the Data

We continue with the preprocessing of the data. Collaborative filtering relies solely on the interactions between users and items. This means that training a prediction model does not require the meta-information of the movies. Nevertheless, we will load the metadata because it is nicer to display the recommendations with the movie title, release year, and so on, instead of just ids.

2.1 Clean the Movies Data

Unfortunately, the data quality of the movies’ metadata is not great, so we need to fix a few things. The following operations will change some data types to integer, extract the release year and genres, and remove some records with wrong data.

# remove records with invalid ids
df_mmeta = df_moviesmetadata.drop([19730, 29503, 35587])

df_movies = pd.DataFrame()

# keep the title so that we can display it together with the recommendations
df_movies['title'] = df_mmeta['title']

# extract the release year (NaN if the release date is missing or invalid)
df_movies['year'] = pd.to_datetime(df_mmeta['release_date'], errors='coerce').dt.year

# extract genres
df_movies['genres'] = df_mmeta['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# add vote count as integer (missing values are set to 0)
df_movies['vote_count'] = pd.to_numeric(df_mmeta['vote_count'], errors='coerce').fillna(0).astype('int')

# change the index to movie_id
df_movies['movieId'] = pd.to_numeric(df_mmeta['id'])
df_movies = df_movies.set_index('movieId')
df_movies

2.2 Clean the Ratings Data

Compared to the movie metadata, not much more needs to be done to the rating data. Here we just put the timestamp into a readable format.

One of the next steps is to use the Reader class from the Surprise library to parse the ratings and put them into a format compatible with standard recommendation algorithms from the Surprise library. The Reader needs the data in the format where each line contains only one rating and respects the following structure:

user ; item ; rating ; [timestamp]

# drop na values (copy to avoid modifying a view of df_ratings)
df_ratings_temp = df_ratings.dropna().copy()

# convert the Unix timestamp (in seconds) to datetime
df_ratings_temp['timestamp'] = pd.to_datetime(df_ratings_temp['timestamp'], unit='s')

print(f'unique users: {len(df_ratings_temp.userId.unique())}, ratings: {len(df_ratings_temp)}')
df_ratings_temp.head()

Step #3: Split the Data in Train and Test

Next, we will split the data into train and test sets. This ensures that we can later evaluate the performance of our recommender model on data that the model has not yet seen.

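# The Reader class is used to parse a file containing ratings.
# The file is assumed to specify only one rating per line, such as in the df_ratings_temp file above.
# The ratings in this dataset range from 0.5 to 5 stars, so we set the rating scale accordingly.
reader = Reader(rating_scale=(0.5, 5))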
ratings_by_users = Dataset.load_from_df(df_ratings_temp[['userId', 'movieId', 'rating']], reader)

# Split the Data into train and test
train_df, test_df = train_test_split(ratings_by_users, test_size=.2)

Once we have split the data into train and test, we can train the recommender model.

Step #4: Train a Movie Recommender using Model-Based Collaborative Filtering

Training the SVD model requires only two lines of code. The first line creates an untrained model that uses matrix factorization for dimensionality reduction (an approach closely related to Probabilistic Matrix Factorization). The second line fits this model to the training data.

# train an SVD model
svd_model = SVD()
svd_model_trained = svd_model.fit(train_df)
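
To see what such a model computes once it is trained, consider the following NumPy sketch. According to the Surprise documentation, the SVD estimate is the sum of the global mean rating, a user bias, an item bias, and the dot product of the user and item factor vectors, clipped to the rating scale. The function name and all parameter values below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(0.5, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    estimate = (global_mean + user_bias + item_bias
                + float(np.dot(item_factors, user_factors)))
    low, high = rating_scale
    return min(high, max(low, estimate))

# Hypothetical learned parameters for one user and one movie
est = predict_rating(
    global_mean=3.5,
    user_bias=0.25,
    item_bias=0.5,
    user_factors=np.array([0.5, -1.0]),
    item_factors=np.array([1.0, 0.5]),
)
print(est)
```

During training, the model adjusts the biases and factor vectors so that these estimates match the known ratings as closely as possible.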

Step #5: Evaluate Prediction Performance using Cross-Validation

Next, it is time to validate the performance of our movie recommendation program. For this, we use k-fold cross-validation. As a reminder, cross-validation involves splitting the dataset into different folds and then measuring the prediction performance based on each fold.
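
The splitting logic behind k-fold cross-validation can be sketched in a few lines of NumPy. This is only an illustration of the idea with a hypothetical helper (k_fold_indices). In our implementation, the Surprise library handles the splitting for us.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is in exactly one test fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(f"train on {len(train_idx)} samples, test on {len(test_idx)} samples")
```

The model is trained and evaluated once per fold, which yields k performance values instead of a single one.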

We can measure model performance using different indicators, such as the mean absolute error (MAE), the mean squared error (MSE), or the root mean squared error (RMSE). We chose the MAE because it is easy to interpret: in our case, it is the average absolute difference between the predicted rating and the actual rating of a movie.
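
As a quick illustration of these indicators, the following sketch computes them by hand for a handful of made-up ratings. The values are hypothetical and unrelated to the results of our model.

```python
import numpy as np

# Hypothetical actual and predicted ratings for five user/movie pairs
actual    = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0, 4.5])

errors = predicted - actual
mae  = np.mean(np.abs(errors))  # mean absolute error
mse  = np.mean(errors ** 2)     # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize large deviations more strongly than the MAE.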

# 10-fold cross validation 
cross_val_results = cross_validate(svd_model_trained, ratings_by_users, measures=['RMSE', 'MAE', 'MSE'], cv=10, verbose=False)
test_mae = cross_val_results['test_mae']

# mean absolute error per fold
df_test_mae = pd.DataFrame(test_mae, columns=['Mean Absolute Error'])
df_test_mae.index = np.arange(1, len(df_test_mae) + 1)
df_test_mae.sort_values(by='Mean Absolute Error', ascending=False).head(15)

# plot an overview of the performance per fold
plt.figure(figsize=(6,4))
sns.set_theme(style="whitegrid")
sns.barplot(y='Mean Absolute Error', x=df_test_mae.index, data=df_test_mae, color="b")
# plt.title('Mean Absolute Error')

movie recommender performance

The chart above shows that the mean absolute deviation of our predictions from the actual ratings is a little below 0.7. This is not terrific, but ok for a first model. In addition, there are no significant differences between the performance in the different folds. Let’s keep in mind that the MAE says little about possible outliers in the predictions. However, since the ratings are bounded (0.5 to 5 stars), the influence of outliers is naturally limited.

Step #6: Generate Predictions

Finally, we will use our movie recommender to generate a list of suggested movies for a specific test user. The predictions will be based on the user’s previous movie ratings.

# predict ratings for a single user_id and for all movies
user_id = 400 # some test user from the ratings file

# create the predictions
pred_series = []
df_ratings_filtered = df_ratings[df_ratings['userId'] == user_id]
# look-up table with the ratings that this user has already given
user_ratings = df_ratings_filtered.set_index('movieId')['rating'].to_dict()

print(f'number of ratings: {df_ratings_filtered.shape[0]}')
for movie_id, name in zip(df_movies.index, df_movies['title']):
    # use the user's own rating if the movie has already been rated, otherwise 0
    rating_real = user_ratings.get(movie_id, 0)
    # generate the prediction
    rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=False)
    # add the prediction to the list of predictions
    pred_series.append([movie_id, name, rating_pred.est, rating_real])

# print the results
df_recommendations = pd.DataFrame(pred_series, columns=['movieId', 'title', 'predicted_rating', 'actual_rating'])
df_recommendations.sort_values(by='predicted_rating', ascending=False).head(15)
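
The list above also contains movies that the user has already rated. If we only want to recommend unseen movies, we can filter them out first. The sketch below uses a small, made-up table with the same column names as df_recommendations. The helper name top_n_unseen is our own.

```python
import pandas as pd

# Hypothetical prediction table; actual_rating == 0 means "not rated by this user"
df_preds = pd.DataFrame({
    "movieId":          [1, 2, 3, 4, 5],
    "title":            ["A", "B", "C", "D", "E"],
    "predicted_rating": [4.6, 4.4, 3.1, 4.8, 2.5],
    "actual_rating":    [0.0, 5.0, 0.0, 0.0, 0.0],
})

def top_n_unseen(df, n=3):
    """Return the n movies with the highest predicted rating that the user has not rated."""
    unseen = df[df["actual_rating"] == 0]
    return unseen.nlargest(n, "predicted_rating")

top = top_n_unseen(df_preds, n=3)
print(top[["title", "predicted_rating"]])
```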

Alternatively, we can predict how well a specific user will rate a movie. In this case, we have to pass the user_id and the movie_id for which we want the model to make the prediction.

# predict ratings for the combination of user_id and movie_id
user_id = 217 # some test user from the ratings file
movie_id = 4002
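# assumes that this user has rated this movie; otherwise .values[0] raises an IndexError
rating_real = df_ratings.query(f'movieId == {movie_id} & userId == {user_id}')['rating'].values[0]
# look up the title of the same movie (assumes that the id exists in the metadata index)
movie_title = df_movies[df_movies.index == movie_id]['title'].values[0]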

print(f'Movie title: {movie_title}')
print(f'Actual rating: {rating_real}')

# predict and show the result
rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=True)

Summary

Congratulations! You now know how to develop a movie recommendation system in Python. The model developed in this article is an SVD model that uses matrix factorization and collaborative filtering to predict movie ratings for a given user. We also showed how we could perform cross-validation on the movies dataset and use the model to generate movie recommendations.

If you like the post, please let me know in the comments, and don’t forget to subscribe to our Twitter account to stay up to date on upcoming articles.

Author

  • Hi, I am Florian, a Zurich-based consultant for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.
