Machine Learning Use Cases in the Retail Industry

Building a Virtual AI Assistant (aka Copilot) for Your Software Application: Harnessing the Power of LLMs like ChatGPT

Florian Follonier — Wed, 05 Jul 2023 12:45:27 +0000

Welcome to the dawn of a new era in digital interaction! With the advent of Generative AI, we’re witnessing a remarkable revolution that’s changing the very nature of how we interact with software and digital services. This change is monumental. Leading the charge are the latest generation of AI-powered virtual assistants, aka “AI copilots”. Unlike traditional narrow AI models, these are capable of understanding user needs, intents, and questions expressed in plain, natural language.

We are talking about nothing less than the next evolution in software design and user experience, driven by recent advances in generative AI and Large Language Models (LLMs) like OpenAI’s ChatGPT, Google Bard, or Anthropic’s Claude.

Thanks to LLMs, user interactions are no longer bound by the constraints of a traditional user interface with forms and buttons. Whether it’s creating a proposal in Word, editing an image, or opening a claim in an insurance app, users can express their needs in natural language – a profound change in our interactions with software and services.

Despite the hype about these new virtual AI assistants, practical guidance on how to build an LLM-powered virtual assistant remains scarce. So, if you wonder how to take advantage of LLMs and build a virtual assistant for your app, this article is for you. This post examines the overarching components needed to create a virtual AI assistant. We will look at the architecture and its components, including LLMs, Knowledge Store, Cache, Conversation Logic, and APIs.

The new generation of virtual AI assistants inspires a profound change in the way we interact with software and digital services.

Virtual AI Assistants: The Example of Microsoft M365 Copilot

Advances in virtual AI assistants are closely linked to ChatGPT and other LLMs from US-based startup OpenAI. Microsoft has forged a partnership with OpenAI to bring the latest advances in AI to their products and services. Microsoft has announced these “Copilots” across major applications, including M365 and the Power Platform.

Here are some capabilities of these Copilots within M365:

In PowerPoint, Copilot allows users to create presentations based on a given context, such as a Word document, for example by stating “Create a 10-slide product presentation based on the following product documentation.”
In Word, Copilot can adjust the tone of a text or transform a few keywords into a complete paragraph. Simply type something like “Create a proposal for a 3-month contract for customer XYZ based on doc ADF.”
In Excel, Copilot helps users with analyzing datasets, as well as with creating or modifying them. For example, it can summarize a dataset in natural language and describe trends.
Let’s not forget Outlook! Your new AI Copilot helps you organize your emails and calendar. It assists you in crafting email responses, scheduling meetings, and even provides summaries of key points from the ones you missed.

If you want to learn more about Copilot in M365, this YouTube video provides an excellent overview: Microsoft 365 Copilot Explained: How Microsoft Just Changed the Future of Work. However, these are merely a handful of examples. The potential of AI copilots extends far beyond the scope of Office applications and can elevate any software or service to a new level. No wonder large software companies like SAP and Adobe have announced plans to upgrade their products with copilot features.

Microsoft has announced a whole fleet of virtual AI assistants for its products. These range from copilots in M365 office apps to services of its Azure cloud platform.

How LLMs Enable a New Generation of Virtual AI Assistants

Virtual AI assistants are anything but new. Indeed, their roots can be traced back to innovative ventures such as the paperclip assistant, Clippy, from Microsoft Word – a pioneering attempt at enhancing user experience. Later on, this was followed by the introduction of conventional chatbots.

Nonetheless, these early iterations had their shortcomings. Their limited capacity to comprehend and assist users with tasks outside of their defined parameters hampered their success on a larger scale. The inability to adapt to a wider range of user queries and requests kept these virtual AI assistants confined within their initial scope, restricting their growth and wider acceptance. So if we talk about this next generation of virtual AI assistants, what has truly revolutionized the scene? In essence, the true innovation lies in the emergence of LLMs such as OpenAI’s GPT-4.

LLMs – A Game Changer for Conversational User Interface Design

Over time, advancements in machine learning, natural language processing, and vast data analytics transformed the capabilities of AI assistants. Modern AI models, like GPT-4, can understand context, engage in more human-like conversations, and offer solutions to a broad spectrum of queries. Furthermore, the integration of AI assistants into various devices and platforms, along with the increase in cloud computing, expanded their reach and functionality. These technological shifts have reshaped the scene, making AI assistants more adaptable, versatile, and user-friendly than ever before.

Take, for example, an AI model like GPT. A user might instruct, “Could you draft an email to John about the meeting tomorrow?” Not only would the AI grasp the essence of this instruction, but it could also produce a draft email seamlessly.

Yet, it’s not solely their adeptness at discerning user intent that sets LLMs apart. They also exhibit unparalleled proficiency in generating programmatic code to interface with various software functions. Imagine directing your software with, “Generate a pie chart that visualizes this year’s sales data by region,” and witnessing the software promptly fulfilling your command.

A Revolution in Software Design and User Experience

The advanced language understanding offered by LLMs unburdens developers from the painstaking task of constructing every possible dialog or function an assistant might perform. Rather, developers can harness the generative capabilities of LLMs and integrate them with their application’s API. This integration facilitates a myriad of user options without the necessity of explicitly designing them.

The outcome of this is far-reaching, extending beyond the immediate relief for developers. It sets the stage for a massive transformation in the software industry and the broader job market, affecting how developers are trained and what skills are prioritized. Furthermore, it alters our everyday interaction with technology, making it more intuitive and efficient.

Components of a Modern Virtual AI Assistant aka AI Copilot

By now you should have some idea of what modern virtual AI assistants are. Next, let’s look at the technical components that need to come together.

The illustration below displays the main components of an LLM-powered virtual AI assistant:

A – Conversational UI for providing the user with a chat experience
B – LLMs such as GPT-3.5 or GPT-4
C – Knowledge store for grounding your bot in enterprise data and dynamically providing few-shot examples.
D – Conversation logic for intent recognition and tracking conversations.
E – Application API as an interface to trigger and perform application functionality.
F – Cache for maintaining an instant mapping between often encountered user intents and structured LLM responses.

Let’s look at these components in more detail.

A) Conversational Application Frontend

Incorporating virtual AI assistants into a software application or digital service often involves the use of a conversational user interface, typically embodied in a chat window that showcases previous interactions. The seamless integration of this interface as an intrinsic part of the application is vital.

A lot of applications employ a standard chatbot methodology, where the virtual AI assistant provides feedback to users in natural language or other forms of content within the chat window. Yet, a more dynamic and efficacious approach is to merge natural language feedback with alterations in the traditional user interface (UI). This dual approach not only enhances user engagement but also improves the overall user experience.

Microsoft’s M365 Copilot is a prime example of this approach. Instead of simply feeding responses back to the user in the chat window, the virtual assistant also manipulates elements in the traditional UI based on user input. It may highlight options, auto-fill data, or direct the user’s attention to certain parts of the screen. This combination of dynamic UI manipulation and natural language processing creates a more interactive and intuitive user experience, guiding the user toward their goal in a more efficient and engaging way.

M365 Copilot chat window in M365 Office

When designing the UI for a virtual AI assistant, there are several key considerations. Firstly, the interface should be intuitive, ensuring users can easily navigate and understand how to interact with the AI. Secondly, the AI should provide feedback in a timely manner, so the user isn’t left waiting for a response. Thirdly, the system should be designed to handle errors gracefully, providing helpful error messages and suggestions when things don’t go as planned. Finally, the AI should keep the human in the loop and help them use AI in a safe way.

Also: Building “Chat with your Data” Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore

B) Large Language Model

At the interface between users and the assistant sits the large language model. It translates users’ requests and questions into code, actions, and responses that are shown to the user. Here, we are talking about foundation models like GPT-3.5-Turbo or GPT-4. In addition, if you are working with extensive content, you may use an embedding model that converts text or images into mathematical vectors as part of your knowledge store. An example of such an embedding model is text-embedding-ada-002.

It’s important to understand that the user is not directly interacting with the LLM. Instead, you may want to put some control logic between the user and the LLM that steers the conversation. This logic can enrich prompts with additional data from the knowledge store or an online search API such as Google or Bing. This process of injecting data into a prompt depending on the user input is known as Retrieval Augmented Generation.
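The prompt enrichment described above could look roughly like the following sketch. The function name build_augmented_prompt, the instruction wording, and the sample passages are illustrative assumptions; retrieving the passages from a knowledge store or search API is left out here.

```python
from typing import List


def build_augmented_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Combine retrieved passages and the user's question into one prompt."""
    # Keep only the first few passages so the prompt stays small.
    selected = passages[:max_passages]
    context = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(selected))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical passages that a knowledge store might have returned.
prompt = build_augmented_prompt(
    "How do I open an insurance claim?",
    ["Claims are opened under 'My Policies'.", "A claim needs a policy number."],
)
print(prompt)
```

The resulting string is what the control logic would hand to the LLM in place of the raw user input.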

Typical tasks performed by the LLM:

Generating natural language responses based on the user’s query and the retrieved data from the knowledge store.
Recognizing and classifying user intent.
Generating code snippets (or API requests) that can be executed by the application or the user to achieve a desired outcome in your application.
Converting content into embeddings to retrieve relevant information from a vector-based knowledge store.
Generating summaries, paraphrases, translations, or explanations of the retrieved data or the generated responses.
Generating suggestions, recommendations, or feedback for the user to improve their experience or achieve their goals.

C) Knowledge Store

Let’s dive into the “Knowledge Store” and why it’s vital. You might think feeding a huge prompt explaining app logic to your LLM, like ChatGPT, would work, but that’s not the case. As of June 2023, LLMs have context limits. For instance, GPT-3.5-Turbo can handle up to 4k tokens, which corresponds to roughly 3,000 English words. This limit covers input and output combined. Hence, cramming everything into one prompt is neither efficient nor quick.

Instead, pair your LLM with a knowledge store, like a vector database (more on this in our article on Vector Databases). Essentially, this is your system’s information storage, which efficiently retrieves data. Whichever storage you use, a search algorithm is crucial to fetch items based on user input. For vector databases, the typical way of doing this is by using similarity search.
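To illustrate similarity search, here is a minimal sketch that ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors and the document ids are made up for readability; real embeddings have hundreds or thousands of dimensions, and a vector database would use an index instead of scanning every entry.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy knowledge store: document id -> embedding (hypothetical values).
store = {
    "open_claim": [0.9, 0.1, 0.0],
    "reset_password": [0.0, 0.2, 0.9],
    "claim_status": [0.8, 0.3, 0.1],
}
results = top_k([1.0, 0.2, 0.0], store)
print([doc_id for doc_id, _ in results])  # ['open_claim', 'claim_status']
```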

Token Limitations

Curious about GPT models’ token limits? Here’s a quick breakdown:

GPT-3.5-Turbo Model (4,000 tokens): About 7-8 DIN A4 pages
GPT-4 Standard Model (8,000 tokens): Around 14-16 DIN A4 pages
GPT-3.5-Turbo-16K Model (16,000 tokens): Approximately 28-32 DIN A4 pages
GPT-4-32K Model (32,000 tokens): Estimated at 56-64 DIN A4 pages
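Because the limit covers the prompt and the answer together, it can help to check the budget before calling a model. The sketch below uses the rounded limits from the list above and the common rule of thumb of about four characters per token for English text; the dictionary keys and the reserved_for_output parameter are assumptions for illustration, and a real tokenizer would give exact counts.

```python
# Rounded context limits as listed above (the exact values differ slightly).
CONTEXT_LIMITS = {
    "gpt-3.5-turbo": 4_000,
    "gpt-4": 8_000,
    "gpt-3.5-turbo-16k": 16_000,
    "gpt-4-32k": 32_000,
}


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token for English text."""
    return max(1, len(text) // 4)


def fits_context(model: str, prompt: str, reserved_for_output: int = 500) -> bool:
    """Check whether the prompt plus the reserved answer budget fits the limit."""
    return estimate_tokens(prompt) + reserved_for_output <= CONTEXT_LIMITS[model]


long_prompt = "word " * 4_000  # 20,000 characters, roughly 5,000 tokens
print(fits_context("gpt-3.5-turbo", long_prompt))  # False
print(fits_context("gpt-4", long_prompt))          # True
```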

D) Conversation Control Logic

The conversation needs a conductor to ensure it stays in harmony and doesn’t go off the rails. This is the role of the conversation logic. An integral part of your app’s core software, the conversation logic bridges all the elements to deliver a seamless user experience. It includes several subcomponents. Meta prompts, for instance, help guide the conversation in the desired direction and provide some boundaries to the activities of the assistant. For example, the meta prompt may include a list of basic intent categories that help the LLM understand what the user wants to do.

Another subcomponent is the connection to the knowledge store that allows the assistant to draw from a vast array of data to augment prompts handed over to the large language model. Moreover, the logic incorporates checks on the assistant’s activities and its generated content. These checks act like safety nets, mitigating risks and preventing unwanted outcomes. It’s akin to a quality control mechanism, keeping the assistant’s output in check and safeguarding against responses that might derail the user’s experience or even break the application.
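A meta prompt with intent categories and a simple check on the model's reply might be sketched as follows. The intent names, the prompt wording, and the fallback to "other" are hypothetical choices, not a prescribed design; the LLM call itself is omitted.

```python
from typing import List

# Hypothetical intent categories for a customer portal.
INTENTS: List[str] = ["open_ticket", "show_invoice", "change_address", "other"]


def build_meta_prompt(intents: List[str]) -> str:
    """Create a system-style prompt that restricts the model to known intents."""
    options = ", ".join(intents)
    return (
        "You are the assistant of a customer portal. "
        f"Classify the user's request as exactly one of: {options}. "
        "Reply with the category name only."
    )


def validate_intent(llm_reply: str, intents: List[str]) -> str:
    """Normalize the reply and fall back to 'other' if it is not an allowed intent."""
    candidate = llm_reply.strip().lower().rstrip(".")
    return candidate if candidate in intents else "other"


print(build_meta_prompt(INTENTS))
print(validate_intent(" Open_Ticket. ", INTENTS))   # open_ticket
print(validate_intent("delete all data", INTENTS))  # other
```

The validation step is the kind of safety net described above: whatever the model returns, only an intent from the allowed list reaches the rest of the application.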

E) Application API

Users expect their commands to initiate actions within your application. To fulfill these expectations, the application needs an API that can interact with various app functions. Consider the API as the nerve center of your app, facilitating access to its features and user journey. This API enables the AI assistant to guide users to specific pages, fill in forms, execute tasks, display information, and more. Tools like Microsoft Office even have their own language for this, while Python code, SQL statements, or generic REST requests usually suffice for most applications.

Applications based on a microservice architecture have an edge in this regard, as APIs are inherent to their design. If your application misses some APIs, remember, there’s no rush to provide access to all functions from the outset. You can start by supporting basic functionalities via chat and gradually expand over time. This allows you to learn from user interactions, continuously refine your offering, and ensure your AI assistant remains a useful and efficient tool for your users.
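One way to expose a small, growing set of functions to the assistant is a registry of allowed actions that the conversation logic dispatches to. The sketch below assumes the LLM has been prompted to answer with JSON of the form {"action": ..., "arguments": {...}}; that format, the action decorator, and open_ticket are illustrative assumptions.

```python
import json
from typing import Any, Callable, Dict

# Registry of functions the assistant is allowed to trigger.
ACTIONS: Dict[str, Callable[..., str]] = {}


def action(name: str):
    """Register a function as an action the assistant is allowed to trigger."""
    def decorator(func):
        ACTIONS[name] = func
        return func
    return decorator


@action("open_ticket")
def open_ticket(category: str, description: str) -> str:
    # A real application would call its own API here.
    return f"Draft ticket created in '{category}': {description}"


def dispatch(llm_output: str) -> str:
    """Parse the model's JSON reply and call the matching registered action."""
    try:
        request: Dict[str, Any] = json.loads(llm_output)
        handler = ACTIONS[request["action"]]
        return handler(**request.get("arguments", {}))
    except (json.JSONDecodeError, KeyError, TypeError) as error:
        # Unknown actions, malformed JSON, and wrong arguments are rejected.
        return f"Request rejected: {error!r}"


reply = '{"action": "open_ticket", "arguments": {"category": "billing", "description": "Wrong amount"}}'
print(dispatch(reply))
print(dispatch('{"action": "drop_database"}'))
```

Supporting a new function via chat then only requires registering another action, which fits the approach of starting small and expanding over time.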

With the core components in place, one more piece is worth covering before we look at the workflow of a modern virtual assistant: the cache.

F) Cache

Adding a cache to your virtual AI assistant can significantly boost performance and decrease response times. Particularly useful for frequently recurring user intents, caching stores the outcomes of these intents for quicker access in future instances. However, a well-designed cache shouldn’t be keyed on raw user inputs, as natural language varies too much for identical inputs to recur often. Instead, caching could be woven into an intermediate step of your prompt flow, for example after intent recognition, so that the recognized intent serves as the cache key.

This strategy ensures frequently repeated intents are handled more swiftly, enhancing user experience. It’s critical to remember that cache integration is application-specific, and thoughtful design is vital to avoid unintentionally inducing inefficiencies.

While a well-implemented cache can speed up responses, it also introduces additional complexity. Effective cache management is crucial for avoiding resource drains, requiring strategies for data storage duration, updates, and purging.

The exact impact and efficiency of this caching strategy will depend on your application specifics, including the distribution and repetition of user intents. In the upcoming articles, we’ll explore this topic further, discussing efficient cache integration in AI assistant systems.

An example of a caching technology would be Redis.
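As a minimal sketch of such an intent-keyed cache, the class below keeps structured responses in memory and purges entries after a time-to-live. The class name IntentCache and the TTL policy are assumptions for illustration; a production setup would more likely rely on a dedicated store such as Redis with its own expiry handling.

```python
import time
from typing import Dict, Optional, Tuple


class IntentCache:
    """In-memory cache keyed by recognized intent, with a time-to-live."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, intent: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(intent)
        if entry is None:
            return None
        stored_at, response = entry
        if now - stored_at > self.ttl_seconds:
            del self._entries[intent]  # purge expired entries on read
            return None
        return response

    def put(self, intent: str, response: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[intent] = (now, response)


cache = IntentCache(ttl_seconds=60)
cache.put("open_ticket", '{"action": "open_ticket"}', now=0.0)
print(cache.get("open_ticket", now=30.0))  # {"action": "open_ticket"}
print(cache.get("open_ticket", now=90.0))  # None, the entry has expired
```

The explicit now parameter exists only to make the expiry behavior easy to demonstrate and test.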

Considerations on the Architecture of Virtual AI Assistants

Designing a virtual AI assistant is an intricate process that blends cutting-edge technology with a keen understanding of user behavior. It’s about creating an efficient tool that not only simplifies tasks and optimizes workflows but also respects and preserves user autonomy. This section of our article will delve into the key considerations that guide the architecture of a virtual AI assistant. We’ll discuss the importance of user control, the strategic selection and use of GPT models, the benefits of starting simple, and the potential expansion as you gain confidence in your system’s stability and efficiency. As we journey through these considerations, remember the ultimate goal: creating a virtual AI assistant that augments user capabilities, enhances user experience, and breathes new life into software applications.

Keep the User in Control

At the heart of any virtual AI assistant should be the principle of user control. While automation can optimize tasks and streamline workflows, it is crucial to remember that your assistant is there to assist, not usurp. Balancing AI automation with user control is essential to crafting a successful user experience.

Take, for instance, the scenario of a user wanting to open a support ticket within your application. In this situation, your assistant could guide the user to the correct page, auto-fill known details like the user’s name and contact information, and even suggest possible problem categories based on the user’s descriptions. By doing so, the virtual AI assistant has significantly simplified the process for the user, making it quicker and less burdensome.

However, the user retains control throughout the process, making the final decisions. They can edit the pre-filled details, choose the problem category, and write the issue description in their own words. They’re in command, and the virtual AI assistant is there to assist, helping to avoid errors, speed up the process, and generally make the experience smoother and more efficient.

This balance between user control and AI assistance is not only about maintaining a sense of user agency; it is also about trust. Users need to trust that the AI is there to help them, not to take control away from them. If the AI seems too controlling or makes decisions that the user disagrees with, this can erode trust and hinder user acceptance.

Mix and Match Models

Another crucial consideration is the use of different GPT models. Each model comes with its own set of strengths, weaknesses, response times, costs, and token limits. It’s not just about capabilities. Sometimes, it’s unnecessary to deploy a complex GPT-4 model for simpler tasks in your workflow. Alternatives like Ada or GPT-3.5-Turbo might be more suitable and cost-effective for functions like intent recognition.

Reserve the heavy-duty models for tasks requiring an extended token limit or dealing with complex operations. One such task is the final augmented prompt that creates the API call. If you’re working with a vector database, you’ll also need an embedding model. Be mindful that these models come with different vector sizes, and once you start building your database with a specific size, it can be challenging to switch without re-embedding and migrating your entire vector content.
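A simple way to mix and match models is a lookup from task to model, combined with a guard on the embedding size. The task names and the assignment of models to tasks below are hypothetical; the 1536 dimensions are the vector size of text-embedding-ada-002.

```python
from typing import Dict, List

# Hypothetical mapping of workflow steps to models.
MODEL_BY_TASK: Dict[str, str] = {
    "intent_recognition": "gpt-3.5-turbo",
    "api_call_generation": "gpt-4",
    "embedding": "text-embedding-ada-002",
}

# Vector sizes of the embedding models in use.
EMBEDDING_DIMENSIONS: Dict[str, int] = {"text-embedding-ada-002": 1536}


def pick_model(task: str, default: str = "gpt-3.5-turbo") -> str:
    """Return the model configured for a task, or a cheap default."""
    return MODEL_BY_TASK.get(task, default)


def check_dimensions(vector: List[float], model: str) -> None:
    """Reject vectors whose size does not match the model the database was built with."""
    expected = EMBEDDING_DIMENSIONS[model]
    if len(vector) != expected:
        raise ValueError(f"{model} yields {expected}-dimensional vectors, got {len(vector)}")


print(pick_model("intent_recognition"))   # gpt-3.5-turbo
print(pick_model("api_call_generation"))  # gpt-4
check_dimensions([0.0] * 1536, pick_model("embedding"))  # passes silently
```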

Think Big but Start Simple

It’s always a good idea to start simple – maybe with a few intents to kick things off. As you gain experience and confidence in building virtual assistant apps, you can gradually integrate additional intents and API calls. And don’t forget to keep your users involved! Consider incorporating a feedback mechanism, allowing users to report any issues and suggest improvements. This will enable you to fine-tune your prompts and database content effectively.

As your application becomes more comprehensive, you might want to explore model fine-tuning for specific tasks. However, this step should be considered only when your virtual AI assistant functionality has achieved a certain level of stability. Fine-tuning a model can be quite costly, especially if you decide to change the intent categories after training.

Digital LLM-based Assistants – A Major Business Opportunity

From a business standpoint, upgrading software products and services with LLM-powered virtual AI assistants gives vendors a significant opportunity to differentiate in the market and even innovate on their business model. Many organizations are already contemplating the inclusion of virtual assistants as part of subscription packages or premium offerings. As the market evolves, software lacking a natural language interface may be perceived as outdated and struggle to compete.

AI-powered virtual assistants are likely to inspire a whole new generation of software applications and enable a new wave of digital innovations. By enhancing convenience and efficiency in user inputs, virtual assistants unlock untapped potential and boost productivity. Moreover, they empower users to fully leverage the diverse range of features offered by software applications, which often remain underutilized.

I strongly believe that LLM-driven virtual AI assistants are the next milestone in software design and will revolutionize software applications across industries. And remember, this is just the first generation of virtual assistants. The future possibilities are virtually endless and we can’t wait to see what’s next! Indeed, the emergence of natural language interfaces is expected to trigger a ripple effect of subsequent innovations, for example, in areas such as standardization, workflow automation, and user experience design.

Summary

In this article, we delved into the fascinating world of virtual AI assistants, powered by LLMs. We started by exploring how the advanced language understanding of LLMs is revolutionizing software design, easing the workload of developers, and reshaping user experiences with technology.

Next, we provided an overview of the key architectural components of a modern virtual AI assistant: the Conversational Application Frontend, Large Language Model, Knowledge Store, and Conversation Control Logic. We also introduced the concept of an Application API and the idea of a Cache for storing and quickly retrieving responses to common user intents. Each component was discussed in terms of its role and how the components work together to create a seamless, interactive, and efficient user experience.

We then discussed architecture considerations, emphasizing the necessity of maintaining user control while leveraging the power of AI automation. We talked about the judicious use of different GPT models based on task requirements, the advantages of starting with simple implementations and progressively scaling up, and the benefits of user feedback in continuously refining the system.

This journey of ‘AI in Software Applications’, from concept to reality, isn’t just about innovation. It’s about unlocking ‘Innovative Business Models with AI’ and boosting user engagement and productivity. As we continue to ride the wave of ‘Natural Language Processing for Software Automation’, the opportunities for harnessing the power of virtual AI assistants are endless. Stay tuned as we explore the workflows further in the next article.

In this article, we have gone through the components of an LLM-powered virtual assistant aka “AI copilot”. In the next article, we will dive deeper into the processing logic and follow a prompt into the engine of an intelligent assistant.

The post Building a Virtual AI Assistant (aka Copilot) for Your Software Application: Harnessing the Power of LLMs like ChatGPT appeared first on relataly.com.

Building “Chat with your Data” Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore

Florian Follonier — Sat, 27 May 2023 13:25:08 +0000

Artificial Intelligence (AI), in particular, the advent of OpenAI’s ChatGPT, has revolutionized how we interact with technology. Chatbots powered by this advanced language model can engage users in intricate, natural language conversations, marking a significant shift in AI capabilities. However, one thing that ChatGPT isn’t designed for is integrating personalized or proprietary knowledge – it’s built to draw upon general knowledge, not specifics about you or your organization. That’s where the concept of Retrieval Augmented Generation (RAG) comes into play. This article explores the exciting prospect of building your own ChatGPT that lets users ask questions on a custom knowledge base.

In this tutorial, we’ll unveil the mystery behind enterprise ChatGPT, guiding you through the process of creating your very own custom ChatGPT – an AI-powered chatbot based on OpenAI’s powerful Generative Pretrained Transformers (GPT) technology. We’ll use Python and delve into the world of vector databases, specifically Azure Cosmos DB for MongoDB vCore, to show you how you can make a large knowledge base available to ChatGPT that goes well beyond the typical token limitation of GPT models.

For experts, AI fans, or tech newbies, this guide simplifies building your ChatGPT. With clear instructions, useful examples, and tips, we aim to make it informative and empowering.

We’ll explore AI, showing you how to customize your chatbot. We’ll simplify complex concepts and show you how to start your AI adventure from home or office. Ready to start this exciting journey? Keep reading!

Note on the use of Vector DBs and Costs.

Please note that this tutorial describes a business use case that utilizes a Cosmos DB for Mongo DB vCore hosted on the Azure cloud.

Alternatively, you can set up an open-source vector database on your local machine, such as Milvus. Be aware that certain code adjustments will be necessary to proceed with the open-source alternative.

Why Custom ChatGPT is so Powerful and Versatile

I believe we have all tested ChatGPT, and probably like me, you have been impressed by its remarkable capabilities. However, ChatGPT has a significant limitation: it can only answer questions and perform tasks based on the public knowledge base it was trained on.

Imagine having a chatbot based on ChatGPT that communicates effectively and truly understands the nuances of your business, sector, or even a particular topic of interest. That’s the power of a custom ChatGPT. A tailor-made chatbot allows for specialized conversations, providing the needed information and drawing from a unique database you’ve developed.

This becomes particularly beneficial in industries with specific terminologies or when you have a large database of knowledge that you want to make easily accessible and interactive. A custom ChatGPT, with its personalized and relevant responses, ensures a better user experience, effectively saving time and increasing productivity.

Let’s delve into how to build such a solution. Spoiler: it does not work by putting all the content into the prompt. But there is a great alternative.

Understanding the Building Blocks of Custom ChatGPT with Retrieval Augmented Generation

The foundational technology behind ChatGPT is OpenAI’s Generative Pre-trained Transformer (GPT) models. These models learn language by predicting the next token in a sequence and are trained on a diverse range of internet text. However, GPT models have a limited context window; GPT-3.5, for example, can process only 4,096 tokens at a time. A token in this context is a chunk of text that can be as small as one character or as long as one word. For example, the phrase “ChatGPT is great” is four tokens long.

Another challenge with Foundation Models such as ChatGPT is that they are trained on large-scale datasets that were available at the time of their training. This means they are not aware of any data created after their training period. Also, because they’re trained on broad, general-domain datasets, they may be less effective for tasks requiring domain-specific knowledge.

How Retrieval Augmented Generation (RAG) Helps

Retrieval-Augmented Generation (RAG) is a method that combines the strength of transformer models with external knowledge to augment their understanding and applicability. Here’s a brief explanation:

To address the limitations described above, RAG retrieves relevant information from an external data source and uses this information to augment the input to the foundation model. This can make the model’s responses more informed and relevant.

Data Sources

The external data can come from various sources like databases, document repositories, or APIs. To make this data compatible with the RAG approach, both the data and user queries are converted into numerical representations (embeddings) using language models.

Data Preparation as Embeddings

To circumvent the token limitation and make your extensive data available to ChatGPT, we turn the data into embeddings. These are mathematical representations of your data, converting words, sentences, or documents into vectors. The advantage of using embeddings is that they capture the semantic meaning of the text, going beyond keywords to understand the context. In essence, similar information will have similar vectors, allowing us to cluster related information together and separate it from semantically different text.

Storing the Data in Vector Databases

The embeddings, which are essentially vectors, need to be stored in a database that can efficiently store and search such high-dimensional data. This is where Azure Cosmos DB for MongoDB vCore comes into play: it offers a vector search capability for exactly this task.

Matching Queries to Knowledge

The RAG model compares the embeddings of user queries with those in the knowledge base to identify relevant information. The user’s original query is then augmented with context from similar documents in the knowledge base.

Input to the Foundation Model

This augmented input is sent to the foundation model, enhancing its understanding and response quality.

Updates

Importantly, the knowledge base and associated embeddings can be updated asynchronously, ensuring that the model remains up-to-date even as new information is added to the data sources.

In sum, RAG extends the utility of foundation models by incorporating external, up-to-date, domain-specific knowledge into their understanding and output.

By incorporating these components, you’ll be creating a robust custom ChatGPT that not only understands the user’s queries but also has access to your own information, giving it the ability to respond with precision and relevance.

Ready to dive into the technicalities? Stay tuned!

A tailor-made chatbot allows for specialized conversations, providing the exact information needed, drawing from a unique database that you’ve developed.

Building the Custom “Chat with Your Data” App in Python

Now that we’ve discussed the theory behind building a custom ChatGPT and seen some exciting real-world applications, it’s time to put our knowledge into action! In this practical segment of our guide, we’re going to demonstrate how you can build a custom ChatGPT solution using Python.

Our project will involve storing a sample PDF document in Azure Cosmos DB for MongoDB vCore and developing a chatbot capable of answering questions based on the content of this document. This practical exercise will guide you through the entire process, including turning your PDF content into embeddings, storing these embeddings in the database, and finally integrating it all with ChatGPT to build an interactive chatbot.

If you’re new to Python, don’t worry, we’ll be breaking down the code and explaining each step in a straightforward manner. Let’s roll up our sleeves, fire up our Python environments, and get coding! Stay tuned as we embark on this exciting hands-on journey into the world of custom chatbots.

The code is available on the GitHub repository.


How to Set Up Vector Search in Cosmos DB

First, you must understand that you will need a database to store the embeddings. It does not necessarily have to be a vector database. Still, this type of database will make your solution more performant and robust, particularly when you want to store large amounts of data.

Azure Cosmos DB for MongoDB vCore is the first MongoDB-compatible offering to feature Vector Search. With this feature, you can store, index, and query high-dimensional vector data directly in Azure Cosmos DB for MongoDB vCore, eliminating the need for data transfer to alternative platforms for vector similarity search capabilities. Here are the steps to set it up:

Choose Your Azure Cosmos DB Architecture: Azure Cosmos DB for MongoDB provides two types of architectures, RU-based and vCore-based. Each has its strengths and is best suited for certain types of applications. Choose the one that best fits your needs. If you’re looking to lift and shift existing MongoDB apps and run them as-is on a fully supported managed service, the vCore-based option could be the perfect fit.
Configure Your Vector Search: Once your database architecture is set up, you can integrate your AI-based applications, including those using OpenAI embeddings, with your data already stored in Cosmos DB.
Build and Deploy Your AI Application: With Vector Search set up, you can now build an AI application that takes advantage of this feature. As suggested next steps, you can create a Go app using Azure Cosmos DB for MongoDB or deploy Azure Cosmos DB for MongoDB vCore using a Bicep template.

Azure Cosmos DB for MongoDB vCore’s Vector Search feature is a game-changer for AI application development. It enables you to unlock new insights from your data, leading to more accurate and powerful applications.

Azure Cosmos DB for MongoDB Usage Models

Azure Cosmos DB for MongoDB offers two options to choose from: the Request Unit (RU) Database Account and the vCore Cluster. Each option follows a different pricing model to suit diverse needs.

The Request Unit (RU) Database Account operates on a pay-per-use basis. With this model, you are billed based on the number of requests and the level of provisioned throughput consumed by your workload.

As of 27 May 2023, the brand-new vector search function is only available for the vCore Cluster option, which is why we will use this setup for this tutorial. The vCore Cluster offers a reserved managed instance. Under this option, you are charged a fixed amount on a monthly basis, providing more predictable costs for your usage.

Once you have created your vCore instance, you must collect your connection string and make it available to your Python script. You can do this either by storing it in Azure Key Vault (which I would recommend) or by storing it locally on your computer or in the code (which I would not recommend for obvious security reasons).
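For a quick local experiment without Key Vault, an environment variable at least keeps the secret out of the source code. The following is a minimal sketch under that assumption. The variable name `COSMOS_CONNECTION_STRING` and the helper `get_connection_string` are made up for this example, and the value set below is only a placeholder.

```python
import os

def get_connection_string(env_var="COSMOS_CONNECTION_STRING"):
    """Read the connection string from an environment variable.

    The variable name is an assumption made for this sketch.
    """
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            "Set it in your shell or in a local .env file that is "
            "excluded from version control."
        )
    return value

# Placeholder value so the sketch runs; never hard-code a real connection string
os.environ["COSMOS_CONNECTION_STRING"] = "mongodb+srv://<user>:<password>@<cluster-host>/"
connection_string = get_connection_string()
print("connection string loaded")
```

If you keep the value in a local .env file, the `load_dotenv()` call from the python-dotenv package can populate the environment before this function runs.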

Azure Cosmos DB for Mongo DB is a new offering that is designed explicitly for vector use cases (incl. embeddings)

Using other Vector Databases

While Cosmos DB is a popular choice for storing vectors, I would like to note that other options are available in the market. You can still benefit from this tutorial if you decide to utilize a different vector database, such as Pinecone or Chroma. However, you will need to adjust the code to the APIs and functionalities of the specific vector database you choose.

Specifically, you will need to modify the “insert embedding functions” and “similarity search functions” to align with the requirements and capabilities of your chosen vector database. These functions typically have variations that are specific to each vector database.

By customizing the code according to your selected vector database’s API, you can successfully adapt the tutorial to suit your specific database choice. This allows you to leverage the principles and concepts this tutorial covers, regardless of the vector database you opt for.
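One way to keep such adjustments contained is to hide the database behind a small interface with just the two operations mentioned above. The sketch below is an assumption about how you might structure this, not part of the tutorial code: `VectorStore`, `InMemoryVectorStore`, and `find_context` are hypothetical names, and the in-memory class merely stands in for a real database client.

```python
from typing import List, Protocol, Sequence

class VectorStore(Protocol):
    """The two operations that differ between vector databases."""
    def insert(self, text: str, embedding: Sequence[float]) -> int: ...
    def search(self, query: Sequence[float], k: int = 2) -> List[str]: ...

class InMemoryVectorStore:
    """Toy store that keeps rows in a list and ranks them by dot product."""
    def __init__(self):
        self._rows = []

    def insert(self, text, embedding):
        self._rows.append((text, list(embedding)))
        return len(self._rows) - 1  # position of the new row serves as its id

    def search(self, query, k=2):
        def score(row):
            return sum(q * v for q, v in zip(query, row[1]))
        ranked = sorted(self._rows, key=score, reverse=True)
        return [text for text, _ in ranked[:k]]

def find_context(store: VectorStore, query_embedding, k=2):
    """Application code depends only on the interface, not on a specific database."""
    return store.search(query_embedding, k)

store = InMemoryVectorStore()
store.insert("about taxes", [1.0, 0.0])
store.insert("about holidays", [0.0, 1.0])
print(find_context(store, [0.9, 0.1], k=1))
```

Switching databases then means writing one new class with the same two methods, while the rest of the application stays unchanged.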

Also: Vector Databases: The Rising Star in Generative AI Infrastructure

Prerequisites

Before diving into the code, it’s essential to ensure that you have the proper setup for your Python 3 environment and have installed all the necessary packages. If you do not have a Python environment, follow the instructions in this tutorial to set up the Anaconda Python environment. This will provide you with a robust and versatile environment well-suited for machine learning and data science tasks.

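In this tutorial, we will be working with several libraries:

openai
pymongo
PyPDF2
python-dotenv

Should you decide to use Azure Key Vault, then you also need the following Python libraries:

azure-identity
azure-keyvault-secrets

You can install the OpenAI Python library using console commands:

pip install openai
conda install openai (if you are using the Anaconda package manager)

Note that the code in this tutorial uses the pre-1.0 interface of the openai library (for example, openai.Embedding.create and openai.ChatCompletion.create). If a newer release is installed by default, pin a version below 1.0, e.g. pip install "openai<1".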

Step #1 Authentication and DB Setup

Let’s start with the authentication and setup of the API keys. After making the necessary imports, the code gets things ready to connect to the essential services – OpenAI and Cosmos DB – and makes sure it can access these services properly.

Fetching Credentials: The script starts by setting up a connection to a service called Azure Key Vault to retrieve some crucial credentials securely. These are like “passwords” that the script needs to access various resources.
Setting Up AI Services: Then, it defines setup functions for two OpenAI endpoints. One is hosted by Azure (Azure OpenAI), and the other is the standard, public OpenAI API. The script below activates the public one.
Establishing Database Connection: Lastly, the script sets up a connection to a database service, specifically to a certain collection within the Cosmos DB database. The script also checks if the connection to the database was successful by sending a “ping” – if it receives a response, it knows the connection is good.

from azure.identity import AzureCliCredential
from azure.keyvault.secrets import SecretClient
import openai
import pymongo
from dotenv import load_dotenv
# Load variables from a local .env file into the environment (if one exists)
load_dotenv()
# Set up the Azure Key Vault client used to retrieve the OpenAI and Cosmos DB secrets
keyvault_name = ''
openaiservicename = ''
client = SecretClient(f"https://{keyvault_name}.vault.azure.net/", AzureCliCredential())
print('keyvault service ready')
# AzureOpenAI Service
def setup_azureopenai():
    openai.api_key = client.get_secret('openai-api-key').value
    openai.api_type = "azure"
    openai.api_base = f'https://{openaiservicename}.openai.azure.com'
    openai.api_version = '2023-05-15'
    print('azure openai service ready')
# public openai service
def setup_public_openai():
    openai.api_key = client.get_secret('openai-api-key-public').value
    print('public openai service ready')
DB_NAME = "hephaestus"
COLLECTION_NAME = 'isocodes'
def setup_cosmos_connection():
    COSMOS_CLUSTER_CONNECTION_STRING = client.get_secret('cosmos-cluster-string').value
    cosmosclient = pymongo.MongoClient(COSMOS_CLUSTER_CONNECTION_STRING)
    db = cosmosclient[DB_NAME]
    collection = cosmosclient[DB_NAME][COLLECTION_NAME]
    # Send a ping to confirm a successful connection
    try:
        cosmosclient.admin.command('ping')
        print("Pinged your deployment. You successfully connected to MongoDB!")
    except Exception as e:
        print(e)
    return collection, db
setup_public_openai()
collection, db = setup_cosmos_connection()

Now we have set things up to interact with our Azure Cosmos DB for MongoDB vCore instance.

Step #2 Functions for Populating the Vector DB

Next, we prepare and insert data into the database as embeddings. First, we prepare the content. The preparation process involves turning the text content into embeddings. Each embedding is a list of floats representing the meaning of a specific part of the text in a way the AI system can understand.

We create the embeddings by sending text (for example, a paragraph of a document) to an OpenAI embedding model that returns the embedding. There are two options for using OpenAI: You can use the Azure OpenAI engine and deploy your own Ada embedding model. Alternatively, you can use the public OpenAI Ada embedding model.

We’ll use the public OpenAI’s text-embedding-ada-002. Remember that the model is designed to return embeddings, not text. Model inference may incur costs based on the data processed. Refer to OpenAI or Azure OpenAI service for pricing details.

Finally, the code inserts the prepared requests (which now include both the original text and the corresponding embeddings) into the database. The function returns the unique IDs assigned to these newly inserted items in the database. In this way, the code processes and stores the necessary information in the database for later use.

# prepare content for insertion into cosmos db
def prepare_content(text_content):
  embeddings = create_embeddings_with_openai(text_content)
  request = [
    {
    "textContent": text_content, 
    "vectorContent": embeddings}
  ]
  return request
# create embeddings
def create_embeddings_with_openai(input):
    #print('Generating response from OpenAI...')
    ###### uncomment for AzureOpenAI model usage and comment code below
    # embeddings = openai.Embedding.create( 
    #     engine='', 
    #     input=input)["data"][0]["embedding"]
    ###### public openai model usage and comment code above
    embeddings = openai.Embedding.create(
        model='text-embedding-ada-002', 
        input=input)["data"][0]["embedding"]
    
    # Length of the embedding vector (1536 for text-embedding-ada-002)
    # print(len(embeddings))
    return embeddings
# insert the requests
def insert_requests(text_input):
    request = prepare_content(text_input)
    return collection.insert_many(request).inserted_ids
# Creates a searchable index for the vector content
def create_index():
  
  # delete and recreate the index. This might only be necessary once.
  collection.drop_indexes()
  embedding_len = 1536
  print(f'creating index with embedding length: {embedding_len}')
  db.command({
    'createIndexes': COLLECTION_NAME,
    'indexes': [
      {
        'name': 'vectorSearchIndex',
        'key': {
          "vectorContent": "cosmosSearch"
        },
        'cosmosSearchOptions': {
          'kind': 'vector-ivf',
          'numLists': 100,
          'similarity': 'COS',
          'dimensions': embedding_len
        }
      }
    ]
  })
# Resets the DB and deletes all values from the collection to avoid duplicates
#collection.delete_many({})

Step #3 Document Cracking and Populating the DB

The next step is to break down the PDF document into smaller chunks of text (in this case, ‘records’) and then process these records for future use. You can repeat this process for any document that you want to make available to OpenAI.

You can use any PDF you like, as long as it contains readable text (run OCR first if the document is a scan). For demo purposes, I will use a tax document from Zurich. Put the document in the folder data/vector_db_data/ in your root folder and provide the name to the Python script.

Want to read in many documents at once? Read the PDF file names from the folder into a list, then wrap the slicing and insert calls in a for loop that iterates through that list.
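A minimal sketch of such a loop is shown below. The helper names `list_pdf_files` and `ingest_folder` are made up for this example. The slicing and insert functions are passed in as arguments so that the sketch stays independent of any particular database.

```python
from pathlib import Path

def list_pdf_files(folder):
    """Return the PDF files in a folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

def ingest_folder(folder, slice_fn, insert_fn, max_sentences=20):
    """Slice every PDF in the folder and insert each resulting record.

    slice_fn(path, max_sentences) must return a list of text records;
    insert_fn(record) stores one record. Returns the number of records inserted.
    """
    total = 0
    for pdf_path in list_pdf_files(folder):
        for record in slice_fn(str(pdf_path), max_sentences):
            insert_fn(record)
            total += 1
    return total
```

With the functions defined in this tutorial, a call could look like `ingest_folder('../data/vector_db_data', slice_pdf_into_records, insert_requests)`.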

#3.1 Document Slicing Considerations

To convert a PDF into embeddings, the first step is to divide it into smaller content slices. The slicing process plays a crucial role as it affects the information provided to the OpenAI GPT model when answering user questions. If the slices are too large, the model may encounter token limitations. Conversely, if they are too small, the model may not receive sufficient content to answer the question effectively. It is important to strike a balance between the number of slices and their length to optimize the results, considering that the search process may yield multiple outcomes.
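One way to reason about this balance is to give the retrieved slices a fixed budget and stop adding slices once the budget is used up. The sketch below is a rough illustration under simplifying assumptions: it counts characters as a crude proxy for tokens, the default of 6000 characters is arbitrary, and `build_context` is a hypothetical helper that is not part of the tutorial code.

```python
def build_context(search_results, max_chars=6000):
    """Concatenate retrieved texts in ranked order until the budget is used up.

    search_results: iterable of dicts with a "textContent" field,
                    ordered from most to least similar.
    max_chars:      character budget (a crude stand-in for a token limit).
    """
    parts = []
    used = 0
    for doc in search_results:
        text = (doc.get("textContent") or "").strip()
        if not text:
            continue  # skip empty results
        if used + len(text) > max_chars:
            break  # the next slice would exceed the budget
        parts.append(text)
        used += len(text)
    return "\n---\n".join(parts)

results = [
    {"textContent": "First retrieved slice."},
    {"textContent": ""},
    {"textContent": "Second retrieved slice."},
]
print(build_context(results, max_chars=30))
```

With small slices, several results fit into the budget; with large slices, only the top result may fit, which is the trade-off described above.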

There are several approaches to handle the slicing process. One option is to define the slices based on a specific number of sentences or paragraphs. Alternatively, you can iteratively slice the document, allowing for some overlap between the data in the vector database. This approach has the advantage of providing more precise information to answer questions, but it also increases the data volume in the vector database, which can impact speed and cost considerations.
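The overlapping variant can be sketched as a sliding window over a list of sentences. This is an illustration only: the function name `sliding_window_chunks` and the parameters `window_size` and `overlap` are assumptions for this example, and the tutorial code below uses non-overlapping slices instead.

```python
def sliding_window_chunks(sentences, window_size=5, overlap=2):
    """Group sentences into chunks of up to window_size sentences.

    Neighbouring chunks share `overlap` sentences.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window_size]))
        if start + window_size >= len(sentences):
            break  # this window already reached the end of the text
    return chunks

sentences = [f"Sentence {i}." for i in range(1, 8)]
for chunk in sliding_window_chunks(sentences, window_size=4, overlap=2):
    print(chunk)
```

A larger overlap makes it less likely that an answer is split across two slices, at the price of more embeddings to create and store.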

#3.2 Running the code below to crack a document and insert embeddings into the vector DB

Running the code below will first define a function that breaks text into separate paragraphs wherever it finds blank lines. Another function slices the PDF into records. Each record contains a certain number of sentences (the maximum is defined by the ‘max_sentences’ value). We use a Python library called PyPDF2 to extract text from each page of the PDF and Python’s built-in regular expressions to split the text into sentences and paragraphs. Note that if you want to achieve better results, you could also use a professional document content extraction tool such as Azure Form Recognizer.

The code then opens a specific PDF file (‘zurich_tax_info_2023.pdf’) and slices it into records, each containing no more than a certain number of sentences (as defined by ‘max_sentences’). After that, it inserts these records into the vector database and creates the vector search index. Finally, we print the count of documents in the database collection, which shows how many records are now stored in it.

# document cracking functions: slice a PDF into text records for embedding
import re
import PyPDF2
def split_text_into_paragraphs(text):
    # paragraphs are separated by one or more blank lines
    paragraphs = re.split(r'\n{2,}', text)
    return paragraphs
def slice_pdf_into_records(pdf_path, max_sentences):
    records = []
    
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        
        for page in reader.pages:
            # fall back to an empty string for pages without extractable text
            text = page.extract_text() or ''
            paragraphs = split_text_into_paragraphs(text)
            
            current_record = ''
            sentence_count = 0
            
            for paragraph in paragraphs:
                # split after sentence-ending punctuation followed by whitespace
                sentences = re.split(r'(?<=[.!?])\s+', paragraph)
                
                for sentence in sentences:
                    if not sentence.strip():
                        continue  # skip empty fragments
                    # keep a space between sentences so words do not run together
                    current_record += sentence.strip() + ' '
                    sentence_count += 1
                    
                    if sentence_count >= max_sentences:
                        records.append(current_record.strip())
                        current_record = ''
                        sentence_count = 0
            
            # If there is remaining text after the loop, add it as a record
            if current_record.strip():
                records.append(current_record.strip())
    
    return records
# get file from root/data folder
pdf_path = '../data/vector_db_data/zurich_tax_info_2023.pdf'
max_sentences = 20  # Adjust the slice size as per your requirement
result = slice_pdf_into_records(pdf_path, max_sentences)
# print the length of result
print(f'{len(result)} vectors created with maximum {max_sentences} sentences each.')
# Insert every record into the DB and print a preview of the first five
for i, record in enumerate(result):
    insert_requests(record)
    if i < 5:
        print(record[0:100])
        print('-------------------')
create_index()
print(f'number of records in the vector DB: {collection.count_documents({})}')

After slicing the document and inserting the embeddings into the vector database, we can proceed with functions for similarity search and prompting.

Step #4 Functions for Similarity Search and Prompts to ChatGPT

This section of code provides a set of functions to perform a vector search in the Cosmos DB, make a request to the ChatGPT 3.5 Turbo model for generating responses, and create prompts for the OpenAI model to use in generating those responses.

#4.1 How the Search Part Works

Here is a concise explanation of how the search process operates. We have now reached the stage where a user poses a question, and we use the OpenAI models to supply an answer that draws on our vector database. It’s vital to understand that the embedding model first transforms the question into an embedding. The vector database is then searched for stored embeddings that are similar to it, that is, for content that matches the information requested in the user’s prompt.

The vector database returns the most similar records, and our code inserts them into another prompt tailored for ChatGPT. This model, distinct from the embedding model, generates text. Thus, the final request to the ChatGPT model contains both the user’s question and the results from the vector database, which are the passages most relevant to the question. This combination should help the model provide an appropriate answer. Now, let’s turn our attention to the corresponding code.

#4.2 Setting up the Functions for Vector Search

The vector_search function takes as input a query vector (representing a user’s question in vector form) and an optional parameter to limit the number of results. It then conducts a search in the Cosmos DB, looking for entries whose vector content is most similar to the query vector.

Next, the openai_request function makes a request to OpenAI’s ChatGPT 3.5 Turbo model to generate a response. This function takes a formatted conversation history (or ‘prompt’) and sends it to the model, which then generates a response. The content of the generated response is then returned.

The create_prompt function constructs the conversation history for the OpenAI model. This function takes the user’s question and a list containing the results of the database search and constructs a list of system and user messages. This list then serves as the prompt for the ChatGPT model. The system message instructs the model to answer only from the provided sources and to reply “I don’t know” otherwise, and the user message asks it to translate the response to English. The constructed prompt is then returned by the function.

# Cosmos DB Vector Search API Command
def vector_search(vector_query, max_number_of_results=2):
  results = collection.aggregate([
    {
      '$search': {
        "cosmosSearch": {
          "vector": vector_query,
          "path": "vectorContent",
          "k": max_number_of_results
        },
        "returnStoredSource": True
      }
    }
  ])
  return results
# openAI request - ChatGPT 3.5 Turbo Model
def openai_request(prompt, model_engine='gpt-3.5-turbo'):
    completion = openai.ChatCompletion.create(model=model_engine, messages=prompt, temperature=0.2, max_tokens=500)
    return completion.choices[0].message.content
# define the OpenAI prompt: system instructions plus the user's question and the retrieved sources
def create_prompt(user_question, result_json):
    instructions = ('You are an assistant that answers questions based on sources provided. '
                    'If the information is not in the provided source, you answer with "I don\'t know". ')
    task = (f"{user_question} Translate the response to english \n"
            f"source: {result_json}")
    
    prompt = [{"role": "system", "content": instructions }, 
              {"role": "user", "content": task }]
    return prompt

You can easily change the voice and tone in which the ChatGPT answers questions by including the respective instructions in the create_prompt function.
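As an illustration of this idea, the sketch below adds a tone parameter to the prompt. It is a hypothetical variant rather than the tutorial's own function: the name `create_prompt_with_tone`, the `tone` parameter, and the wording of the extra instruction are assumptions made for this example.

```python
def create_prompt_with_tone(user_question, sources, tone="friendly and concise"):
    """Build chat messages whose system instruction also sets the answer's tone."""
    instructions = (
        "You are an assistant that answers questions based on sources provided. "
        'If the information is not in the provided source, you answer with "I don\'t know". '
        f"Answer in a {tone} tone."
    )
    task = f"{user_question}\nsource: {sources}"
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": task},
    ]

prompt = create_prompt_with_tone(
    "When do I have to submit my tax return?",
    ["<retrieved text goes here>"],
    tone="formal",
)
print(prompt[0]["content"])
```

The returned list has the same shape as the output of create_prompt, so it can be passed to openai_request in the same way.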

Also: ChatGPT Style Guide: Understanding Voice and Tone Prompt Options for Engaging Conversations

Step #5 Testing the Custom ChatGPT Solution

This part of the code works with the previous functions to run a complete question-answering cycle with Cosmos DB and OpenAI’s ChatGPT 3.5 Turbo model.

Now comes the most exciting part: testing the solution. Define a question and then execute the code below to run the search process.

# define the user's question
user_question = "When do I have to submit my tax return?"
# generate embeddings for the question
user_question_embeddings = create_embeddings_with_openai(user_question)
# search for the question in the cosmos db (returns a cursor)
search_results = vector_search(user_question_embeddings, 1)
print(search_results)
# prepare the results for the openai prompt
result_json = []
# read the cursor into a list and remove all empty values from the results
search_results = [x for x in search_results if x]
# show each document in the result (display() is available in Jupyter/IPython; use print() elsewhere)
for doc in search_results:
    display(doc.get('_id'), doc.get('textContent'), doc.get('vectorContent')[0:5])
    result_json.append(doc.get('textContent'))
# create the prompt
prompt = create_prompt(user_question, result_json)
display(prompt)
# generate the response
response = openai_request(prompt)
display(f'User question: {user_question}')
display(f'OpenAI response: {response}')

‘User question: When do I have to submit my tax return?’

'OpenAI response: When do I have to submit my tax return? \n\nAll natural persons who had their residence in the canton of Zurich on December 31, 2022, or who owned properties or business premises (or business operations) in the canton of Zurich, must submit a tax return for 2022 in the calendar year 2023. Taxpayers with a residence in another canton also have to submit a tax return for 2022 in the calendar year 2023 if they ended their tax liability in the canton of Zurich by giving up a property or business premises during the calendar year 2022. If you turned 18 in the tax period 2022 (persons born in 2004), you must submit your own tax return (for the tax period 2022) for the first time in the calendar year 2023.'

As of May 2023, the training data of ChatGPT 3.5 only extends to September 2021. So it’s evident that the response of our custom ChatGPT solution is based on the information provided in the vector database. Remember that we did not fine-tune the GPT model, so the model itself does not inherently know anything about your private data and instead uses the data that was dynamically provided to it as part of the prompt.

Real-world Applications of Chat with your data

Custom ChatGPT boosts efficiency, personalizes services, and improves experiences across industries. Here are some examples:

Customer Support: Companies can use ChatGPT for 24/7 customer service. With data from manuals, FAQs, and support docs, it delivers fast, accurate answers, enhancing customer satisfaction and lessening staff workload.
Healthcare: ChatGPT can respond to patient questions using medical texts and care guidelines. It offers data on symptoms, treatments, side effects, and preventive care, helping both healthcare providers and patients.
Legal Sector: Law firms can use ChatGPT with legal texts, court decisions, and case studies for answering legal questions, offering case references, or explaining legal terms.
Financial Services: Banks can use ChatGPT to extend their customer service and give customers advice based on their individual financial situation.
E-Learning: Schools and e-learning platforms can use ChatGPT to tutor students. Using textbooks, notes, and research papers, it helps students understand complex topics, solve problems, or guide them through a course.

In short, any sector needing a large information database for queries or services can use custom ChatGPT. It enhances engagement and efficiency by offering personalized experiences.

Summary

In this comprehensive guide, we’ve journeyed through the fascinating process of creating a customized ChatGPT that lets users chat with your business data. We started with understanding the immense value a tailored ChatGPT brings to the table and dove into its ability to produce specialized responses sourced from a custom knowledge base. This tailored approach enhances user experiences, saves time, and bolsters productivity.

We went behind the scenes to reveal the vital elements of crafting a custom ChatGPT: OpenAI’s GPT models, data embeddings, and vector databases like Azure Cosmos DB for MongoDB vCore. We clarified how these components work together to get around the token limitations inherent to GPT models. By integrating the components in Python, we broadened ChatGPT’s ability to answer queries based on your private knowledge base, thereby offering contextually appropriate responses.

I hope this tutorial was able to illustrate the business value of ChatGPT and its versatile utility across a variety of sectors, including customer service, healthcare, legal services, finance, e-learning, and CRM data analytics. Each instance emphasized the transformative potential of a personalized ChatGPT in delivering efficient, targeted solutions.

I hope you found this article helpful. If you have any questions or remarks, please drop them in the comment section.

Sources and Further Reading

Azure Cosmos DB
OpenAI pricing
Azure OpenAI
Semantic search
What are embeddings?
Using vector search on embeddings in Azure Cosmos DB for MongoDB vCore
OpenAI ChatGPT helped to revise this article
Images created with Midjourney

The post Building “Chat with your Data” Apps using Embeddings, ChatGPT, and Cosmos DB for Mongo DB vCore appeared first on relataly.com.

Using LLMs (OpenAI’s ChatGPT) to Streamline Digital Experiences

Florian Follonier — Mon, 27 Mar 2023 08:35:21 +0000

In the age of information overload, finding what you need quickly and efficiently is more important than ever. OpenAI’s GPT technology has the potential to reduce friction between products and services, making it easier for individuals and businesses to find what they need. In this article, we’ll explore some specific examples of how OpenAI is already making a difference and what we can expect in the future.

What Is Meant by Digital Friction?

Digital friction refers to any obstacles or inefficiencies that users may encounter when interacting with digital products or services. This can include things like slow-loading websites, confusing user interfaces, or difficult-to-navigate online forms.

Essentially, any aspect of a digital product or service that makes it more difficult or frustrating for users to achieve their desired outcome can be considered a form of digital friction. Reducing digital friction is a key goal of many businesses and organizations, as it can help to improve user satisfaction, drive conversions, and increase overall engagement with digital products and services.

But what about the friction that still exists when we have to Google something? Is that already a form of friction? I would argue yes. The bias of search engines towards paid results can also create friction for users seeking unbiased information. As a result, it has become common for users to search on Google and other websites separately to gain a comprehensive overview of products or services since search engines may not provide a complete picture. Having to enter a search query into a search engine manually is a form of friction that GPT can help eliminate.

Also: Mastering Prompt Engineering for ChatGPT

Examples of How OpenAI GPT Reduces Digital Friction

With OpenAI GPT models, we can expect to see a future where we no longer have to search for information, but rather, it will be readily available to us through natural language conversations with our devices.

By leveraging the power of artificial intelligence, large generative language models can reduce friction between products and services, enabling users to access information quickly and with less effort. The key is their ability to understand and reason over natural language.

Users can communicate with GPT models using natural language and express their intent.
GPT can go through large amounts of data and return an aggregated result.

As we continue to see advancements in artificial intelligence, I believe we are on the cusp of a new era of technology that will redefine how we interact with the world around us. With OpenAI GPT, we can expect to see a future where accessing information is no longer a chore, but rather a seamless and intuitive experience. In the following, we will discuss several examples.

Also: 9 Powerful Applications of OpenAI’s ChatGPT and Davinci for Your Business

With the rise of OpenAI GPT, we can expect to see even more seamless interactions between products and services.

Customer Service ChatBots

Before OpenAI, chatbots were often limited in their ability to understand and respond to customer inquiries. They relied on keyword matching and pre-programmed responses, which could lead to frustration for customers who weren’t getting the help they needed. With OpenAI, chatbots can now use natural language processing to understand the context of a customer’s inquiry and provide more accurate and helpful responses. This reduces friction by speeding up the process of resolving customer issues and increasing customer satisfaction.

Businesses can benefit from OpenAI-powered chatbots by reducing the workload of customer service agents. With chatbots able to handle routine inquiries, agents can focus on more complex tasks that require human expertise. This results in a more efficient use of resources, allowing businesses to provide better service to their customers while maximizing their operational efficiency.

E-commerce Product Recommendations

In the past, product recommendations were often based on simple algorithms that looked at a customer’s browsing history or purchase history. However, these recommendations were often too simplistic and didn’t take into account a customer’s preferences or interests. With OpenAI, product recommendations can now be based on more complex algorithms that take into account a wider range of data, such as customer reviews and social media activity.

Furthermore, the recent release of ChatGPT plugins has enabled it to browse the web. This means that it can now take into account information from websites and aggregate the result. This allows ChatGPT to provide even more accurate and relevant recommendations to customers. In the future, there is a possibility that generative language models like ChatGPT can present a comprehensive view rather than a biased fragment.

Language Translation Services

In the past, language translation services often relied on machine translation, which had several limitations. Machine translation typically used rule-based algorithms that had difficulty with idiomatic expressions, cultural nuances, and colloquialisms. This often led to translations that were inaccurate or awkward, causing confusion or misunderstandings. In addition, machine translation was often unable to recognize and correct errors in the original text, leading to further inaccuracies in the translation.

However, with the advent of OpenAI and its advanced neural networks, language translation services can now produce more accurate and natural translations. OpenAI’s neural networks are trained on large amounts of data, allowing them to recognize and adapt to a wider range of linguistic features, such as idioms, slang, and regional variations. This makes the translations produced by OpenAI much more accurate and natural-sounding than those produced by traditional machine translation.

Improved translation capabilities can help to make content more widely accessible to people who speak different languages. This can be particularly beneficial for businesses and organizations that operate in multiple countries or regions, as it allows them to reach a wider audience and communicate more effectively with customers or stakeholders who may speak different languages.

Virtual Assistants

In the past, virtual assistants were often limited in their ability to understand and respond to users’ requests due to their reliance on pre-programmed responses. With OpenAI, virtual assistants can now use natural language processing to better understand and respond to users’ requests, reducing the friction of having to repeat requests or navigate through a confusing user interface. This can improve the user experience and increase engagement with the product or service.

With the increasing capabilities of large language models like OpenAI’s GPT, we are likely to see the development of digital assistants that can help us with a wide range of day-to-day tasks. These assistants could use natural language processing to understand our requests and preferences, and then use machine learning algorithms to generate personalized recommendations and solutions.

For example, a digital assistant based on GPT could help us to organize our schedules, book appointments, make travel arrangements, and even order groceries or meals. By reducing the friction associated with these tasks, these digital assistants could help us to save time and increase our overall productivity and efficiency.

Also: ChatGPT Style Guide: Understanding Voice and Tone Prompt Options for Engaging Conversations

Customized Content Creation

Creating high-quality content can be time-consuming and challenging, especially for businesses or individuals without extensive writing experience. With OpenAI, content creation can be made easier by providing users with AI-generated content suggestions or even full articles based on their desired topic or target audience. This is another example of how OpenAI can help reduce friction and streamline tasks, ultimately making it easier for individuals and businesses to succeed in today’s digital landscape.

With OpenAI, businesses and individuals can use AI-generated content to create highly customized products and services that stand out from the competition. For example, a t-shirt company could use AI-generated designs to create truly unique and personalized shirts for their customers. This not only reduces the time and effort required for design work, but also allows for a higher level of customization and uniqueness in the final product. This can help businesses differentiate themselves in crowded markets and appeal to customers looking for something truly one-of-a-kind.

With OpenAI, product recommendations can now be based on more complex algorithms that take into account a wider range of data, such as customer reviews and social media activity.

Summary

The advancements in OpenAI technology will significantly impact the digital landscape by reducing friction between users and products/services. From customer service chatbots and e-commerce product recommendations to language translation services and content creation, OpenAI has provided new and innovative solutions to improve user experience and efficiency. And these are just a few examples. We are just beginning to understand the possibilities of OpenAI GPT to reduce friction in the digital world, and many more innovations can be expected in the coming years.

Things are already moving very fast. With the recent release of OpenAI’s ChatGPT plugins, users can now enjoy the benefits of AI-generated content and web browsing, further expanding the capabilities of digital assistants. As OpenAI continues to develop and refine its technology, we can expect to see even more exciting applications and use cases emerge, ultimately shaping the future of digital interactions.

Get ready to put on your skates!

Sources and Further Reading

TechCrunch – OpenAI connects ChatGPT to the internet.
Images created with Midjourney.
ChatGPT helped to revise this article.

The post Using LLMs (OpenAI’s ChatGPT) to Streamline Digital Experiences appeared first on relataly.com.

9 Business Use Cases of OpenAI’s ChatGPT

Florian Follonier — Sat, 25 Feb 2023 16:30:48 +0000

You haven’t had enough of the recent hype about OpenAI? Fantastic! Let’s explore the business use cases for ChatGPT and other Large Language Models (LLMs). ChatGPT is often considered the beginning of a new era of AI adoption. A growing number of ChatGPT users are now using AI in their personal workflows. However, adoption in a business context is lagging behind. Although decision-makers around the globe are highly interested in using OpenAI’s LLMs, their knowledge of use cases and implementation strategies is often vague. This article aims to offer some guidance on the topic and inspire by sharing insights from recent conversations with decision-makers and industry experts. We present nine applications of GPT for which we can expect to see increasing adoption in the coming months.

Let’s discuss these cases in more detail.

Also: Mastering Prompt Engineering for ChatGPT for Business Use

What is it with OpenAI and ChatGPT?

If you’ve been frozen in Carbonite for the past six months, you may have missed all the hype around OpenAI and ChatGPT. Let’s bring you up to speed!

ChatGPT is a member of the family of GPT language models, which were developed by OpenAI. GPT stands for “Generative Pre-trained Transformer.” These models are known for their ability to understand and generate human-like text. One of the key strengths of ChatGPT is its ability to understand the context of a conversation and generate appropriate responses. Although it’s not perfect and can sometimes make mistakes, it is a very powerful tool for generating and interpreting natural language text.

Top Business Applications for Generative Language Models

The coming months will be exciting as we can expect to see the first wave of OpenAI use cases being implemented. So what can we expect? In the current economic situation, cost reduction is top of mind for many decision-makers. Business efforts will, therefore, likely focus on improving and optimizing the existing business processes rather than exploring new applications.

One of the critical reasons for this approach is that organizations need to gain experience with generative models to lay the foundation for working on more innovative solutions in the future. Experience with traditional machine learning techniques remains valuable for many applications, but it only partly carries over to implementing OpenAI’s generative language models.

OpenAI’s GPT models are a form of general-purpose AI that uses a different approach to machine learning. Implementing these models requires new skills and knowledge. For example, prompt design and rethinking existing processes are crucial for effectively using OpenAI’s capabilities. This means that organizations need to invest in training and develop new expertise to leverage the new technology fully.

Before delving into use cases, it’s important to note that the categories were deliberately chosen to be generic and relevant across various industries. Additionally, to make the use cases as practical as possible, we’ve included examples for prompts and responses. For simplicity, this article doesn’t differentiate between various GPT-3 models but rather focuses on the use cases for ChatGPT and Davinci, which have similar capabilities. The primary distinction between the two models is that ChatGPT keeps track of conversation history, while Davinci treats each request separately.
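The difference between a conversational and a stateless model can be illustrated with a small sketch. It assumes that conversation history is kept by resending all earlier messages with each new request, which is one common way to build chat applications; how ChatGPT handles this internally is not covered here. `fake_model`, `ChatSession`, and `stateless_ask` are hypothetical names, and `fake_model` is only a stand-in for a real model call.

```python
def fake_model(messages):
    """Hypothetical stand-in for a model call: reports how many user
    turns it was shown, so the effect of history becomes visible."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s)."


class ChatSession:
    """Keeps the running message list and resends it on every turn."""

    def __init__(self):
        self.messages = []

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        reply = fake_model(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def stateless_ask(text):
    """Davinci-style usage: each request carries only the current prompt."""
    return fake_model([{"role": "user", "content": text}])


chat = ChatSession()
chat.ask("My order number is 12345.")
print(chat.ask("What was my order number?"))       # the session has seen two user turns
print(stateless_ask("What was my order number?"))  # the stateless call sees only one
```

In the stateless case, any context the model needs (such as the order number) has to be repeated inside the prompt itself.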

OpenAI ChatGPT has various business use cases; not all are new. Image Source: Created with Midjourney

1. Text Summarization

Today’s fast-paced business environment confronts organizations with an overwhelming amount of text and documents to process. From customer requests to contracts and product information, employees often spend countless hours sifting through information to understand and extract relevant insights. The inability to efficiently process text-based data can cause delays in critical business processes, leading to reduced productivity and increased costs.

OpenAI’s text summarization capabilities are a game-changer for businesses that need to process large amounts of text. By summarizing dense documents into a few key points or even a single title, businesses can increase process efficiency and improve decision-making.

Text summarization is helpful in various areas. One example is customer service, where customer inquiries via email can be numerous and time-consuming to address. In many cases, only a part of a conversation is relevant to the business, and employees spend a lot of time sifting through customer emails and chat messages. By using OpenAI, customer service representatives can quickly summarize the content of customer emails and identify the key issues that need to be addressed. This form of summarization can lay the foundation for delegating tasks to teams and individuals effectively.

Example of GPT-3 Based Summarization

Prompt
Summarize the following text into three sentences, each with a maximum of 20 words: https://en.wikipedia.org/wiki/Reinforcement_learning

Response
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Reinforcement learning is different from supervised learning as there is no labelled input/output pair. It can be applied to various real-world scenarios such as game playing, robotics, recommendation systems, and even in autonomous driving
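
Length limits in a prompt are requests, not guarantees, so it can be worth checking a summary before passing it on. The following is a minimal sketch under simple assumptions: the text itself (rather than a URL) is placed in the prompt, and sentences are split naively on end punctuation. `build_summary_prompt` and `check_summary` are hypothetical helper names, and the call to the model is left out.

```python
import re


def build_summary_prompt(text, sentences=3, max_words=20):
    """Build a summarization prompt in the style of the example above."""
    return (
        f"Summarize the following text into {sentences} sentences, "
        f"each with a maximum of {max_words} words:\n\n{text}"
    )


def check_summary(summary, sentences=3, max_words=20):
    """Return a list of constraint violations (an empty list means OK)."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    parts = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    problems = []
    if len(parts) != sentences:
        problems.append(f"expected {sentences} sentences, got {len(parts)}")
    for i, sentence in enumerate(parts, start=1):
        n_words = len(sentence.split())
        if n_words > max_words:
            problems.append(f"sentence {i} has {n_words} words")
    return problems


good = "Agents learn by trial and error. Rewards guide their behavior. It is used in robotics."
bad = "This single sentence is fine. " + "word " * 25 + "end."

print(check_summary(good))
print(check_summary(bad))
```

The first sentence of the example response above runs well past 20 words, so a check like this would flag it and the request could be retried or shortened.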

2. Message Triage and Other Types of Text Classification

The modern work environment is characterized by work sharing and specialization. However, this often leads to tasks landing on someone’s desk who may not be the best person to handle them, leading to delays in completing the task. This is where GPT models can be particularly useful. These models can classify text into categories, allowing organizations to automate the routing of incoming messages to the appropriate person or department. Prediction accuracy will benefit when the models are fine-tuned for a particular domain or industry.

Implementing GPT models for classification can yield significant benefits for organizations, enabling them to save time and resources while ensuring that messages and tasks are directed to the right person or department. For instance, organizations can use generative AI for email triage and prioritization by categorizing incoming messages and assigning them to specialized service teams, such as contract inquiries, relocations, claim-related issues, and general complaints.

In addition, sentiment analysis can be employed to gauge customer satisfaction or dissatisfaction in social media statements. Organizations can start with a semi-automated process, where they use OpenAI to assign labels to text, and a human then makes the final decision on who should handle the task. This can lead to increased efficiency and accuracy, as well as reduced response times and improved customer satisfaction.

Processing large amounts of documents and text is a major pain point in many organizations. Image created with Midjourney

Example of GPT-3 Based Message Triage

Prompt
Classify the following news article into 1 of the following categories:
categories: [Business, Tech, Politics, Sport, Entertainment]
news article: Donna Steffensen Is Cooking Up a New Kind of Perfection. The Internet’s most beloved cooking guru has a buzzy new book and a fresh new perspective:
Classified category:

Response
Entertainment
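
For routing, the free-text reply has to be mapped onto a fixed set of labels, with a fallback to a human when the reply does not match. A minimal sketch of that semi-automated step, assuming the category list from the prompt above; `build_triage_prompt`, `parse_category`, and the fallback label are hypothetical, and the model call itself is omitted.

```python
CATEGORIES = ["Business", "Tech", "Politics", "Sport", "Entertainment"]


def build_triage_prompt(article, categories=CATEGORIES):
    """Build a classification prompt in the style of the example above."""
    return (
        "Classify the following news article into 1 of the following categories:\n"
        f"categories: [{', '.join(categories)}]\n"
        f"news article: {article}\n"
        "Classified category:"
    )


def parse_category(response, categories=CATEGORIES, fallback="Needs human review"):
    """Map a model reply onto a known label, or fall back to a human."""
    cleaned = response.strip().strip(".").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return fallback


print(parse_category(" Entertainment.\n"))
print(parse_category("I think this is about food"))
```

Anything that does not exactly match a known category lands in the review queue instead of being routed on a guess.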

3. Completeness of Information Checks

Another challenge that many businesses face when processing natural language is incomplete information. Customers may submit inquiries and orders via email, social media, or telephone, and while it is convenient for them to state their issue in natural language, the messages may lack the information needed to process the request further. Examples include missing product or service information in an order, address changes where the new address is missing, or inquiries with missing order or ticket numbers. In these cases, a person must first reach out to the customer to collect the missing information and then check again whether the information is complete.

OpenAI can help to identify missing information and streamline information processing. For example, when a customer submits an order request via email, social media, or telephone, OpenAI can analyze the message and identify any missing product or service information, address changes, or order and ticket numbers. This can help customer service teams quickly address the issue and fulfill the request, ultimately improving customer satisfaction and loyalty.

If the AI identifies missing information, it can directly reach out to the customer to request it, enhancing the process efficiency even further. Additionally, combining OpenAI’s capabilities with text classification can streamline the process by automatically categorizing and prioritizing requests.

Example of GPT-3 Based Completeness of Information Checks

Prompt
Determine if the numeric 5-digit claim number is present in the text below. Limit your answer to one of the following categories: [Yes, No]
Text: “Awful airline! I have emailed you over 5 times over the past few months regarding my lost bag claim. I’ve included claim number, and details and I simply never get an answer. What a joke.”
Classified category:

Response
No.
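
When the missing item has a fixed format, a deterministic check can back up the model’s answer. The sketch below assumes, as the prompt does, that a claim number is exactly five digits; `find_claim_number` is a hypothetical helper and is not part of any OpenAI API.

```python
import re

# Assumed format: exactly five digits, not part of a longer number.
CLAIM_NUMBER = re.compile(r"\b\d{5}\b")


def find_claim_number(text):
    """Return the first 5-digit number in the text, or None if there is none."""
    match = CLAIM_NUMBER.search(text)
    return match.group(0) if match else None


message = (
    "Awful airline! I have emailed you over 5 times over the past few months "
    "regarding my lost bag claim. I've included claim number, and details "
    "and I simply never get an answer. What a joke."
)

print(find_claim_number(message))
print(find_claim_number("My claim number is 48213."))
```

A pattern like this only works for rigid identifiers. For looser information, such as a missing delivery address, the language model remains the more practical option.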

4. Personalized Communication

Personalized communication is key for businesses to achieve a good customer experience and high satisfaction because it demonstrates that the company values each customer as an individual with unique needs and preferences. By customizing communication, businesses can provide targeted messages and recommendations that are relevant to each customer, making them feel seen and understood. This can lead to increased engagement, loyalty, and improved customer satisfaction. However, creating personalized communication usually takes a lot of time and resources. This is where GPT models can step in.

ChatGPT offers businesses the opportunity to enhance customer satisfaction, engagement, and sales through personalized communication. The model can generate individualized email campaigns, social media posts, and website/app experiences, creating a tailored and unique interaction for each customer.

As businesses begin to implement personalized communication, it may be beneficial to conduct a test phase during which humans review generated messages to ensure accuracy and relevance. Additionally, businesses can provide ChatGPT with additional information on customers, such as past purchases or preferences, to personalize messages further and increase the effectiveness of the communication strategy.

Example of a GPT Prompt for Personalized Communication

Prompt
Write a gentle answer to this customer request: Mr. Jong: I can’t believe I thought this was a quality company – I’m still waiting for my order despite being told over and over again that it would be delivered within 24 hours, only for me to have to chase again. So disappointed, don’t trust this company to fulfil orders as they say they will, and customer service haven’t helped at all.

Response
Dear Mr. Jong,

I am sorry to hear that you have had a negative experience with our company. I can understand your frustration with the delay in receiving your order despite being promised a 24-hour delivery.

Please know that this is not the level of service that we strive to provide our customers. We take pride in ensuring that all our orders are fulfilled promptly and efficiently.

I would like to assure you that we are investigating this matter, and we will take the necessary steps to ensure that this doesn’t happen again in the future.

We value your business and appreciate your feedback. Our customer service team is always available to help and address any concerns you may have. Please don’t hesitate to reach out to us if you need any assistance.

Thank you for bringing this to our attention, and I hope we can earn back your trust and continue to serve you in the future.

Best regards,

[Your Name]

Organizations can use GPT-3 to personalize customer interactions, which is crucial in delivering remarkable experiences. Image generated with Midjourney.

5. Document Comparison

ChatGPT can help businesses to identify differences and similarities between two documents. This feature can be particularly useful in situations where businesses need to ensure the authenticity of important documents or when multiple collaborators work on the same document. For instance, a law firm can use ChatGPT to compare two versions of a contract and highlight any discrepancies or unusual changes. Additionally, ChatGPT can group similar documents together and identify instances of plagiarism, redundant information, or conflicting statements, which can save time and improve accuracy.

Areas where document comparison with GPT models is particularly useful include underwriting, purchasing, and contract management, especially in legal and contractual matters such as comparing versions of contracts, identifying unauthorized changes or unusual deviations from the norm, and ensuring compliance with regulations.

For example, in the insurance industry, underwriters can use ChatGPT to compare policy documents and ensure that there are no inconsistencies or errors in coverage. Purchasers can use ChatGPT to compare supplier contracts and ensure that prices, terms, and conditions are consistent across multiple agreements. In contract management, ChatGPT can help businesses track changes made to a contract over time and ensure that all parties are adhering to the agreed-upon terms. By utilizing ChatGPT for document comparison, businesses can maintain document integrity and improve their overall efficiency.

Example of a Prompt for Document Comparison

Prompt
Compare the following two privacy statements and explain the difference:
Text 1: This website uses cookies, as do many other websites. Cookies are used to store information about visitors, including their preferences and which pages they have accessed or visited on the website. This information is used to optimize the user experience by customizing the content of our web pages based on the visitor’s browser type and other information.
Text 2: Like any other website, this website uses ‘cookies.’ These cookies are used to store information, including visitors’ preferences and the pages on the website that the visitor accessed or visited. The information is used to optimize the users’ experience by customizing our web page content based on visitors’ browser type and/or other information.
We will definitely sell your information.

Response
There is a clear difference between the two privacy statements. Text 1 does not mention anything about selling the visitor’s information, while Text 2 includes the sentence “We will definitely sell your information.” This indicates that the website may not have ethical or lawful practices when it comes to handling visitor information, and it could potentially violate privacy laws. Additionally, Text 1 appears to be more professional and straightforward, while Text 2 includes a grammatical error in the first sentence. However, both texts mention the use of cookies to store visitor information and optimize the user experience.

Generative AI can compare documents and spot differences or deviations from the norm.
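
Where an exact, auditable list of changes is needed, a conventional text diff can complement the model’s explanation. The sketch below uses Python’s standard `difflib` module, which is a different technique from the GPT-based comparison described above. The two texts are shortened stand-ins for the privacy statements, and `added_lines` is a hypothetical helper.

```python
import difflib


def added_lines(old, new):
    """Return the lines that the diff marks as added in `new`."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


# Shortened stand-ins for the two privacy statements.
text_1 = (
    "This website uses cookies.\n"
    "Cookies store visitor preferences."
)
text_2 = (
    "This website uses cookies.\n"
    "Cookies store visitor preferences.\n"
    "We will definitely sell your information."
)

print(added_lines(text_1, text_2))
```

A literal diff only catches changed wording. In the original example, Text 2 paraphrases Text 1 almost sentence by sentence, so a line diff would flag nearly everything, whereas the language model can tell that only the last sentence changes the meaning.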

6. Reasoning and Question Answering over a Knowledge Base

OpenAI’s advanced machine learning models, such as the GPT series, have proven to be powerful tools for analyzing complex information. This information can be structured or unstructured data, held in individual documents or in a comprehensive knowledge base.

Take financial data, for example. Balance statements can be a nightmare to interpret manually, but with OpenAI’s natural language processing, companies can automatically pinpoint trends and crucial information, making it easier to digest and act on financial data. And that’s just the beginning. By enhancing their solutions with techniques for summarization or advanced prompt engineering, organizations can generate complete reports with tables, further simplifying the digestion of complex information.

But OpenAI’s power extends beyond finance. In IoT (Internet of Things), analyzing sensor data can be a mind-bending challenge. OpenAI can help companies make sense of this data in real-time, identifying patterns and anomalies that might have otherwise gone unnoticed. This can be especially valuable in manufacturing, where even minor sensor data variations can significantly impact product quality and efficiency.

Also: Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python

In short, OpenAI’s ability to analyze and reason over complex information is nothing short of revolutionary. With this technology, companies can make faster, more informed decisions based on large volumes of data. As technology continues to improve and evolve, the possibilities for AI are limitless. We can expect to see organizations using AI to interpret and analyze even more complex information in the years ahead, paving the way for a brighter, smarter future.

Sample Prompt for Reasoning over Structured and Unstructured Data

Prompt
Assess the financial situation of an enterprise with the following income statement:
Income Statement (all numbers in thousands)
Date | 9/29/2022 | 9/29/2021
Total Revenue | 387,537,000 | 394,328,000
Cost of Revenue | 220,666,000 | 223,546,000
Gross Profit | 166,871,000 | 170,782,000
Operating Expense | 52,906,000 | 51,345,000
Operating Income | 113,965,000 | 119,437,000
Total Operating Income as Reported | 113,965,000 | 119,437,000
Total Expenses | 273,572,000 | 274,891,000
Interest Income | 3,043,000 | 2,825,000
Interest Expense | 3,240,000 | 2,931,000
EBIT | 116,725,000 | 122,034,000

Response
To assess the financial situation of the enterprise, we need to analyze the income statement and calculate the key financial ratios.
First, let’s look at the income statement. We can see that the company’s revenue has decreased slightly from 2021 to 2022. However, the cost of revenue has also decreased, resulting in a stable gross profit. Operating expenses have increased slightly, but the operating income remains relatively stable. Interest income has increased slightly, but interest expenses have also increased.

Now let’s calculate some financial ratios:

Gross Profit Margin = (Gross Profit / Total Revenue) x 100
= (166,871,000 / 387,537,000) x 100
= 43.0%
The gross profit margin measures how much profit the company makes after accounting for the cost of goods sold. A gross profit margin of 43.0% indicates that the company is able to generate a decent profit from its sales.

Operating Profit Margin = (Operating Income / Total Revenue) x 100
= (113,965,000 / 387,537,000) x 100
= 29.4%
The operating profit margin measures how much profit the company makes after accounting for all operating expenses. A margin of 29.4% indicates that the company is able to generate a good amount of profit from its operations.

…

GPT-3.5 can reason over complex data like no other AI before.
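
Figures in a generated analysis are easy to verify independently. The sketch below recomputes the two margins from the 2022 column of the income statement in the prompt; `margin` is a hypothetical helper name.

```python
def margin(part, total):
    """Return part / total as a percentage, rounded to two decimals."""
    return round(part / total * 100, 2)


# Values for 9/29/2022, copied from the income statement in the prompt.
total_revenue = 387_537_000
gross_profit = 166_871_000
operating_income = 113_965_000

print(margin(gross_profit, total_revenue))      # gross profit margin in %
print(margin(operating_income, total_revenue))  # operating profit margin in %
```

The gross profit margin comes out at about 43.06%, which rounds to 43.1% rather than the 43.0% given in the response, while the operating margin of about 29.41% matches. Small slips like this are a good reason to recompute key figures before relying on a generated report.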

7. Query Generation

OpenAI ChatGPT can also assist businesses with SQL query generation. This is especially useful for non-technical users, who may not be familiar with SQL syntax or the structure of the database. With ChatGPT, users can simply ask natural language questions about the data they are interested in, and the model can generate the appropriate SQL queries to extract the relevant information.

This has the potential to greatly expand the user base that can directly interact with a database and leverage the existing information. More advanced solutions can provide question-answering functionality based on the data residing in data warehouses (DWH), which non-technical users often perceive as black boxes. This can help businesses quickly extract insights and make data-driven decisions without the need for extensive technical knowledge.

Sample Prompt for SQL Query Generation

Prompt
Tables in the database: Customers, Purchases, Addresses. Tables are linked through a customer_id, which is present in all tables. Given an input question, respond with syntactically correct PostgreSQL. Input: Get the names and addresses of customers who made a purchase in the past week.

Response
Here’s the PostgreSQL query to get the names and addresses of customers who made a purchase in the past week:
SELECT c.name, a.address
FROM Customers c
JOIN Purchases p ON c.customer_id = p.customer_id
JOIN Addresses a ON c.customer_id = a.customer_id
WHERE p.purchase_date > now() - interval '1 week';
This query uses a JOIN operation to link the Customers, Purchases, and Addresses tables together using the customer_id field. It selects the name and address columns from the Customers and Addresses tables, respectively, and filters the results to include only purchases made in the past week using the WHERE clause with a date comparison.

If ChatGPT can generate SQL queries, why not use these queries to return a result to the user directly?
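
One way to approach that question is to run the generated query behind a simple safety check. The sketch below is illustrative only and rests on several assumptions: it uses an in-memory SQLite database from Python’s standard library instead of PostgreSQL, so the date filter is rewritten in SQLite syntax; the table contents are made up; and `is_read_only` and `run_generated_query` are hypothetical helpers, not a complete defense against harmful queries.

```python
import sqlite3


def is_read_only(sql):
    """Crude guard: accept a single SELECT statement and nothing else."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def run_generated_query(conn, sql):
    """Execute model-generated SQL only if it passes the read-only guard."""
    if not is_read_only(sql):
        raise ValueError("only single SELECT statements are executed")
    return conn.execute(sql).fetchall()


# Made-up sample data: Alice bought yesterday, Bob a month ago.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Purchases (customer_id INTEGER, purchase_date TEXT);
CREATE TABLE Addresses (customer_id INTEGER, address TEXT);
INSERT INTO Customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Addresses VALUES (1, '1 Main St'), (2, '2 Side St');
INSERT INTO Purchases VALUES (1, date('now', '-1 day')), (2, date('now', '-30 days'));
""")

# SQLite variant of the generated query (PostgreSQL's interval syntax differs).
generated_sql = """
SELECT c.name, a.address
FROM Customers c
JOIN Purchases p ON c.customer_id = p.customer_id
JOIN Addresses a ON c.customer_id = a.customer_id
WHERE p.purchase_date > date('now', '-7 days');
"""

rows = run_generated_query(conn, generated_sql)
print(rows)
```

In a real deployment, a database user with read-only permissions would be a more robust safeguard than string inspection, and the result rows could be handed back to the model to phrase an answer in natural language.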

8. Language Translation

Another powerful feature is ChatGPT’s language translation capacity. It can translate text in near real-time, allowing businesses to communicate with customers in their preferred language. While language translation is nothing new, what sets ChatGPT apart is its ability to seamlessly integrate with any process by simply adding a “translate to” command to a prompt. As a result, ChatGPT has the potential to significantly lower language barriers and facilitate communication across international teams and channels.

By using ChatGPT for language translation, businesses can enhance their reach and improve customer satisfaction. ChatGPT’s language translation capabilities can be used to translate various types of content, such as customer service interactions, website content, and other communication channels. What’s even more impressive is that ChatGPT supports regional dialects like Swiss-German, making it a versatile tool for organizations operating in multilingual regions.

Example of GPT-3 Based Language Translation

Prompt
Translate to Swiss-German: ChatGPT can translate text in real-time, making it possible to communicate with customers in their preferred language. This can help businesses expand their reach and improve customer satisfaction. ChatGPT’s language translation capabilities are highly accurate and can be used to translate customer service interactions, website content, and other types of communication. By using ChatGPT for language translation, businesses can improve their ability to communicate with a global audience.

Response
ChatGPT cha verwendet werde, um Text in Echtzeit z’übersetze und ermöglicht so d’Kommunikation mit Kunden in ihre bevorzugte Sprach. Das cha Unternehme helfe, ihre Reichwiti z’vergrössere und d’Kundenzufriedeheit z’verbessere. ChatGPT’s Sprachübersetzigsfähigkeit isch hochpräzis und cha verwendt werde, um Kundeservice-Interaktione, Website-Inhalte und ander Arten vo Kommunikation z’übersetze. Durch d’Verwendig vo ChatGPT für d’Sprachübersetzig cha d’Kommunikationsfähigkeit vo Unternehme mit ere globale Zielgruppe verbessert werde.

9. Content Generation

Content generation is a critical aspect of modern business. It is vital in attracting and engaging customers, building brand awareness, and establishing thought leadership in a particular industry. As such, businesses invest significant resources in creating high-quality content that can capture the attention of their target audience. Such content ranges from blog posts and social media updates to product descriptions and website copy.

However, creating compelling and engaging content is not always easy. Businesses need to produce a large volume of content while ensuring that it is high-quality and relevant to their target audience. This is where OpenAI ChatGPT can assist.

ChatGPT can assist businesses with content generation in a variety of ways. For instance, it can generate unique and engaging content based on specific topics or keywords provided by the user, saving businesses time and resources. Moreover, ChatGPT can be trained to understand a business’s unique tone and voice. In this way, businesses can ensure that the generated content aligns with their brand and values.

Another way that ChatGPT can assist with content generation is by automating the content curation process. By analyzing and organizing vast amounts of data from various sources, including social media, blogs, news articles, and research papers, ChatGPT can generate curated content that is highly relevant and valuable to a business’s target audience. Some examples are product recommendations, email campaigns, or chatbot interactions that are tailored to each customer’s preferences and needs.

Example of GPT-3 Based Content Generation

Prompt
Generate three creative advertisement slogans for Cereals for the Japanese market. Be as creative as possible. Put in some Pokemon references.

Response
“Gotta catch ’em all, starting with your breakfast bowl! Our cereals are the perfect fuel for your daily adventures!”
“Our cereals are so delicious, even Pikachu would choose them over berries!”
“Power up your mornings with our cereals, just like how trainers power up their Pokemon with berries!”

GPT-3 is great at content creation and can also come up with really creative content. Image created with Midjourney

Conclusion

This article has outlined nine business applications that are likely to be the focus of the first wave of OpenAI adoption. While some of these applications, such as sentiment analysis, are not entirely new and can be accomplished with traditional NLP models, others like content generation are groundbreaking and were not previously possible.

The hype surrounding GPT models stems from their combination of versatility and performance. Unlike traditional models, generative AI can handle a variety of tasks simultaneously without requiring extensive training data. As a result, businesses can leverage generative AI to develop new use cases, as well as upgrade their existing processes and become more agile by outsourcing some of their decision logic to generative AI.

It will be fascinating to witness the development of generative AI and how it will transform how we work and communicate in the future. The emergence of generative AI is often compared to the introduction of the iPhone, and just as the iPhone has evolved over time, so too will generative AI. If we think of GPT-3 as the third generation of the iPhone, we can only imagine what the future holds.

Sources and Further Reading

Reuter.com/chatgpt-sets-record-fastest-growing-user-base-analyst/
OpenAI.com/prompt-design
ChatGPT helped to revise some sections of this article.
Images are generated with Midjourney, an AI that generates images from text.
Performing traditional sentiment analysis using Python

Articles on the practical implementation of OpenAI GPT-3 models via the public API

The post 9 Business Use Cases of OpenAI’s ChatGPT appeared first on relataly.com.

Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python

Florian Follonier — Mon, 25 Jul 2022 11:29:00 +0000

Content-based recommender systems are a popular type of machine learning algorithm that recommends relevant articles based on what a user has previously consumed or liked. This approach aims to identify items with certain keywords, understand what the customer likes, and then identify other items that are similar to items the user has previously consumed or rated. The recommendations are based on the similarity of the items, represented by similarity scores in a vector matrix. The attributes used to describe an item are called “content.” For example, in the case of movie recommendations, content could be the genre, actors, director, year of release, etc. A well-designed content-based recommendation service will suggest movies of the same genre, actors, or keywords. This tutorial will implement a content-based recommendation service for movies using Python and Scikit-learn.

The rest of this tutorial proceeds as follows: After a brief introduction to content-based recommenders, we will work with a database that contains several thousands of IMDB movie titles and create a feature model that uses actors, release year, and a short description for each movie. In this tutorial, you will also learn how to deal with some challenges of building a content-based recommender. For example, we will look at how we can engineer word features for a content-based model and reduce the dimensionality of our model. Finally, we use our model to generate some sample predictions.

Note: Another popular type of recommender system that I have covered in a previous article is collaborative filtering.

Recommendation systems can ease decision-making. Image created with Midjourney.

What is Content-Based Filtering?

The idea behind content-based recommenders is to generate recommendations based on a user’s preferences and tastes. These preferences revolve around past user choices, for example, the number of times a user has watched a movie, purchased an item, or clicked on a link.

Content-based filtering uses domain-specific item features to measure the similarity between items. Given the user preferences, the algorithm will recommend items similar to what the user has consumed or liked before. For movie recommendations, this content can be the genre, actors, release year, director, film length, or keywords used to describe the movies. This approach works particularly well for domains with a lot of textual metadata, such as movies and videos, books, or products.

Content-based movie recommendations will suggest more of the same, for example, actors, genres, stories, and directors.

Basic Steps to Building a Content-based Recommender System

The approach to building a content-based recommender involves four essential steps:

The first step is to create a so-called ‘bag of words’ model from the input data, which is a list of words used to characterize the items. This step involves selecting useful content for describing and differentiating the items. The more precise the information, the better the recommendations will be.
The next step is to turn the bag (of words) into a feature vector. Different algorithms can be used for this step, for example, the tf-idf vectorizer or the count vectorizer. The result is a vector matrix with items as records and features as columns. This step often also includes applying techniques for dimensionality reduction.
The idea of content-based recommendations is based on measuring item similarity. Similarity scores are assigned through pairwise comparison. Here again, we can choose between different measures, e.g., the dot product or cosine similarity.
Once you have the similarity scores, you can return the most similar items by sorting the data by similarity scores. Given user preferences (single or multiple items a user consumed or liked), the algorithm will then recommend the most similar items.
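The four steps above can be sketched in a few lines. This is a minimal illustration, not the implementation used later in this tutorial: the three toy items, their titles, and the helper most_similar are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: hypothetical toy items, each described by a small bag of words
items = {
    "Space Quest": "space alien spaceship adventure",
    "Star Voyage": "space spaceship crew adventure",
    "Love in Paris": "romance paris love comedy",
}
titles = list(items)

# Step 2: turn the bags of words into a feature matrix (items x words)
vectors = TfidfVectorizer().fit_transform(list(items.values()))

# Step 3: compute pairwise cosine similarities between all items
sim = cosine_similarity(vectors)

# Step 4: rank the other items by their similarity to a given item
def most_similar(title, top_n=2):
    idx = titles.index(title)
    ranked = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    return [(titles[i], round(float(score), 2)) for i, score in ranked if i != idx][:top_n]

print(most_similar("Space Quest"))
```

"Star Voyage" ranks first because it shares three words with "Space Quest", while "Love in Paris" shares none and receives a score of zero.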

Approach to Building a Content-based Recommender System

Similarity Scoring

The quality of the content-based recommendations is significantly influenced by how well the algorithm succeeds in measuring the similarity of the items. There are different techniques to calculate similarity, including Cosine Similarity, Pearson Similarity, Dot Product, and Euclidean Distance. They have in common that they use numerical characteristics of the text to calculate the distance between text vectors in an n-dimensional vector space.

It is worth noting that these techniques can only measure word-level similarity. This means the algorithms compare the items word for word without considering the semantic meaning of the sentences. In some instances, this can lead to errors. For example, how similar are “now that they were sitting on a bank, he noticed she stole his heart, and he was in love” and “They are gangsters who love to steal from a large bank”? By just looking at the words, the two sentences may appear similar because their words overlap, although their meanings differ.
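To make this limitation tangible, here is a small sketch that scores the two example sentences with a count vectorizer and cosine similarity. The third sentence is a hypothetical addition of mine: it is closer in meaning to the first sentence but shares no content words with it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "now that they were sitting on a bank, he noticed she stole his heart, and he was in love",
    "They are gangsters who love to steal from a large bank",
    # hypothetical sentence: similar meaning to the first one, different words
    "He fell for her while they rested by the river",
]

# count the words per sentence, ignoring common English stop words
counts = CountVectorizer(stop_words="english").fit_transform(sentences)
sim = cosine_similarity(counts)

print("romance vs. gangsters :", round(float(sim[0, 1]), 2))
print("romance vs. paraphrase:", round(float(sim[0, 2]), 2))
```

Because the first two sentences share the words "bank" and "love", the word-level comparison rates them as more similar than the first and third sentences, which a human reader would judge the other way around.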

Pros and Cons of Content-based Filtering

Like most machine learning algorithms, content-based recommenders have their strengths and weaknesses.

Advantages

Content-based filtering is good at capturing a user’s specific interests and will recommend more of the same (for example, genre, actors, directors, etc.). It will also recommend niche items if they match the user preferences, even if these items draw little attention.
Another advantage is that the model can generate recommendations for a specific user without the knowledge of other users. This is particularly helpful if you want to generate predictions for many users.

Disadvantages

On the other hand, there are also a couple of downsides. The feature representation of the items has to be done manually to a certain extent, and the prediction quality strongly depends on whether items are described in detail. Therefore, content-based filtering requires a lot of expertise.
Recommendations are based on the user’s previous interests. They are therefore unlikely to go beyond that and expand to areas (e.g., genres) that are still unknown to the user. Content-based models thus tend to develop some tunnel vision, so that the model recommends more and more of the same.

Implementing a Content-based Movie Recommender in Python

In the following, we will implement a content-based movie recommender using Python and Scikit-learn. We will carry out all steps necessary to create a content-based recommender. The data comes from an IMDB dataset containing more than 40k films between 1996 and 2018. Based on the data, we define the features we want to use for recommending the movies. These features include the genre, director, main actors, plot keywords, or other metadata associated with the movies. Then we preprocess the data to extract these features and create a feature matrix. The feature matrix becomes the foundation for a similarity matrix that measures the similarity between the items based on their feature vectors. Finally, we use the similarity matrix to generate recommendations for a given item.

By the end of this Python tutorial, you will have learned how to implement a content-based recommendation system for movies using Python and Scikit-learn. This knowledge can be applied to other types of recommendations, such as articles, products, or songs.

The code is available on the GitHub repository.


This Robot doesn’t know what to watch on tv. Let’s build a recommender system for him! Image generated using DALL-E 2 by OpenAI.

Prerequisites

Before you start with the coding part, ensure you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. Follow this tutorial to set it up.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages NumPy, pandas, Matplotlib, and scikit-learn, as the imports in Step #1 show.

In addition, we will be using Seaborn for visualization and the natural language processing library nltk.

You can install these packages by using one of the following commands:

pip install
conda install (if you are using the Anaconda package manager)

About the IMDB Movies Dataset

We will train our movie recommender on a popular Movies Dataset (you can download it from grouplens.org). The MovieLens recommendation service collected the Dataset from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.

The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. The Dataset contains the following files (Source of the data description: Kaggle.com):

movies_metadata.csv: The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a 5-star movie rating with half-star increments (0.5 – 5.0 stars).
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our films. Available in the form of a stringified JSON Object.


IMDB Movie Database

Several other files are included that we won’t use, incl. ratings_small, links_small, and links.

You can download it here or from Kaggle.

Step #1: Load the Data

Our goal is to create a content-based recommender system for movie recommendations. In this case, the content will be meta information on movies, such as genre, actors, and the description.

We begin by making imports and loading the data from three files:

movies_metadata.csv
credits.csv
keywords.csv

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords

# the IMDB movies data is available on Kaggle.com
# https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

# in case you have placed the files outside of your working directory, you need to specify the path
path = 'data/movie_recommendations/' 

# load the movie metadata
df_meta=pd.read_csv(path + 'movies_metadata.csv', low_memory=False, encoding='UTF-8') 

# some records have invalid ids, which is why we remove them
df_meta = df_meta.drop([19730, 29503, 35587])

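# convert the id to type int and set id as index
# (.str.replace removes commas inside the strings; Series.replace would only match whole values)
df_meta = df_meta.set_index(df_meta['id'].str.strip().str.replace(',', '', regex=False).astype(int))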
pd.set_option('display.max_colwidth', 20)
df_meta.head(2)

		adult	belongs_to_collection			budget		genres				homepage			id		imdb_id		original_language	original_title	overview			...	release_date	revenue		runtime	spoken_languages	status		tagline	title	video	vote_average	vote_count
id																					
862		False	{'id': 10194, 'n...				30000000	[{'id': 16, 'nam...	http://toystory....	862		tt0114709	en					Toy Story		Led by Woody, An...	...	1995-10-30		373554033.0	81.0	[{'iso_639_1': '...	Released	NaN	Toy Story	False	7.7	5415.0
8844	False	NaN								65000000	[{'id': 12, 'nam...	NaN					8844	tt0113497	en				Jumanji				When siblings Ju...	...	1995-12-15		262797249.0	104.0	[{'iso_639_1': '...	Released	Roll the dice an...	Jumanji	False	6.9	2413.0

After we have loaded credits and keywords, we will combine the data into a single dataframe. Now we have various input fields available. However, we will only use keywords, cast, year of release, genres, and overview. If you like, you can enhance the data with additional inputs, for example, budget, running time, or film language.

Once we have gathered our data in a single dataframe, we print out the first rows to gain an overview of the data.

# load the movie credits
df_credits = pd.read_csv(path + 'credits.csv', encoding='UTF-8')
df_credits = df_credits.set_index('id')

# load the movie keywords
df_keywords=pd.read_csv(path + 'keywords.csv', low_memory=False, encoding='UTF-8') 
df_keywords = df_keywords.set_index('id')

# merge everything into a single dataframe 
df_k_c = df_keywords.merge(df_credits, left_index=True, right_on='id')
df = df_k_c.merge(df_meta[['release_date','genres','overview','title']], left_index=True, right_on='id')
df.head(3)

		keywords			cast				crew				release_date	genres				overview			title
id							
862		[{'id': 931, 'na...	[{'cast_id': 14,...	[{'credit_id': '...	1995-10-30		[{'id': 16, 'nam...	Led by Woody, An...	Toy Story
8844	[{'id': 10090, '...	[{'cast_id': 1, ...	[{'credit_id': '...	1995-12-15		[{'id': 12, 'nam...	When siblings Ju...	Jumanji
15602	[{'id': 1495, 'n...	[{'cast_id': 2, ...	[{'credit_id': '...	1995-12-22		[{'id': 10749, '...	A family wedding...	Grumpier Old Men

We can see cast, crew, and genres have a dictionary-like structure. To create a cosine similarity matrix, we need to extract the keywords from these columns and gather them in a single column. This is what we will do in the next step.
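As a small aside, the dictionary-like columns are plain strings that have to be parsed before we can access the names. The sketch below uses ast.literal_eval from the Python standard library, which only evaluates literals and is therefore a safer alternative to eval. The input string is a shortened, made-up example in the style of the genres column, and extract_names is a hypothetical helper.

```python
from ast import literal_eval

# shortened, made-up example of a stringified list of dictionaries
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def extract_names(raw):
    """Parse a stringified list of dicts and return the values of the 'name' keys."""
    return [entry["name"] for entry in literal_eval(raw)]

print(extract_names(raw_genres))  # ['Animation', 'Comedy']
```

The same helper could be passed to the apply function of a pandas column, for example on the keywords, cast, or genres columns.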

Step #2: Feature Engineering and Data Cleaning

A problem with modeling text is that machine learning algorithms have difficulty processing text directly. An essential step in creating content-based recommenders is bringing the text into a machine-readable form. This is what we call feature engineering.

2.1 Creating a Bag-of-Words Model

We begin with feature engineering and creating the bag of words. As mentioned, a bag of words is a list of words relevant to describe items in a dataset, such as films, and differentiate them. Creating a bag of words removes stopwords but preserves multiplicity so that words can occur multiple times in the concatenated text. Later, each word can be used as a feature in calculating cosine similarities.

The input for a bag of words does not necessarily come from a single input column. We will use keywords, genres, cast, and the release year and merge them into a new single column that we call tags. The code below also extracts the overview, but it does not add it to the tags; you can append it if you want to experiment with free-text descriptions. Make sure to respect the nature of each text field: we keep names and surnames together as a single token instead of splitting them, as we would do with the words from the overview column. The result of this process is our bag.

In addition, we add the movie title and a new index (id), which will later ease working with the similarity matrix. Finally, we print the first rows of our feature dataframe.

# create an empty DataFrame
df_movies = pd.DataFrame()

# extract the keywords
df_movies['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['keywords'] = df_movies['keywords'].apply(lambda x: ' '.join([i.replace(" ", "") for i in x]))

# extract the overview
df_movies['overview'] = df['overview'].fillna('')

# extract the release year (missing or invalid dates become an empty string)
df_movies['release_date'] = pd.to_datetime(df['release_date'], errors='coerce').dt.strftime('%Y').fillna('')

# extract the actors
df_movies['cast'] = df['cast'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['cast'] = df_movies['cast'].apply(lambda x: ' '.join([i.replace(" ", "") for i in x]))

# extract genres
df_movies['genres'] = df['genres'].apply(lambda x: [i['name'] for i in eval(x)])
df_movies['genres'] = df_movies['genres'].apply(lambda x: ' '.join([i.replace(" ", "") for i in x]))

# add the title
df_movies['title'] = df['title']

# merge fields into a tag field, separating each field with a space
df_movies['tags'] = df_movies['keywords']+' '+df_movies['cast']+' '+df_movies['genres']+' '+df_movies['release_date']

# drop records with empty tags and duplicates
df_movies.drop(df_movies[df_movies['tags']==''].index, inplace=True)
df_movies.drop_duplicates(inplace=True)

# add a fresh index to the dataframe, which we will later use when referring to items in a vector matrix
df_movies['new_id'] = range(0, len(df_movies))

# Reduce the data to relevant columns
df_movies = df_movies[['new_id', 'title', 'tags']]

# display the data
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', False)
print(df_movies.shape)
df_movies.head(5)

		new_id	title							tags
id			
862		0		Toy Story						jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolife TomHanks TimAllen DonRickles JimVarney WallaceShawn JohnRatzenberger AnniePotts JohnMorris ErikvonDetten LaurieMetcalf R.LeeErmey SarahFreeman PennJillette Animation Comedy Family 1995
8844	1		Jumanji							boardgame disappearance basedonchildren'sbook newhome recluse giantinsect RobinWilliams JonathanHyde KirstenDunst BradleyPierce BonnieHunt BebeNeuwirth DavidAlanGrier PatriciaClarkson AdamHann-Byrd LauraBellBundy JamesHandy GillianBarber BrandonObray CyrusThiedeke GaryJosephThorup LeonardZola LloydBerry MalcolmStewart AnnabelKershaw DarrylHenriques RobynDriscoll PeterBryant SarahGilson FloricaVlad JuneLion BrendaLockmuller Adventure Fantasy Family 1995
15602	2		Grumpier Old Men				fishing bestfriend duringcreditsstinger oldmen WalterMatthau JackLemmon Ann-Margret SophiaLoren DarylHannah BurgessMeredith KevinPollak Romance Comedy 1995
31357	3		Waiting to Exhale				basedonnovel interracialrelationship singlemother divorce chickflick WhitneyHouston AngelaBassett LorettaDevine LelaRochon GregoryHines DennisHaysbert MichaelBeach MykeltiWilliamson LamontJohnson WesleySnipes Comedy Drama Romance 1995
11862	4		Father of the Bride Part II		baby midlifecrisis confidence aging daughter motherdaughterrelationship pregnancy contraception gynecologist SteveMartin DianeKeaton MartinShort KimberlyWilliams-Paisley GeorgeNewbern KieranCulkin BDWong PeterMichaelGoetz KateMcGregor-Stewart JaneAdams EugeneLevy LoriAlan Comedy 1995

2.2 Visualizing Text Length

We can use a bar chart to illustrate each movie’s word bag length. This gives us an idea of how detailed the movie descriptions are. Items with short descriptions have, in principle, a lower probability of being recommended later. Recommenders produce better results if the length of the descriptions is somewhat balanced.

# add the tag length to the movies df
df_movies['tag_len'] = df_movies['tags'].apply(lambda x: len(x))

# illustrate the tag text length
sns.displot(data=df_movies.dropna(), bins=list(range(0, 2000, 25)), height=5, x='tag_len', aspect=3, kde=True)
plt.title('Distribution of tag text length')
plt.xlim([0, 2500])

Step #3: Vectorization using TfidfVectorizer

The next step is to create a vector matrix from the Bag of Words model. Each column from the matrix represents a word feature. This step is the basis for determining the similarity of the movies afterward. Before the vectorization, we will remove stop words from the text (e.g., and, it, that, or, why, where, etc.). In addition, I limited the number of features in the matrix to 5000 to reduce training time.

A simple vectorization approach is to determine the word frequency for each movie using a count vectorizer. However, a frequently mentioned disadvantage of this approach is that it does not consider how common a word is across the whole collection of items. For example, some words may appear in almost all items. On the other hand, some words may be prevalent in a few items but are rare in general. So we can argue that observing rare words in an item is more informative than observing common words. Instead of a count vectorizer, we will use a more practical approach called TfidfVectorizer from the scikit-learn package.

Tfidf stands for term frequency-inverse document frequency. Compared to a count vectorizer, the tf-idf vectorizer considers the overall word frequencies and weights the general importance of the words when spanning the vectors. This way, tf-idf can determine which words are more important than others, reducing the model’s complexity and improving performance. This medium article explains the math behind tf-idf vectorization in more detail.
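The difference between the two vectorizers is easiest to see on a tiny example. The following sketch uses three made-up documents in which the word "movie" appears everywhere, "robot" appears twice, and "future" appears only once. The documents and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# made-up mini corpus: "movie" occurs in every document, "future" in only one
docs = ["movie robot future", "movie romance paris", "movie robot war"]

# plain word counts: every word in the first document counts exactly once
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# tf-idf weights: words that are rare across the corpus receive higher weights
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

vocab = tfidf_vec.vocabulary_
first_doc = weights[0]
for word in ["movie", "robot", "future"]:
    print(word, round(float(first_doc[vocab[word]]), 3))
```

In the first document, the count vectorizer treats all three words equally, whereas the tf-idf weights increase from "movie" over "robot" to "future", reflecting how rare each word is in the corpus.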

# set a custom stop list from nltk
stop = list(stopwords.words('english'))

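# create the tf-idf vectorizer, alternatively you can also use CountVectorizer
# stop_words is passed as a list, because recent scikit-learn versions expect a list rather than a set
tfidf =  TfidfVectorizer(max_features=5000, analyzer = 'word', stop_words=stop)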
vectorized_data = tfidf.fit_transform(df_movies['tags'])
count_matrix = pd.DataFrame(vectorized_data.toarray(), index=df_movies['tags'].index.tolist())
print(count_matrix)

			0     1     2     3     4     5     6     7     8     9     ...  4990  4991  4992  4993  4994  4995  4996  4997  4998  4999
862      	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
8844     	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
15602    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
31357    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
11862    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...      	...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
439050   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
111109   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
67758    	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
227506   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
461257   	0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

[45432 rows x 5000 columns]

The vectorization process results in a feature matrix in which each feature is a word from the text bag of words.

We can display features with the get_feature_names_out function from the tfidf vectorizer.

# print feature names
print(tfidf.get_feature_names_out()[940:990])

['climbing' 'clinteastwood' 'clinthoward' 'clive' 'cliveowen'
 'cliverevill' 'cliverussell' 'clone' 'clorisleachman' 'cloviscornillac'
 'clown' 'clugulager' 'clydekusatsu' 'co' 'coach' 'cobb' 'cocaine' 'code'
 'coffin' 'cohen' 'coldwar' 'cole' 'colehauser' 'coleman' 'colinfarrell'
 'colinfirth' 'colinhanks' 'colinkenny' 'colinsalmon' 'colleencamp'
 'college' 'colmfeore' 'colmmeaney' 'coma' 'combat' 'comedian' 'comedy'
 'comicbook' 'comingofage' 'comingout' 'common' 'communism' 'communist'
 'company' 'competition' 'composer' 'computer' 'con' 'concentrationcamp'
 'concert']

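As you can see, the features are individual words, including concatenated names such as 'clinteastwood' and multi-word keywords such as 'comingofage'.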

Step #4: Dimensionality Reduction and Calculating Cosine Similarities

In the previous section, we created a vector matrix that contains movies and features. This matrix is the foundation for calculating similarity scores for all movies. Before we assign feature scores, we will apply dimensionality reduction.

4.1 Dimensionality Reduction using SVD

The matrix spans a high-dimensional vector space with 5000 feature columns. Do we need all of these features? Most likely not. There are likely a lot of words in the matrix that only occur once or twice. On the other hand, some words may occur in almost all movies. How can we deal with this issue?

By reducing this high-dimensional space to fewer, more essential features, we can save some time training our recommender model. We will use TruncatedSVD from the scikit-learn package, a popular algorithm for dimensionality reduction. The algorithm approximates the matrix in a lower-dimensional space, thereby reducing noise and model complexity.

This way, we will reduce the vector space from 5000 to 3000 features.

# reduce dimensionality for improved performance
svd = TruncatedSVD(n_components=3000)
reduced_data = svd.fit_transform(count_matrix)
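To get a feeling for what TruncatedSVD does, the sketch below applies it to a small random sparse matrix. The matrix size, the sparsity threshold, and the number of components are arbitrary assumptions chosen so that the example runs within seconds.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# hypothetical feature matrix: 200 items x 50 features, roughly 10% non-zero entries
rng = np.random.default_rng(42)
dense = rng.random((200, 50))
dense[dense < 0.9] = 0.0
X = csr_matrix(dense)

# project the 50 features onto 10 components
svd = TruncatedSVD(n_components=10, random_state=42)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (200, 10)
print("share of variance kept:", round(float(svd.explained_variance_ratio_.sum()), 3))
```

The attribute explained_variance_ratio_ indicates how much of the original variance the remaining components retain, which can help when choosing n_components. Note that TruncatedSVD accepts sparse input, so it can also be fitted on the sparse output of a vectorizer without converting it to a dense array first.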

4.2 Calculate Text Similarity Scores for all Movies

Now that we have reduced the complexity of our vector matrix, we can calculate the similarity scores for all movies. In this process, we assign a similarity score to all item pairs that measure content closeness according to the position of the items in the vector space.

We use the cosine function to calculate the similarity value of the movies. The cosine similarity measures the cosine of the angle between two vectors. In our case, the vectors are the feature vectors of the movies. The cosine similarity function uses these feature vectors to compare each movie to every other and assigns each pair a similarity value.

A similarity value of -1 means that the two feature vectors point in opposite directions, and the movies are entirely different.
A value of 1 means that the two feature vectors point in the same direction, and the two movies are considered identical.
A value of 0 means that the feature vectors are orthogonal, so the movies share no features. Because tf-idf weights are non-negative, most of our scores will lie between 0 and 1; the SVD step can introduce small negative values.
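The three reference values of the cosine similarity can be checked with a few hand-picked vectors. The helper function cosine and the vectors below are illustrative assumptions and not part of the tutorial code.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 2, 0], [2, 4, 0])    # close to 1
orthogonal = cosine([1, 0, 0], [0, 3, 0])        # 0, no shared features
opposite = cosine([1, 2, 0], [-1, -2, 0])        # close to -1

print(same_direction, orthogonal, opposite)
```

Note that the length of the vectors does not matter: [1, 2, 0] and [2, 4, 0] point in the same direction and therefore obtain a similarity of (approximately) 1.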

The cosine similarity function will calculate pairwise similarities for all movies in our vector matrix. We can determine the number of pairwise comparisons with the formula k²/2, whereby k is the number of items in the vector matrix. In our case, we have a k of 45000 movies. This means the cosine similarity function must calculate about 1 billion similarity scores. So don’t worry if the process takes some time to complete.
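The arithmetic behind this estimate can be reproduced quickly. The sketch assumes k = 45,000 movies as in the text and, as a further assumption, that the full similarity matrix is stored as a dense array of 8-byte floats.

```python
k = 45_000  # approximate number of movies, as stated in the text

# number of distinct movie pairs, excluding comparisons of a movie with itself
unique_pairs = k * (k - 1) // 2

# a full k x k matrix stores every pair twice plus the diagonal
full_matrix_entries = k * k

# memory needed for a dense float64 matrix (8 bytes per entry)
memory_gb = full_matrix_entries * 8 / 1e9

print(f"unique pairs:        {unique_pairs:,}")         # 1,012,477,500
print(f"full matrix entries: {full_matrix_entries:,}")  # 2,025,000,000
print(f"dense float64 size:  {memory_gb:.1f} GB")       # 16.2 GB
```

So besides the computation time, the memory footprint of the similarity matrix is worth keeping in mind. If memory is tight, one option is to compute the similarities only for the movies of interest instead of all pairs at once.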

# compute the cosine similarity matrix
similarity = cosine_similarity(reduced_data)
similarity

array([[ 1.00000000e+00,  9.75542082e-02,  6.00755620e-02, ...,
        -3.03965235e-04,  0.00000000e+00,  5.81243547e-05],
       [ 9.75542082e-02,  1.00000000e+00,  5.92929339e-02, ...,
        -2.97565163e-03,  0.00000000e+00,  4.57945869e-05],
       [ 6.00755620e-02,  5.92929339e-02,  1.00000000e+00, ...,
         9.40459504e-03,  0.00000000e+00, -2.22415551e-04],
       ...,
       [-3.03965235e-04, -2.97565163e-03,  9.40459504e-03, ...,
         1.00000000e+00,  0.00000000e+00, -2.60823346e-04],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 5.81243547e-05,  4.57945869e-05, -2.22415551e-04, ...,
        -2.60823346e-04,  0.00000000e+00,  1.00000000e+00]])

Step #5: Generate Content-based Movie Recommendations

Once you have created the similarity matrix, it’s time to generate some recommendations. We begin by generating recommendations based on a single movie. In the cosine similarity matrix, the most similar movies have the highest similarity scores. Once we have the films with the highest scores, we can visualize the results in a bar chart that shows the cosine similarity scores.

The example below displays the results of the movie “The Matrix.” Oh, how I love this movie 🙂

# create a function that takes in movie title as input and returns a list of the most similar movies
def get_recommendations(title, n, cosine_sim=similarity):
    
    # get the index of the movie that matches the title
    movie_index = df_movies[df_movies.title==title].new_id.values[0]
    print(movie_index, title)
    
    # get the pairwise similarity scores of all movies with that movie and sort the movies based on the similarity scores
    sim_scores_all = sorted(list(enumerate(cosine_sim[movie_index])), key=lambda x: x[1], reverse=True)
    
    # checks if recommendations are limited
    if n > 0:
        sim_scores_all = sim_scores_all[1:n+1]
        
    # get the movie indices of the top similar movies
    movie_indices = [i[0] for i in sim_scores_all]
    scores = [i[1] for i in sim_scores_all]
    
    # return the top n most similar movies from the movies df
    top_titles_df = pd.DataFrame(df_movies.iloc[movie_indices]['title'])
    top_titles_df['sim_scores'] = scores
    top_titles_df['ranking'] = range(1, len(top_titles_df) + 1)
    
    return top_titles_df, sim_scores_all

# generate a list of recommendations for a specific movie title
movie_name = 'The Matrix'
number_of_recommendations = 15
top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
 
# visualize the results
def show_results(movie_name, top_titles_df):
    fig, ax = plt.subplots(figsize=(11, 5))
    sns.barplot(data=top_titles_df, y='title', x= 'sim_scores', color='blue')
    plt.xlim((0,1))
    plt.title(f'Top {len(top_titles_df)} recommendations for {movie_name}')
    pct_values = ['{:.2}'.format(elm) for elm in list(top_titles_df['sim_scores'])]
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

show_results(movie_name, top_titles_df)

Example for the movies “Spectre” and “The Lion King”

Step #6: Generate Content-based Movie Recommendations for Multiple Movies

But what if you want to generate recommendations for specific users that have seen several movies? For this, we can aggregate the similarity scores for all films the user has seen. This way, we create a new dataframe that sums up similarity scores. To return the top-recommended movies, we can sort this dataframe by similarity scores and return the top elements.

# list of movies a user has seen
movie_list = ['The Lion King', 'Seven', 'RoboCop 3', 'Blade Runner', 'Quantum of Solace', 'Casino Royale', 'Skyfall']

# create a copy of the movie dataframe and add a column in which we aggregated the scores
user_scores = pd.DataFrame(df_movies['title'])
user_scores['sim_scores'] = 0.0

# top number of scores to be considered for each movie
number_of_recommendations = 10000
for movie_name in movie_list:
    top_titles_df, _ = get_recommendations(movie_name, number_of_recommendations)
    # aggregate the scores per title
    user_scores = pd.concat([user_scores, top_titles_df[['title', 'sim_scores']]]).groupby('title', as_index=False)['sim_scores'].sum()
# sort and print the aggregated scores
user_scores.sort_values(by='sim_scores', ascending=False)[1:20]
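The aggregation idea can also be shown on a toy similarity matrix. In the following sketch, the four titles and their similarity values are made up, and, unlike the code above, the movies the user has already watched are removed explicitly before ranking.

```python
import numpy as np
import pandas as pd

# made-up titles and a symmetric toy similarity matrix
titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.7],
    [0.3, 0.4, 0.7, 1.0],
])

# movies the hypothetical user has seen
watched = ["A", "B"]
watched_idx = [titles.index(t) for t in watched]

# sum the similarity rows of all watched movies to get one score per title
scores = pd.Series(sim[watched_idx].sum(axis=0), index=titles)

# remove the watched movies and sort the rest by aggregated score
recommendations = scores.drop(watched).sort_values(ascending=False)
print(recommendations)
```

Movie "D" ranks ahead of "C" because it is more similar to both watched movies. Summing the scores treats every watched movie equally; weighting the rows by user ratings would be a possible refinement.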

Summary

In this tutorial, you have learned to implement a simple content-based recommender system for movie recommendations in Python. We have used several movie-specific details to calculate a similarity matrix for all movies in our dataset. Finally, we have used this model to generate recommendations for two cases:

Films that are similar to a specific movie
Films that are recommended based on the watchlist of a particular user.

A downside of content-based recommenders is that you cannot test their performance unless you know how users perceived the recommendations. This is because content-based recommenders can only determine which items in a dataset are similar. To understand how well the suggestions work, you must include additional data about actual user preferences.

More advanced recommenders will combine content-based recommendations with user-item interactions (e.g., collaborative filtering). Such models are called hybrid recommenders, but this is something for another article.

Image created with Midjourney

The post Create a Personalized Movie Recommendation Engine using Content-based Filtering in Python appeared first on relataly.com.

Build a High-Performing Movie Recommender System using Collaborative Filtering in Python

Florian Follonier — Mon, 31 May 2021 10:29:35 +0000

The digital age presents us with an unmanageable number of decisions and even more options. Which series to watch today? What song to listen to next? Nowadays, the internet and its vast content offer too many choices. But there is hope – recommender systems are here to solve this problem and support our decision-making. They rank among the most fascinating use cases for machine learning and intelligently filter information to present us with a smaller set of options that most likely fit our tastes and needs. This tutorial will explore recommender systems and implement a movie recommender in Python that uses “Collaborative Filtering.”

This article is structured as follows: We begin by briefly going through the basics of different types of recommender systems. Then we will look at the most common recommender algorithms and go into more detail on Collaborative Filtering. Once equipped with this conceptual understanding, we will develop our recommender system using the popular 100k Movies Dataset. We will train and test a recommender model to predict movie ratings. The library used is the Scikit-Surprise Python library. The recommendation approach combines Collaborative Filtering and Singular Value Decomposition (SVD).

An Overview of Recommender Techniques

The first attempts with recommendation systems reach back to the 1970s. The approach was relatively simple and categorized users into groups to suggest the same content to all users in the same group. However, it was a breakthrough because, for the first time, a program could make personalized recommendations. As the importance of recommendation engines increased, so did the interest in improving their predictions.

With the rise of the internet and the rapidly growing amount of data available on the web, filtering relevant content has become increasingly important. Large tech companies such as Netflix (TV shows and movies), Amazon (products), or Facebook (user content and profiles) understood early on that they could use recommendation systems to personalize the selection of user content shown to their customers. They all face the same challenge of having massive content and only limited space to display it to their users. Therefore, it has become crucial for them to select and display only the content that matches the individual interests of their users.

Internet companies are predestined for using recommender systems, as they have massive amounts of user data available. The user data is the foundation for generating personalized recommendations on a large scale by analyzing behavior patterns among larger groups of users to tailor suggestions to the taste of individuals.

The more choices we have, the more difficult it gets to decide

Three Common Approaches to Recommender Systems

Nowadays, many different approaches can generate recommendations. However, most recommender systems use one of the following three techniques:

Popular techniques used to build recommender systems

Content-based Filtering

Content-based filtering is a technique that recommends similar items based on item content. Naturally, this approach is based on metadata to determine which items are similar. For example, in the case of movie recommendations, the algorithm would look at the genre, cast, or director of a movie. Models may also consider metadata on users, such as age, gender, etc., to suggest similar content to similar users. There are different methods to calculate the similarity, for example, Cosine Similarity or Minkowski Distance.
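To make the two similarity measures a little more concrete, here is a minimal sketch with NumPy. The genre vectors and helper names are made up for illustration; a real content-based recommender would typically derive its feature vectors from much richer metadata.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=2 is the Euclidean distance)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical one-hot genre vectors: [Action, Comedy, Family, Thriller]
movie_a = [1, 0, 0, 1]   # an action thriller
movie_b = [1, 0, 0, 1]   # another action thriller
movie_c = [0, 1, 1, 0]   # a family comedy

sim_ab = cosine_similarity(movie_a, movie_b)    # identical genres
sim_ac = cosine_similarity(movie_a, movie_c)    # no genre in common
dist_ac = minkowski_distance(movie_a, movie_c)  # Euclidean distance

print(round(sim_ab, 3), round(sim_ac, 3), round(dist_ac, 3))
```

Note that the two measures point in opposite directions: a higher cosine similarity means more similar items, while a higher distance means less similar items.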

A significant challenge in content-based Filtering is the transferability of user preference insights from one item type to another. Often, content-based recommenders struggle to transfer user actions on one item (e.g., book) to other content types (e.g., clothing). In addition, content-based systems tend to develop tunnel vision, leading the engine to recommend more and more of the same.

A separate article covers content-based filtering in more detail and shows how to implement this approach for movie recommendations.

Collaborative Filtering

Collaborative Filtering is a well-established approach used to build recommendation systems. The recommendations generated through Collaborative Filtering are based on past interactions between a user and a set of items (movies, products, etc.) that are matched against past item-user interactions within a larger group of people. The main idea is to use the interactions between a group of users and a group of items to guess how users rate items they have not yet rated before.

A well-known challenge of collaborative filters is the cold start problem, which refers to new users entering the system without any ratings. The engine does not know their interests and cannot make meaningful recommendations. The same applies to new items entering the system (e.g., products) that have not yet received any ratings. In addition, recommendations can become self-reinforcing: popular content that many users have rated is recommended to almost all other users, making this content even more popular. On the other hand, the engine hardly suggests content with few or no ratings, so no one will rate this content.

Hybrid Approach

Combining the previous two techniques in a hybrid approach is also possible. We can implement hybrid approaches by generating content-based and collaborative-based predictions separately and then combining them. The result is a model that considers both the interactions between users and items and context information. Hybrid recommender systems often achieve better results than recommendation approaches that use only one of the underlying techniques.

Netflix is known to use a hybrid recommendation system. Its engine recommends content to its users based on similar users’ viewing and search habits (Collaborative Filtering). At the same time, it also recommends series and movies whose characteristics match the content users rated highly (content-based Filtering).
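One simple way to combine separately generated predictions is a weighted average. The sketch below is only an illustration under stated assumptions: the scores, movie names, and the hybrid_score helper are hypothetical, and both inputs are assumed to be normalized to the same 0-1 range already. Production systems often use more elaborate combination strategies.

```python
def hybrid_score(content_score, collab_score, weight=0.5):
    """Blend two scores that are already on the same 0-1 scale.

    weight is the share given to the content-based score.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * content_score + (1.0 - weight) * collab_score

# Hypothetical, already normalized scores for three candidate movies
content_scores = {"Movie A": 0.9, "Movie B": 0.4, "Movie C": 0.6}
collab_scores = {"Movie A": 0.2, "Movie B": 0.8, "Movie C": 0.7}

blended = {
    title: hybrid_score(content_scores[title], collab_scores[title], weight=0.5)
    for title in content_scores
}

# Rank the candidates by their blended score, best first
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)
```

The weight is a tuning parameter: a value close to 1 trusts the content-based score more, which can help for new items that have few ratings.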

How Model-based Collaborative Filtering Works

We can further differentiate between memory-based and model-based Collaborative Filtering. This tutorial focuses on model-based Collaborative Filtering, which is more commonly used.

Behavioral Patterns: Dependencies among Users and Items

Collaborative filtering searches for behavioral patterns in interactions between a group of users and a group of items to infer the interests of individuals. The input data for collaborative Filtering is typically in the form of a user/item matrix filled with ratings (as shown below).

Patterns can exist in the user/item matrix in the form of dependencies between users and items. Some dependencies are easy to grasp. For example, assume two users, Ron and Joe, have rated movies. Ron enjoyed Batman, Indiana Jones, Star Wars, and Godzilla. Joe enjoyed the same movies as Ron, except Godzilla, which he has not yet rated. Based on the similarity between Joe and Ron, we assume that Joe would also enjoy Godzilla. Similar dependencies exist between items in that some movies receive high ratings from the same users.

Things get more complex as latent dependencies are present in the data. Imagine Bob gave a three-star rating to five different movies. Another user, Jenny, rated the same movies as Bob but always gave four stars, which is an example of latent dependency. There is some form of dependence between the two users, and although it is not as significant as in the first example, considering latent dependencies will improve predictions.

user/movies matrix
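The Ron and Joe example can be sketched in a few lines of NumPy. This is a simplified neighborhood-style illustration, not the model-based approach used later in this tutorial. All rating values and the third user, Ann, are hypothetical.

```python
import numpy as np

movies = ["Batman", "Indiana Jones", "Star Wars", "Godzilla"]

# Rows of a tiny user/item matrix; np.nan marks a movie the user has not rated.
# All rating values are hypothetical.
ratings = {
    "Ron": np.array([5.0, 4.0, 5.0, 4.0]),
    "Joe": np.array([5.0, 4.0, 5.0, np.nan]),
    "Ann": np.array([1.0, 2.0, 1.0, 5.0]),
}

def similarity(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    both = ~np.isnan(u) & ~np.isnan(v)
    if not both.any():
        return 0.0
    a, b = u[both], v[both]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target_user = "Joe"
movie_idx = movies.index("Godzilla")

# Compare Joe with every other user who has rated Godzilla
candidates = {
    name: similarity(ratings[target_user], r)
    for name, r in ratings.items()
    if name != target_user and not np.isnan(r[movie_idx])
}
nearest = max(candidates, key=candidates.get)

# Use the most similar user's rating as Joe's predicted rating
joe_godzilla = float(ratings[nearest][movie_idx])
print(nearest, joe_godzilla)
```

Because Joe's known ratings match Ron's exactly, Ron is selected as the nearest neighbor, and his Godzilla rating becomes the prediction for Joe.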

Machine Learning and Dimensionality Reduction

Model-based collaborative filtering techniques estimate the parameters of statistical models to predict how individual users would rate an unrated item. A widely used approach formulates this problem as a supervised prediction task that considers items over users as features and ratings as prediction labels (as shown in the matrix). We can use various algorithms to solve the resulting optimization problem, including gradient-based techniques or alternating least squares.

However, user/item matrices can become very large, making searching for patterns computationally expensive. Also, users will typically rate only a tiny fraction of the items in the matrix, so algorithms must deal with an abundant number of missing values (sparse matrix). Therefore, combining machine learning or deep learning with techniques for dimensionality reduction has become state-of-the-art.

One of the most widely used techniques for dimensionality reduction is matrix factorization. This approach compresses the initial sparse user/item matrix and represents it as the product of smaller matrices that describe items and users as latent feature vectors (as shown below). These matrices are densely populated and thus easier to handle, and they also enable the model to uncover latent dependencies among items and users, which increases model accuracy.

Matrix Factorization is applied to the sparse Items / User Matrix.
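The idea behind matrix factorization can be illustrated with a small NumPy sketch that learns user and item factors with stochastic gradient descent. This is a simplified illustration under stated assumptions: the rating matrix, the number of factors, the learning rate, and the regularization strength are all made up, and the sketch omits the bias terms that library implementations such as the SVD algorithm in Surprise add by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small user/item matrix with hypothetical ratings; 0 marks "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 5, 1],
    [1, 1, 0, 5],
    [0, 1, 2, 4],
], dtype=float)
observed = np.argwhere(R > 0)   # (user, item) index pairs of known ratings

n_factors, lr, reg = 2, 0.01, 0.02
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))  # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))  # item factors

def rmse():
    """Root mean squared error on the observed ratings only."""
    errors = [R[u, i] - P[u] @ Q[i] for u, i in observed]
    return float(np.sqrt(np.mean(np.square(errors))))

rmse_before = rmse()
for epoch in range(2000):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()                        # keep the old user factors for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])
rmse_after = rmse()

# Dense reconstruction: also contains estimates for the unrated cells
R_hat = P @ Q.T
print(round(rmse_before, 3), round(rmse_after, 3))
print(np.round(R_hat, 2))
```

The product of the two small factor matrices fills every cell of the matrix, including those the users never rated. These filled-in values are the predicted ratings.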

Python Libraries for Collaborative Filtering

So far, only a few Python libraries support model-based collaborative filtering out of the box. The most well-known libraries for recommender systems are probably Scikit-Surprise and fast.ai for PyTorch.

Below you find an overview of the different algorithms that these libraries support.

Implementing a Movie Recommender in Python using Collaborative Filtering

Now it’s time to get our hands dirty and begin implementing our movie recommender. As always, you find the code in the relataly GitHub repository.

Prerequisites

Before beginning the coding part, make sure that you have set up your Python 3 environment and required packages. If you don’t have an environment, consider the Anaconda Python environment. Follow this tutorial to set it up.

Also, make sure you install all required packages. In this tutorial, we will be working with the standard packages pandas, NumPy, and Matplotlib.

In addition, we will be using Seaborn for visualization and the recommender systems library Scikit-Surprise. You can install the surprise package from the conda-forge channel with the following command:

conda install -c conda-forge scikit-surprise

You can install the other packages using standard console commands:

pip install <package name>
conda install <package name> (if you are using the Anaconda package manager)

About the IMDB Movies Dataset

We will train our movie recommendation model on a popular Movies Dataset (you can download it from grouplens.org). The MovieLens recommendation service collected the Dataset from 610 users between 1996 and 2018. Unpack the data into the working folder of your project.

The full Dataset contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users. However, we will be working with a subset of the data, “ratings_small.csv,” which contains around 100,000 ratings (100,004 in the version loaded below) from 671 users on roughly 9,000 movies.

The Dataset contains the following files, from which we will only use the first two (Source of the data description: Kaggle.com):

movies_metadata.csv: The main Movies Metadata file contains information on 45,000 movies featured in the Full MovieLens Dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries, and companies.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies. Each line corresponds to a single 5-star movie rating with half-star increments (0.5 – 5.0 stars).

There are several other files included that we won’t use.

IMDB Movie Database

Step #1: Load the Data

Ensure you have downloaded and unpacked the data and the required packages.

You can then load the movie data into our Python project using the code snippet below. We do not need all of the files in the movie dataset and only work with the following two.

movies_metadata.csv
ratings_small.csv

1.1 Load the Movies Data

First, we will load the movies_metadata, which contains a list of all movies and meta information such as the release year, a short description, etc.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split, cross_validate
from ast import literal_eval

# in case you have placed the files outside of your working directory, you need to specify a path
path = '' # for example: 'data/movie_recommendations/'  

# load the movie metadata
df_moviesmetadata=pd.read_csv(path + 'movies_metadata.csv', low_memory=False) 
print(df_moviesmetadata.shape)
print(df_moviesmetadata.columns)
df_moviesmetadata.head(1)

(45466, 24)
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
	adult	belongs_to_collection								budget		genres												homepage								id	imdb_id		original_language	original_title	overview	...	release_date	revenue	runtime	spoken_languages	status	tagline	title	video	vote_average	vote_count
0	False	{'id': 10194, 'name': 'Toy Story Collection', ...	30000000	[{'id': 16, 'name': 'Animation'}, {'id': 35, '...	http://toystory.disney.com/toy-story	862	tt0114709	en					Toy Story		Led by Woody, Andy's toys live happily in his ...	...	1995-10-30	373554033.0	81.0	[{'iso_639_1': 'en', 'name': 'English'}]	Released	NaN	Toy Story	False	7.7	5415.0
1 rows × 24 columns

1.2 Load the Ratings Data

We proceed by loading the rating file. This file contains the movie ratings for each user, the movieId, and a timestamp.

In addition, we plot the value counts of the ratings in our Dataset.

# load the movie ratings
df_ratings=pd.read_csv(path + 'ratings_small.csv', low_memory=False) 

print(df_ratings.shape)
print(df_ratings.columns)
df_ratings.head(3)

# count how often each rating value occurs and plot the distribution
sns.set_theme(style="whitegrid")
ratings_count = df_ratings['rating'].value_counts().sort_index()
sns.barplot(x=ratings_count.index, y=ratings_count.values, color="b")

(100004, 4)
Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')
	userId	movieId	rating	timestamp
0	1		31		2.5		1260759144
1	1		1029	3.0		1260759179
2	1		1061	3.0		1260759182

As we can see, most of the ratings in our Dataset are positive.

Step #2: Preprocessing and Cleaning the Data

We continue with preprocessing the data. The recommendations of a collaborative filtering approach rely solely on the interactions between users and items. This means that training a prediction model does not require the meta-information of the movies. Nevertheless, we will prepare the metadata because it is nicer to display recommendations with the movie title, release year, and so on, instead of just ids.

2.1 Clean the Movie Data

Unfortunately, the data quality of the movies’ metadata is not excellent, so we need to fix a few things. The following operations will change some data types to integers, extract the release year and genres, and remove some records with incorrect data.

# remove records with invalid ids
df_mmeta = df_moviesmetadata.drop([19730, 29503, 35587])

# keep the columns we want to show alongside the recommendations
df_movies = df_mmeta[['title', 'vote_count', 'vote_average']].copy()

# extract the release year (missing or invalid dates become <NA>)
df_movies['year'] = pd.to_datetime(df_mmeta['release_date'], errors='coerce').dt.year.astype('Int64')

# extract genres
df_movies['genres'] = df_mmeta['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# change the index to movie_id
df_movies['movieId'] = pd.to_numeric(df_mmeta['id'])
df_movies = df_movies.set_index('movieId')

# convert the vote count to integers (records without a vote count are dropped)
df_movies = df_movies.dropna(subset=['vote_count']).copy()
df_movies['vote_count'] = df_movies['vote_count'].astype('int')
df_movies

title									vote_count	vote_average	year	genres
		movieId					
862		Toy Story						5415			7.7	1995	[Animation, Comedy, Family]
8844	Jumanji							2413			6.9	1995	[Adventure, Fantasy, Family]
15602	Grumpier Old Men				92				6.5	1995	[Romance, Comedy]
31357	Waiting to Exhale				34				6.1	1995	[Comedy, Drama, Romance]
11862	Father of the Bride Part II		173				5.7	1995	[Comedy]
...	...	...	...	...	...
49279	The Man with the Rubber Head	29				7.6	1901	[Comedy, Fantasy, Science Fiction]
49271	The Devilish Tenant				12				6.7	1909	[Fantasy, Comedy]
49280	The One-Man Band				22				6.5	1900	[Fantasy, Action, Thriller]
404604	Mom								14				6.6	2017	[Crime, Drama, Thriller]
30840	Robin Hood						26				5.7	1991	[Drama, Action, Romance]
22931 rows × 5 columns

2.2 Clean the Ratings Data

Compared to the movie metadata, not much more needs to be done to the rating data. Here we just put the timestamp into a readable format.

In one of the next steps, we will use the Reader class from the Surprise library to parse the ratings and put them into a format compatible with the recommendation algorithms from the Surprise library. The Reader needs the data in a form where each line contains only one rating and respects the following structure:

user; item ; rating ; timestamp

# drop na values (copy so that we can modify the result safely)
df_ratings_temp = df_ratings.dropna().copy()

# convert the unix timestamp (in seconds) into a datetime
df_ratings_temp['timestamp'] = pd.to_datetime(df_ratings_temp['timestamp'], unit='s')

print(f'unique users: {len(df_ratings_temp.userId.unique())}, ratings: {len(df_ratings_temp)}')
df_ratings_temp.head()

unique users: 671, ratings: 100004
	userId		movieId	rating	timestamp
0	1			31		2.5		2009-12-14 02:52:24
1	1			1029	3.0		2009-12-14 02:52:59
2	1			1061	3.0		2009-12-14 02:53:02
3	1			1129	2.0		2009-12-14 02:53:05
4	1			1172	4.0		2009-12-14 02:53:25

Step #3: Split the Data into Train and Test

Next, we will split the data into train and test sets. In this way, we ensure that we can later evaluate the performance of our recommender model on data that the model has not yet seen.

# The Reader class is used to parse ratings.
# It expects only one rating per line, as in the df_ratings_temp dataframe above.
# The ratings in this dataset range from 0.5 to 5.0, so we set the scale explicitly.
reader = Reader(rating_scale=(0.5, 5.0))
ratings_by_users = Dataset.load_from_df(df_ratings_temp[['userId', 'movieId', 'rating']], reader)

# Split the Data into train and test
train_df, test_df = train_test_split(ratings_by_users, test_size=.2)

Once we have split the data into train and test, we can train the recommender model.

Step #4: Train a Movie Recommender using Collaborative Filtering

Training the SVD model requires only two lines of code. The first line creates an untrained model that uses Probabilistic Matrix Factorization for dimensionality reduction. The second line will fit this model to the training data.

# train an SVD model
svd_model = SVD()
svd_model_trained = svd_model.fit(train_df)

Step #5: Evaluate Prediction Performance using Cross-Validation

Next, it is time to validate the performance of our movie recommendation program. For this, we use k-fold cross-validation. As a reminder, cross-validation involves splitting the Dataset into different folds and then measuring the prediction performance based on each fold.

We can measure model performance using indicators such as mean absolute error (MAE) or mean squared error (MSE). The MAE is the average absolute difference between the predicted and the actual ratings. We chose this measure because it is easy to understand.
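To make these error measures concrete, the following minimal sketch computes MAE, MSE, and RMSE by hand with NumPy. The rating values are made up for illustration and are not taken from the dataset used in this tutorial.

```python
import numpy as np

actual = np.array([4.0, 3.5, 5.0, 2.0])      # hypothetical true ratings
predicted = np.array([3.5, 3.5, 4.0, 3.0])   # hypothetical model predictions

errors = predicted - actual

mae = float(np.mean(np.abs(errors)))   # mean absolute error
mse = float(np.mean(errors ** 2))      # mean squared error
rmse = float(np.sqrt(mse))             # root mean squared error

print(f'MAE: {mae}, MSE: {mse}, RMSE: {rmse}')
```

Because the MSE and RMSE square the errors, they penalize large deviations more strongly than the MAE, which weighs all deviations equally.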

# 10-fold cross validation 
cross_val_results = cross_validate(svd_model_trained, ratings_by_users, measures=['RMSE', 'MAE', 'MSE'], cv=10, verbose=False)
test_mae = cross_val_results['test_mae']

# mean absolute error per fold
df_test_mae = pd.DataFrame(test_mae, columns=['Mean Absolute Error'])
df_test_mae.index = np.arange(1, len(df_test_mae) + 1)
df_test_mae.sort_values(by='Mean Absolute Error', ascending=False).head(15)

# plot an overview of the performance per fold
plt.figure(figsize=(6,4))
sns.set_theme(style="whitegrid")
sns.barplot(y='Mean Absolute Error', x=df_test_mae.index, data=df_test_mae, color="b")
# plt.title('Mean Absolute Error')

The chart above shows that the mean deviation of our predictions from the actual rating is a little below 0.7. The result is not terrific but ok for a first model. In addition, there are no significant differences between the performance in the different folds. Let’s keep in mind that the MAE says little about possible outliers in the predictions. However, since the ratings are bounded (0.5 to 5.0 stars), the influence of outliers is naturally limited.

Step #6: Generate Predictions

Finally, we will use our movie recommender to generate a list of suggested movies for a specific test user. The predictions will be based on the user’s previous movie ratings.

# predict ratings for a single user_id and for all movies
user_id = 400 # some test user from the ratings file

# create the predictions
pred_series= []
df_ratings_filtered = df_ratings[df_ratings['userId'] == user_id]

print(f'number of ratings: {df_ratings_filtered.shape[0]}')
for movie_id, name in zip(df_movies.index, df_movies['title']):
    # check if the user has already rated a specific movie from the list
    rating_real = df_ratings_filtered.query(f'movieId == {movie_id}')['rating'].values[0] if movie_id in df_ratings_filtered['movieId'].values else 0
    # generate the prediction
    rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=False)
    # add the prediction to the list of predictions
    pred_series.append([movie_id, name, rating_pred.est, rating_real])

# print the results
df_recommendations = pd.DataFrame(pred_series, columns=['movieId', 'title', 'predicted_rating', 'actual_rating'])
df_recommendations.sort_values(by='predicted_rating', ascending=False).head(15)

		movieId	title								predicted_rating	actual_rating
4234	4993	5 Card Stud							4.721481			0.0
3194	318		The Million Dollar Hotel			4.648623			5.0
442		858		Sleepless in Seattle				4.506962			0.0
236		527		Once Were Warriors					4.484758			4.0
2426	926		Galaxy Quest						4.465653			0.0
6532	905		Pandora's Box						4.452688			0.0
710		260		The 39 Steps						4.390318			0.0
8787	3683	Flags of Our Fathers				4.386821			0.0
5400	899		Broken Blossoms						4.384220			0.0
5068	296		Terminator 3: Rise of the Machines	4.383057			4.0
372		2019	Hard Target							4.365834			0.0
7254	919		Blood: The Last Vampire				4.356012			0.0
3295	4973	Under the Sand						4.355750			0.0
3869	194		Amélie								4.353614			0.0
8631	1948	Crank								4.344286			0.0

Alternatively, we can predict how well a specific user will rate a movie by handing the user_id and the movie_id to the model.

# predict ratings for the combination of user_id and movie_id
user_id = 217 # some test user from the ratings file
movie_id = 4002
rating_real = df_ratings.query(f'movieId == {movie_id} & userId == {user_id}')['rating'].values[0]
movie_title = df_movies[df_movies.index == 862]['title'].values[0]

print(f'Movie title: {movie_title}')
print(f'Actual rating: {rating_real}')

# predict and show the result
rating_pred = svd_model_trained.predict(user_id, movie_id, rating_real, verbose=True)

Movie title: Toy Story
Actual rating: 4.5
user: 217        item: 4002       r_ui = 4.50   est = 3.98   {'was_impossible': False}

Summary

Congratulations on learning how to develop a movie recommendation system in Python! In this article, you learned about the SVD model, which uses matrix factorization and collaborative filtering to predict movie ratings for a given user. We also demonstrated how to perform cross-validation on a movie dataset and use the model to generate movie recommendations.

Several other approaches can be used to develop a movie recommendation system, including content-based filtering, which uses features of the movies themselves (such as genre, director, and cast) to make recommendations, and hybrid systems, which combine the strengths of multiple approaches.

Regardless of the approach used, building a movie recommendation system can be a useful tool for recommending movies to users based on their past preferences and can help increase engagement and satisfaction with a movie streaming service or website.

If you like the post, please let me know in the comments, and don’t forget to follow us on Twitter to stay up to date on upcoming articles.

The post Build a High-Performing Movie Recommender System using Collaborative Filtering in Python appeared first on relataly.com.

Forecasting Beer Sales with ARIMA in Python

Florian Follonier — Wed, 03 Feb 2021 22:23:08 +0000

Time series analysis and forecasting is a tough nut to crack, but the ARIMA model has been cracking it for decades. ARIMA, short for “Auto-Regressive Integrated Moving Average,” is a powerful statistical modeling technique for time series analysis. It’s particularly effective when the time series you’re analyzing follows a clear pattern, like seasonal changes in weather or sales. ARIMA has been used to forecast everything from beer sales to order quantities, and this tutorial will show you how to build your own ARIMA model in Python. You’ll be making predictions like a pro in no time!

This tutorial proceeds in two parts: The first part covers the concepts behind ARIMA. You will learn how ARIMA works, what Stationarity means, and when it is appropriate to use ARIMA. The second part is a Python hands-on tutorial that applies auto-ARIMA to the Sales Forecasting domain. We’ll be working with a time series of beer sales, and our goal is to predict how the beer sales quantities will evolve in the coming years. First, we check if the time series is stationary. Then we train an ARIMA forecasting model. Finally, we use the model to produce a sales forecast and measure the model’s performance.

About Sales Forecasting

Sales forecasting is a crucial business strategy that involves predicting future sales volumes for a product (for example, beer) or service. It leverages sophisticated statistical and analytical techniques, such as time series analysis or machine learning algorithms, to scrutinize historical sales data. By identifying trends and patterns within this data, businesses can make informed predictions about their future sales performance.

This strategic forecasting plays a pivotal role in business operations. It is instrumental in guiding key decisions surrounding production, inventory management, staffing, and various other operational elements. By honing in on accurate sales forecasting, businesses can strike the perfect balance – maintaining enough inventory to meet customer demand without overproducing or overstocking. This equilibrium ensures a smooth flow in the supply chain and avoids unnecessary costs tied to excess production or storage.

Furthermore, sales forecasting serves as a roadmap for business growth. It aids in identifying potential market opportunities and predicting future sales revenue. This valuable foresight enables businesses to strategically plan their expansion, ensuring resources are optimally utilized and future goals are met. With this in-depth understanding of sales forecasting, businesses can stay ahead of market trends, navigate through business challenges, and ultimately steer towards success.

Businesses rely on sales forecasting to make informed decisions about production, inventory management, staffing, and other key operational aspects. Image created with Midjourney

Introduction to ARIMA Time Series Modelling

ARIMA models provide an alternative approach to time series forecasting that differs significantly from machine learning methods. Working with ARIMA requires a good understanding of Stationarity and knowledge of the transformations used to make time-series data stationary. The concept of Stationarity is, therefore, first on our schedule.

The Concept of Stationarity

Stationarity is an essential concept in stochastic processes that describes the nature of a time series. We consider a time series strictly stationary if its statistical properties do not change over time. In practice, a weaker condition is usually sufficient: summary statistics, such as the mean and variance, stay constant over time. However, the time-series data we encounter in the real world often show a trend or significant irregular fluctuations, making them non-stationary.

So why is Stationarity such an essential concept for ARIMA? If a time series is stationary, we can assume that the past values of the time series are predictive of future development. In other words, a stationary time series exhibits consistent behavior that makes it predictable. On the other hand, a non-stationary time series is characterized by a kind of random behavior that will be difficult to capture in modeling. Namely, if random movements characterized the past, there is a high probability that the future will be no different.

Fortunately, in many cases, it is possible to transform a time series that is non-stationary into a stationary form and, in this way, build better prediction models.

A stationary Vs. a non-stationary time series

How to Test Whether a Time Series is Stationary

The first step in the ARIMA modeling approach is determining whether a time series is stationary. There are different ways to determine whether a time series is stationary:

Plotting: We can plot the time series and visually check if it shows consistent behavior or changes over a more extended period.
Summary statistics: We can split the time series into different periods and calculate summary statistics, such as the mean and the variance, for each period. If these metrics change significantly between periods, the time series is non-stationary. However, the results also depend on how the periods are chosen, which can lead to false conclusions.
Statistical tests: There are various tests to determine the stationarity of a time series, such as Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller, or Phillips–Perron. These tests systematically check a time series and measure the results against a null hypothesis, providing an indicator of the trustworthiness of the results.
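To give an idea of what such a test computes, here is a simplified sketch of a plain Dickey-Fuller regression written with NumPy only. It is not the augmented test that libraries implement (there are no lagged difference terms), the function name `dickey_fuller_stat` and the toy series are assumptions made for illustration, and in practice you would call a library routine instead.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of gamma in: diff(y)[t] = a + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # intercept and lagged level
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix
    return coef[1] / np.sqrt(cov[1, 1])

CRITICAL_5PCT = -2.86  # asymptotic 5% critical value for the constant-only case

rng = np.random.default_rng(0)
noise = rng.normal(size=500)              # stationary white noise
trending = 0.5 * np.arange(500) + noise   # non-stationary: the mean grows over time

for name, series in [("noise", noise), ("trending", trending)]:
    stat = dickey_fuller_stat(series)
    verdict = "stationary" if stat < CRITICAL_5PCT else "non-stationary"
    print(name, round(stat, 2), verdict)
```

A strongly negative statistic rejects the null hypothesis of a unit root, so the white noise is classified as stationary while the trending series is not.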

What is an (S)ARIMA Model?

As the name implies, ARIMA uses autoregression (AR), integration (differencing), and moving averages (MA) to fit a linear regression model to a time series.

ARIMA Parameters

The default notation for ARIMA is a model with parameters p, d, and q, whereby each parameter takes an integer value:

d (differencing): In the case of a non-stationary time series, there is a chance to remove a trend from the data by differencing once or several times, thus bringing the data to a stationary state. The model parameter d determines the order of differencing. A value of d = 0 simplifies the ARIMA model to an ARMA model, lacking the integration aspect. If this is the case, we do not need to difference the series because it is already stationary.
p (order of the AR terms): The autoregressive process describes the dependent relationship between an observation and several lagged observations (lags). Predictions are then based on past data from the same time series using linear functions. p = 1 means the model uses values that lag by one period.
q (order of the MA terms): The parameter q determines the number of lagged forecast errors in the prediction equation. In contrast to the AR process, the MA process assumes that values at a future point in time depend on the errors made by predictions at current and past points in time. This means that it is not previous events that determine the predictions but rather the previous estimation or prediction errors used to calculate the following time series value.
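The difference between the AR and MA terms becomes clearer when we feed a single shock into each process. The sketch below is a hand-written recursion for illustration only; the helper `simulate_arma` and the coefficient values are assumptions, not part of any library.

```python
import numpy as np

def simulate_arma(phi, theta, eps):
    """Generate y[t] = sum_i phi[i]*y[t-1-i] + eps[t] + sum_j theta[j]*eps[t-1-j]."""
    eps = np.asarray(eps, dtype=float)
    y = np.zeros_like(eps)
    for t in range(len(eps)):
        # AR part: weighted sum of the previous values of the series itself
        ar_part = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        # MA part: weighted sum of the previous errors (shocks)
        ma_part = sum(theta[j] * eps[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        y[t] = ar_part + eps[t] + ma_part
    return y

shock = [1.0, 0.0, 0.0, 0.0]  # a single unit shock at t = 0

ar1 = simulate_arma(phi=[0.5], theta=[], eps=shock)  # p = 1, q = 0
ma1 = simulate_arma(phi=[], theta=[0.5], eps=shock)  # p = 0, q = 1
print(ar1)  # the shock decays geometrically: 1, 0.5, 0.25, 0.125
print(ma1)  # the shock is felt for exactly one more period: 1, 0.5, 0, 0
```

In the AR(1) process the shock echoes through all later values, whereas in the MA(1) process it disappears after q = 1 periods.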

SARIMA

In the real world, many time series have seasonal effects. Examples are monthly retail sales figures, temperature reports, weekly airline passenger data, etc. To consider this, we can specify a seasonal range (e.g., m=12 for monthly data) and additional seasonal AR or MA components for our model that deal with seasonality. Such a model is also called a SARIMA model, and we can define it as a model(p, d, q)(P, D, Q)[m].
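The seasonal differencing order D works like d, except that each value is compared with the value one full season earlier. Here is a minimal sketch on an invented monthly series (trend and seasonal pattern are assumptions chosen so that the result is easy to verify by hand):

```python
import numpy as np
import pandas as pd

months = pd.date_range("2000-01-01", periods=36, freq="MS")
seasonal_pattern = np.tile(np.arange(12, dtype=float), 3)  # repeats every 12 months
trend = 2.0 * np.arange(36)                                # grows by 2 per month
series = pd.Series(trend + seasonal_pattern, index=months)

# Seasonal differencing with m = 12: y[t] - y[t-12]
seasonal_diff = series.diff(12).dropna()
print(seasonal_diff.unique())  # a single value: 12 months * 2 per month = 24
```

The repeating pattern cancels out completely, and what remains is the growth over one year.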

Auto-(S)ARIMA

When working with ARIMA, we can set the model parameters manually or use auto-ARIMA and let the search procedure find suitable parameters. Auto-ARIMA works by conducting differencing tests to determine the order of differencing, d, and then fitting models with parameters in defined ranges, e.g., start_p, max_p as well as start_q, max_q. The candidate models are compared by an information criterion such as the AIC. With the seasonal option enabled, the process also tries to identify the optimal hyperparameters for the seasonal components of the model, and we can define parameter ranges for the seasonal part as well.
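The idea of fitting several candidate orders and keeping the one with the lowest information criterion can be sketched in a few lines. The example below is a deliberately reduced toy version: it only searches over the AR order p, fits each model by least squares, and uses the Gaussian AIC up to a constant. The helper `ar_aic` and the simulated series are assumptions for illustration and do not reflect how pmdarima is implemented.

```python
import numpy as np

def ar_aic(y, p, max_p):
    """Fit AR(p) with intercept by least squares and return n*ln(RSS/n) + 2k."""
    y = np.asarray(y, dtype=float)
    target = y[max_p:]  # same observations for every p, so the scores are comparable
    cols = [np.ones_like(target)] + [y[max_p - k: len(y) - k] for k in range(1, p + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    n = len(target)
    k = p + 2  # AR coefficients + intercept + noise variance
    return n * np.log(rss / n) + 2 * k

# Simulate a hypothetical AR(2) process
rng = np.random.default_rng(42)
n = 2000
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + eps[t]

max_p = 3
scores = {p: ar_aic(y, p, max_p) for p in range(max_p + 1)}
best_p = min(scores, key=scores.get)
print(scores)
print("order with the lowest AIC:", best_p)
```

The penalty term 2k is what keeps the search from always preferring the largest model.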

Creating a Sales Forecast with ARIMA in Python

Having grasped the fundamental concepts behind ARIMA (AutoRegressive Integrated Moving Average), we’re now ready to dive into the practical aspect of crafting a sales forecasting model in Python. Utilizing ARIMA for forecasting sales data is an esteemed practice owing to the algorithm’s adeptness in modeling seasonal changes combined with long-term trends – a characteristic commonly exhibited by sales data.

In this tutorial, we’ll be employing a dataset representing the monthly beer sales across the United States from 1992 through 2018, recorded in millions of US dollars. Our objective is to construct a robust time series model using ARIMA to accurately predict future sales trends.

When it comes to the technological aspect, we’ll be using the Python-based ‘statsmodels’ and ‘pmdarima’ libraries to build our ARIMA sales forecasting model. So, if you’re ready to harness the power of Python and ARIMA for sales prediction, let’s get started!

The code is available on the GitHub repository.

A fluffy cat drinking beer after creating an ARIMA sales forecast. Image created with Midjourney

Prerequisites

Before we start coding, ensure you have set up your Python 3 environment and required packages. If you don’t have an environment, you can follow this tutorial to set up the Anaconda environment.

Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages: pandas, NumPy, matplotlib, and seaborn.

In addition, we will be using the statsmodels library and pmdarima.

You can install packages using console commands:

pip install
conda install (if you are using the Anaconda package manager)

Step #1 Load the Sales Data to Our Python Project

In the initial step of this tutorial, we commence by setting up the necessary Python environment. We import several packages that we’ll be using for data manipulation, visualization, and implementing machine learning models. We then fetch the dataset we’ll be working with – the monthly beer sales in the United States from 1992 through 2018. This data is sourced from a publicly accessible URL and loaded into a pandas DataFrame.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.88

# Setting up packages for data manipulation and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose
import seaborn as sns
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# Link to the dataset: 
# https://www.kaggle.com/bulentsiyah/for-simple-exercises-time-series-forecasting

path = "https://raw.githubusercontent.com/flo7up/relataly_data/main/alcohol_sales/BeerWineLiquor.csv"
df = pd.read_csv(path)
df.head()

		date	beer
0	1/1/1992	1509
1	2/1/1992	1541
2	3/1/1992	1597
3	4/1/1992	1675
4	5/1/1992	1822

As shown above, the sales figures in this dataset stem from the first day of each month.
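Because the dates are stored as text in month/day/year order, it can help to parse them explicitly and confirm the monthly frequency. The following is a small sketch that rebuilds the five rows shown above in memory; the comma-separated layout and the variable names are assumptions made for illustration.

```python
import io
import pandas as pd

# The first five rows from the output above, rebuilt as an in-memory CSV
sample_csv = io.StringIO(
    "date,beer\n"
    "1/1/1992,1509\n"
    "2/1/1992,1541\n"
    "3/1/1992,1597\n"
    "4/1/1992,1675\n"
    "5/1/1992,1822\n"
)
sample = pd.read_csv(sample_csv)

# Parse the dates explicitly as month/day/year and use them as index
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")
sample = sample.set_index("date")

# "MS" stands for month start, i.e., one observation on the first day of each month
inferred = pd.infer_freq(sample.index)
print(inferred)
```

An explicit format avoids any ambiguity between day-first and month-first dates.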

Step #2 Visualize the Time Series and Check it for Stationarity

Before modeling the sales data, we visualize the time series and test it for Stationarity. Visualization helps us choose the parameters for our ARIMA model, thus making it an essential step.

First, we will look at the different components of the time series. We do this by using the seasonal_decompose function of the statsmodels library.

# Decompose the time series
plt.rcParams["figure.figsize"] = (10,6)
result = seasonal_decompose(df['beer'], model='multiplicative', period = 12)
result.plot()
plt.show()

To test for Stationarity, we use the ADFuller test. It is common to run this test multiple times throughout a data science project. Therefore, we create a function that we can then reuse later.

def check_stationarity(df_sales, title_string, labels):
    # Visualize the data
    fig, ax = plt.subplots(figsize=(16, 8))
    plt.title(title_string, fontsize=14)
    if df_sales.index.size > 12:
        # Add 12- and 25-month moving averages to make the trend visible
        df_sales['ma_12_month'] = df_sales['beer'].rolling(window=12).mean()
        df_sales['ma_25_month'] = df_sales['beer'].rolling(window=25).mean()
        # The column order matches the order of the legend labels passed in
        sns.lineplot(data=df_sales[['beer', 'ma_12_month', 'ma_25_month']], palette=sns.color_palette("mako_r", 3))
        plt.legend(title='Series', loc='upper left', labels=labels)
    else:
        sns.lineplot(data=df_sales[['beer']])
    
    plt.show()
    
    sales = df_sales['beer'].dropna()
    # Perform an Augmented Dickey-Fuller test
    # alpha = .05 is the significance level of the test
    adf_test = pm.arima.ADFTest(alpha = 0.05) 
    # should_diff returns a tuple: (p-value, whether differencing is recommended)
    print(adf_test.should_diff(sales))
    
df_sales = pd.DataFrame(df['beer'], columns=['beer'])
df_sales.index = pd.to_datetime(df['date']) 
title = "Beer sales in the US between 1992 and 2018 in million US$/month"
labels = ['beer', 'ma_12_month', 'ma_25_month']
check_stationarity(df_sales, title, labels)

The data does not appear to be stationary. We can see that our time series is steadily increasing and shows annual seasonality. The steady increase indicates continuous growth in beer sales over the last decades. The seasonality in the sales data likely results from people drinking more beer in summer than in other seasons.
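The 12-month moving average in the chart is useful because a window of exactly one seasonal period averages the seasonality out and leaves the trend. Here is a minimal sketch on an invented series (level, slope, and seasonal pattern are assumptions chosen so that the effect is exact):

```python
import numpy as np
import pandas as pd

months = pd.date_range("2000-01-01", periods=48, freq="MS")
# Hypothetical seasonal pattern that sums to zero over one year
seasonal = np.tile([-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2], 4).astype(float)
trend = 100.0 + 1.0 * np.arange(48)  # grows by 1 per month
series = pd.Series(trend + seasonal, index=months)

# A 12-month window always contains each month exactly once
ma_12 = series.rolling(window=12).mean().dropna()
print(ma_12.diff().dropna().unique())  # only the monthly trend increase of 1.0 is left
```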

Step #3 Exemplary Differencing and Autocorrelation

The chart from the previous section shows that our time series is non-stationary. The reason is that it follows a clear upward trend. We also know that the time series has a seasonal component. Therefore, we need to define additional parameters and construct a SARIMA model.

Before we use auto-correlation to determine the optimal parameters, we will try manual differencing to make the time series stationary. There is no guarantee that differencing works. It is essential to remember that differencing can sometimes also worsen prediction performance. So be careful not to overdifference! We could also trust that the auto-ARIMA model chooses the best parameters for us. However, we should always validate the selected parameters.

The ideal differencing parameter is the least number of differencing steps to achieve a stationary time series. We will monitor the results with autocorrelation plots to check whether differencing was successful.

We print the autocorrelation for the original time series and after the first and second-order differencing.

# 3.1 Non-seasonal part
def auto_correlation(df, prefix, lags):
    plt.rcParams.update({'figure.figsize':(7,7), 'figure.dpi':120})
    
    # Define the plot grid
    fig, axes = plt.subplots(3,2, sharex=False)

    # Original Series
    axes[0, 0].plot(df)
    axes[0, 0].set_title('Original' + prefix)
    plot_acf(df, lags=lags, ax=axes[0, 1])

    # First Difference
    df_first_diff = df.diff().dropna()
    axes[1, 0].plot(df_first_diff)
    axes[1, 0].set_title('First Order Difference' + prefix)
    plot_acf(df_first_diff, lags=lags - 1, ax=axes[1, 1])

    # Second Difference
    df_second_diff = df.diff().diff().dropna()
    axes[2, 0].plot(df_second_diff)
    axes[2, 0].set_title('Second Order Difference' + prefix)
    plot_acf(df_second_diff, lags=lags - 2, ax=axes[2, 1])
    plt.tight_layout()
    plt.show()
    
auto_correlation(df_sales['beer'], '', 10)

(0.019143247561160443, False)

The charts above show that the time series becomes stationary after first-order differencing. However, we can see that the autocorrelation drops into negative territory already at the first lags, which indicates overdifferencing.
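Why a strongly negative autocorrelation hints at overdifferencing can be reproduced with simulated data. In the sketch below, we difference a series that is already stationary (white noise) and compute the lag-1 autocorrelation by hand; the helper `acf_at_lag` and the simulated series are assumptions for illustration.

```python
import numpy as np

def acf_at_lag(x, lag):
    """Sample autocorrelation of x at a given lag (lag must be >= 1)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

rng = np.random.default_rng(1)
white_noise = rng.normal(size=20000)    # already stationary, needs no differencing
over_differenced = np.diff(white_noise)

print(round(acf_at_lag(white_noise, 1), 3))       # close to 0
print(round(acf_at_lag(over_differenced, 1), 3))  # close to -0.5
```

Differencing a series that did not need it introduces an artificial negative dependency between neighboring values, which then has to be modeled away again.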

Next, we perform the same procedure for the seasonal part of our time series.

# 3.2 Seasonal part

# Reduce the timeframe to a single seasonal period
df_sales_s = df_sales['beer'][0:12]

# Autocorrelation for the seasonal part
auto_correlation(df_sales_s, '', 10)

# Check if the first difference of the seasonal period is stationary
# The differenced series keeps the monthly DatetimeIndex of df_sales_s
df_diff = pd.DataFrame(df_sales_s.diff())
check_stationarity(df_diff, "First Difference (Seasonal)", ['difference'])

(0.99, True)

After first-order differencing, the seasonal part of the time series is stationary. The autocorrelation plot shows that the values go into the negative but remain within acceptable boundaries. Second-order differencing does not seem to improve these values. Consequently, we conclude that first-order differencing is a good choice for the D parameter.

Step #4 Finding an Optimal Model with Auto-ARIMA

Next, we auto-fit an ARIMA model to our time series. Before we do so, we split our dataset into train and test. In this way, we ensure that we can later measure the performance of our model against a fresh set of data that the model has not seen so far.

Once we have created the train and test data sets, we can configure the parameters for the auto_arima stepwise optimization. By setting max_d = 3, we allow the model to test up to third-order differencing. Also, we set max_p and max_q to 3.

To deal with the seasonality in our time series, we set the “seasonal” parameter to True and the “m” parameter to 12 data points. This turns our model into a SARIMA model that allows us to configure additional D, P, and Q parameters. We define a max value for Q and P of 3. Previously, we have already seen that further seasonal differencing does not improve the Stationarity. Therefore, we limit D with max_D = 2 and expect the search to settle on D = 1.

After configuring the parameters, we next fit the model to the time series. The model will try to find the optimal parameters and choose the model with the least AIC.

# split into train and test
pred_periods = 30
split_number = df_sales['beer'].count() - pred_periods # corresponds to a prediction horizon of 2.5 years
df_train = pd.DataFrame(df_sales['beer'][:split_number]).rename(columns={'beer':'y_train'})
df_test = pd.DataFrame(df_sales['beer'][split_number:]).rename(columns={'beer':'y_test'})

# auto_arima
model_fit = pm.auto_arima(df_train, test='adf', 
                         max_p=3, max_d=3, max_q=3, 
                         seasonal=True, m=12,
                         max_P=3, max_D=2, max_Q=3,
                         trace=True,
                         error_action='ignore',  
                         suppress_warnings=True, 
                         stepwise=True)

# summarize the model characteristics
print(model_fit.summary())

Performing stepwise search to minimize aic
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=inf, Time=3.89 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=3383.210, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=3351.655, Time=0.38 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=3364.350, Time=1.09 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=3604.145, Time=0.02 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=3349.908, Time=0.11 sec
 ARIMA(1,0,0)(0,1,1)[12] intercept   : AIC=3351.532, Time=0.29 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=3353.520, Time=1.24 sec
 ARIMA(2,0,0)(0,1,0)[12] intercept   : AIC=3312.656, Time=0.10 sec
 ARIMA(2,0,0)(1,1,0)[12] intercept   : AIC=3314.483, Time=0.57 sec
 ARIMA(2,0,0)(0,1,1)[12] intercept   : AIC=3314.378, Time=0.30 sec
 ARIMA(2,0,0)(1,1,1)[12] intercept   : AIC=3305.552, Time=3.02 sec
 ARIMA(2,0,0)(2,1,1)[12] intercept   : AIC=3291.425, Time=4.19 sec
 ARIMA(2,0,0)(2,1,0)[12] intercept   : AIC=3306.914, Time=3.06 sec
 ARIMA(2,0,0)(3,1,1)[12] intercept   : AIC=3276.501, Time=4.67 sec
 ARIMA(2,0,0)(3,1,0)[12] intercept   : AIC=3282.240, Time=5.24 sec
 ARIMA(2,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=7.39 sec
 ARIMA(2,0,0)(2,1,2)[12] intercept   : AIC=inf, Time=4.74 sec
 ARIMA(1,0,0)(3,1,1)[12] intercept   : AIC=3313.877, Time=5.17 sec
 ARIMA(3,0,0)(3,1,1)[12] intercept   : AIC=3246.820, Time=5.72 sec
 ARIMA(3,0,0)(2,1,1)[12] intercept   : AIC=3255.313, Time=5.33 sec
 ARIMA(3,0,0)(3,1,0)[12] intercept   : AIC=3249.998, Time=6.77 sec
 ARIMA(3,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=8.39 sec
 ARIMA(3,0,0)(2,1,0)[12] intercept   : AIC=3259.938, Time=3.55 sec
...
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Auto-ARIMA has determined that the best model is (3,0,0)(3,1,1)[12], the candidate with the lowest AIC in the trace shown above (3246.820). These results match the results from Step #3, in which we manually performed differencing.
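If you want to double-check the selection, the trace can also be read programmatically. The sketch below parses a few of the trace lines printed above (copied verbatim, only a subset) and picks the candidate with the lowest finite AIC; the regular expression and variable names are assumptions for illustration.

```python
import math
import re

# A subset of the trace lines from the auto_arima output above
trace = """\
 ARIMA(2,0,2)(1,1,1)[12] intercept   : AIC=inf, Time=3.89 sec
 ARIMA(2,0,0)(3,1,1)[12] intercept   : AIC=3276.501, Time=4.67 sec
 ARIMA(3,0,0)(3,1,1)[12] intercept   : AIC=3246.820, Time=5.72 sec
 ARIMA(3,0,0)(3,1,0)[12] intercept   : AIC=3249.998, Time=6.77 sec
 ARIMA(3,0,0)(3,1,2)[12] intercept   : AIC=inf, Time=8.39 sec
"""

pattern = re.compile(r"(ARIMA\([^\]]+\])\s.*AIC=([\w.]+),")
candidates = {}
for line in trace.splitlines():
    match = pattern.search(line)
    if match:
        # float("inf") is valid, so failed fits are parsed as infinity
        candidates[match.group(1)] = float(match.group(2))

# Ignore candidates whose fit failed (AIC=inf) and keep the lowest AIC
finite = {name: aic for name, aic in candidates.items() if math.isfinite(aic)}
best = min(finite, key=finite.get)
print(best, finite[best])
```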

Step #5 Simulate the Time Series using in-sample Forecasting

Now that we have trained our model, we want to use it to simulate the entire time series. We will do this by calling the predict_in_sample method. The prediction covers the same period as the time series with which we trained the model. Because the model only predicts one step ahead at a time, the prediction results will naturally be close to the actual values.

# Generate in-sample Predictions
# The parameter dynamic=False means that the model makes predictions upon the lagged values.
# This means that the model is trained until a point in the time-series and then tries to predict the next value.
pred = model_fit.predict_in_sample(dynamic=False) # works only with auto-arima
df_train['y_train_pred'] = pred

# Calculate the percentage difference
# (the training column is named 'y_train', see the train/test split above)
df_train['diff_percent'] = abs((df_train['y_train'] - pred) / df_train['y_train'])* 100

# Print the predicted time-series
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("In Sample Sales Prediction", fontsize=14)
sns.lineplot(data=df_train[['y_train', 'y_train_pred']], linewidth=1.0)

# Print percentage prediction errors on a separate axis (ax2)
ax2 = ax1.twinx() 
ax2.set_ylabel('Prediction Errors in %', color='purple', fontsize=14)  
ax2.set_ylim([0, 50])
ax2.bar(height=df_train['diff_percent'][20:], x=df_train.index[20:], width=20, color='purple', label='absolute errors')
plt.legend()
plt.show()
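Why one-step-ahead predictions stay so close to the actual values becomes clear when we compare them with dynamic predictions on a toy example. The sketch below assumes a hypothetical, already-fitted AR(1) model with coefficient 0.5; the numbers are made up for illustration.

```python
import numpy as np

phi = 0.5  # hypothetical, already-fitted AR(1) coefficient
actual = np.array([1.0, 2.0, 3.0, 4.0])

# One-step-ahead: each prediction uses the observed previous value
one_step = phi * actual[:-1]

# Dynamic: after the first step, each prediction builds on the previous prediction
dynamic = np.empty(len(actual) - 1)
previous = actual[0]
for i in range(len(dynamic)):
    previous = phi * previous
    dynamic[i] = previous

print(one_step)  # 0.5, 1.0, 1.5
print(dynamic)   # 0.5, 0.25, 0.125
```

The one-step predictions are corrected by fresh observations at every step, while the dynamic predictions drift away because errors accumulate.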

Next, we take a look at the prediction errors.

Step #6 Generate and Visualize a Sales Forecast

Now that we have trained an optimal model, we are ready to generate a sales forecast. First, we specify the number of periods that we want to predict. The predictions are then stored next to the test data, whose index directly continues the training period, so that we can compare forecast and actual values.

# Generate prediction for n periods, 
# Predictions start right after the last date of the training data
test_pred = model_fit.predict(n_periods=pred_periods, dynamic=False)
# Assign by position so that the values line up with the test dates
df_test['y_test_pred'] = np.asarray(test_pred)
df_union = pd.concat([df_train, df_test])

# Print the predicted time-series
fig, ax = plt.subplots(figsize=(16, 8))
plt.title("Test/Pred Comparison", fontsize=14)
sns.despine();
sns.lineplot(data=df_union[['y_train', 'y_train_pred', 'y_test', 'y_test_pred']], linewidth=1.0, dashes=False, palette='muted')
ax.set_xlim([df_union.index[150],df_union.index.max()])
plt.legend()
plt.show()

As shown above, our model’s forecast continues the seasonal pattern of the beer sales time series. On the one hand, this indicates that US beer sales will continue to rise and, on the other hand, that our model works just fine 🙂
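When we forecast beyond the available data, there is no test set that supplies the dates, so we need to build an index that continues the original time series. Here is a minimal sketch with pandas; the names `last_observed`, `n_periods`, and `prediction_index` are illustrative, and December 2018 as the last observed month is an assumption based on the dataset description.

```python
import pandas as pd

last_observed = pd.Timestamp("2018-12-01")  # assumed last month covered by the dataset
n_periods = 3                               # hypothetical forecast horizon

# Start one month after the last observation and continue in month-start steps
prediction_index = pd.date_range(
    start=last_observed + pd.offsets.MonthBegin(1),
    periods=n_periods,
    freq="MS",
)
print(prediction_index)
```

The resulting index can then be attached to the array of predicted values before plotting.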

Step #7 Measure the Performance of the Sales Forecasting Model

In this section, we will measure the performance of our ARIMA model. To learn more about this topic, check out this relataly article on measuring regression performance.

The previous section’s simulation chart shows a few outliers among the prediction errors. Therefore, we focus our analysis on the percentage errors. Two helpful metrics are the mean absolute percentage error (MAPE) and the median absolute percentage error (MDAPE).

# Mean Absolute Percentage Error (MAPE)
MAPE = np.mean((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test']))) * 100
print(f'Mean Absolute Percentage Error (MAPE): {np.round(MAPE, 2)} %')

# Median Absolute Percentage Error (MDAPE)
MDAPE = np.median((np.abs(np.subtract(df_test['y_test'], df_test['y_test_pred'])/ df_test['y_test'])) ) * 100
print(f'Median Absolute Percentage Error (MDAPE): {np.round(MDAPE, 2)} %')

Mean Absolute Percentage Error (MAPE): 3.94 %
Median Absolute Percentage Error (MDAPE): 3.49 %

The percent errors show that our ARIMA model achieves a decent predictive performance.
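Since these two metrics tend to be reused, it can be convenient to wrap them in small functions. The following sketch uses made-up numbers to show how the two metrics react to the same errors; the function names are assumptions for illustration.

```python
import numpy as np

def absolute_percentage_errors(actual, predicted):
    """Absolute errors relative to the actual values, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs((actual - predicted) / actual) * 100

def mape(actual, predicted):
    """Mean absolute percentage error."""
    return float(np.mean(absolute_percentage_errors(actual, predicted)))

def mdape(actual, predicted):
    """Median absolute percentage error, less sensitive to outliers."""
    return float(np.median(absolute_percentage_errors(actual, predicted)))

actual = [100, 200, 400]
predicted = [110, 180, 400]
print(mape(actual, predicted))   # errors are 10 %, 10 %, and 0 %, so the mean is about 6.67
print(mdape(actual, predicted))  # the median of those errors is 10
```

Note that both metrics are undefined when an actual value is zero, which is not an issue for the beer sales data.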

Summary

This Python tutorial has shown how to use SARIMA for sales forecasting. Sales forecasting is important for businesses because it can help them to make informed decisions about production, inventory management, and staffing, among other things. By accurately forecasting sales, businesses can ensure that they have the right amount of product available to meet customer demand, avoid overproduction and excess inventory, and plan for future growth. The use case presented was forecasting beer sales, and we have used ARIMA to analyze seasonal sales data.

In the first part, we have learned how ARIMA works, what Stationarity is and how to check if a time series is stationary. In the second part, we developed an ARIMA model in Python to create a forecast for US beer sales. For this purpose, we created an in-sample forecast and used Auto-ARIMA to find the optimal parameters for our sales forecasting model.

If you have any questions or suggestions, please let me know in the comments, and I will do my best to answer.

Now that you have learned to use ARIMA to forecast beer sales, you really earned yourself a beer. Cheers! Image created with Midjourney

The post Forecasting Beer Sales with ARIMA in Python appeared first on relataly.com.