Tech Archives - relataly.com

Building a Conversational Voice Bot with Azure OpenAI and Python: The Future of Human and Machine Interaction

Florian Follonier — Thu, 08 Feb 2024 21:09:50 +0000

OpenAI and Microsoft have just released a new generation of text-to-speech models that take synthetic speech to a new level. In my latest project I have combined these new models with Azure OpenAI’s ingenuine conversation capacity. The result is a conversational voice bot that uses Generative AI to converse with users in natural spoken language.

This article describes the Python implementation of this project. The bot is designed to understand spoken language and process it through OpenAI GPT-4. It responds with contextually aware dialogue, all in natural-sounding speech. This seamless integration facilitates a conversational flow that feels intuitive and engaging. The voice processing capacities enable users to have meaningful exchanges with the bot as if they were conversing with another person. Testing the bot was a lot of fun. It felt a bit like the iconic scene from Iron Man where the hereo converses with an AI assistant.

Here is an example of the audio response quality:

Also:

Understanding the Voice Bot

The magic begins with the user speaking to the bot. Azure Cognitive Services transcribes the spoken words into text, which is then fed into Azure’s OpenAI service. Here, the text is processed, and a response is generated based on the conversation’s context and history. Finally, the text-to-speech model transforms the response back into speech, completing the cycle of interaction. This process showcases the potential of AI in understanding and participating in human-like conversations.

Prerequisites & Azure Service Integration

Our conversational voice bot is built upon two pivotal Azure services: Cognitive Speech Services and OpenAI. Billing of these services is pay-per-use. Unless you process large numbers of requests, the costs for experimenting with these services is relatively low.

Azure Cognitive Speech Services

Azure AI Speech Services (previously Cognitive Speech Services) provide the tools necessary for speech-to-text conversion, enabling our voice bot to understand spoken language. This service boasts advanced speech recognition capabilities, ensuring accurate transcription of user speech into text. Furthermore, it powers the text-to-speech synthesis that transforms generated text responses back into natural-sounding voice. This allows for a truly conversational experience.

The newest generation of OpenAI text-to-speech models is now also availbale in Azure AI Speech. These models can synthesize voices in unknown level of quality. I am most impressed by its capability to alter intonation dynamically to express emotions.

Azure OpenAI Service

At the heart of our project lies Azure’s OpenAI service, which uses the power of models like GPT-4 context-aware responses. Once Azure Cognitive Speech Services transcribe the user’s speech into text, this text is sent to OpenAI. The OpenAI model then processes the input and generates a completion. The service’s ability to understand context and generate engaging responses is what makes our voice bot remarkably human-like.

Implementation: Detailed Code Walkthrough

Let’s start with the implementation! We kick things off with Azure Service Authentication, where we set up our conversational voice bot to communicate with Azure and OpenAI’s advanced services. Then, Speech Recognition steps in, acting as our bot’s ears, converting spoken words into text. Next up, Processing and Response Generation uses OpenAI’s GPT-4 to turn text into context-aware responses. Speech Synthesis then gives our bot a voice, transforming text responses back into spoken words for a natural conversational flow. Finally, Managing the Conversation keeps the dialogue coherent and engaging. Through these steps, we create a voice bot that offers an intuitive and engaging conversational experience. Let’s discuss these sections one by one.

As always, you can find the code on github:

View on GitHub Relataly Github Repo

Step #1 Azure Service Authentication

First off, we kick things off by getting our ducks in a row with Azure Service Authentication. This is where the magic starts, setting the stage for our conversational voice bot to interact with Azure’s brainy suite of Cognitive Services and the fantastic OpenAI models. By fetching API keys and setting up our service regions, we’re essentially giving our project the keys to the kingdom.

For using dotenv, you need to create an .env file in your root folder. Here is more information on how this works.

import os
from dotenv import load_dotenv
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

# Load environment variables from .env file
load_dotenv()

# Constants from .env file
SPEECH_KEY = os.getenv('SPEECH_KEY')
SERVICE_REGION = os.getenv('SERVICE_REGION')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ENDPOINT = os.getenv('OPENAI_ENDPOINT')

# Azure Speech Configuration
speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SERVICE_REGION)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
speech_config.speech_recognition_language="en-US"

# OpenAI Configuration
openai_client = AzureOpenAI(
    api_key=OPENAI_API_KEY,
    api_version="2023-12-01-preview",
    azure_endpoint=OPENAI_ENDPOINT
)

Step #2 Speech Recognition

The user’s spoken input is captured and converted into text using Azure’s Speech-to-Text service. This involves initializing the speech recognition service with Azure credentials and configuring it to listen for and transcribe spoken language in real-time.

def recognize_from_microphone():
    # Configure the recognizer to use the default microphone.
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    # Create a speech recognizer with the specified audio and speech configuration.
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    print("Speak into your microphone.")
    # Perform speech recognition and wait for a single utterance.
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    # Process the recognition result based on its reason.
    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
        # Return the recognized text if speech was recognized.
        return speech_recognition_result.text
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
    # Return 'error' if recognition failed or was canceled.
    return 'error'

Step #3 Processing and Response Generation

Once we’ve got your words neatly transcribed, it’s time for the Processing and Response Generation phase. This is where OpenAI steps in, acting like the brain behind the operation. It takes your spoken words, now in text form, and churns out responses that are nothing short of conversational gold. We nudge OpenAI’s GPT-4 into generating replies that feel as natural as chatting with a close friend over coffee.

def openai_request(conversation, sample = [], temperature=0.9, model_engine='gpt-4'):
    # Initialize AzureOpenAI client with keys and endpoints from Key Vault.
    
    
    # Send a request to Azure OpenAI with the conversation context and get a response.
    response = openai_client.chat.completions.create(model=model_engine, messages=conversation, temperature=temperature, max_tokens=500)
    return response.choices[0].message.content

Step #4 Speech Synthesis

Next up, we tackle Speech Synthesis. If the previous step was the brain, consider this the voice of our operation. Taking the AI-generated text, we transform it back into speech—like turning lead into gold, but for conversations. Through Azure’s Text-to-Speech service, we’re able to give our bot a voice that’s not only clear but also surprisingly human-like.

def synthesize_audio(input_text):
    # Define SSML (Speech Synthesis Markup Language) for input text.
    ssml = f"""
        
            
                
                    {input_text}
                
            
        
        """
    
    audio_filename_path = "audio/ssml_output.wav"  # Define the output audio file path.
    print(ssml)
    # Synthesize speech from the SSML and wait for completion.
    result = speech_synthesizer.speak_ssml_async(ssml).get()

    # Save the synthesized audio to a file if synthesis was successful.
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        with open(audio_filename_path, "wb") as audio_file:
            audio_file.write(result.audio_data)
        print(f"Speech synthesized and saved to {audio_filename_path}")
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print(f"Speech synthesis canceled: {cancellation_details.reason}")
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print(f"Error details: {cancellation_details.error_details}")


# Create the audio directory if it doesn't exist.
if not os.path.exists('audio'):
    os.makedirs('audio')

Step #5 Managing the Conversation

Finally, we bring it all together in the Managing the Conversation step. This is where we ensure the chat keeps rolling, looping through listening, thinking, and speaking. We keep track of what’s been said to keep the conversation relevant and engaging.

The system message below makes the bot talk like a pirate. But you can easily adjust the system message and this way customize the bots behavior.

conversation=[{"role": "system", "content": "You are a helpful assistant that talks like pirate. If you encounter any issues, just tell a pirate joke or a story."}]

while True:
    user_input = recognize_from_microphone()  # Recognize user input from the microphone.
    conversation.append({"role": "user", "content": user_input})  # Add user input to the conversation context.

    assistant_response = openai_request(conversation)  # Get the assistant's response based on the conversation.

    conversation.append({"role": "assistant", "content": assistant_response})  # Add the assistant's response to the context.
    
    print(assistant_response)
    synthesize_audio(assistant_response)  # Synthesize the assistant's response into audio.

Throughout these steps, the conversation’s context is managed meticulously to ensure continuity and relevance in the bot’s responses, making the interaction feel more like a dialogue between humans.

Current Challenges

Despite the promising capabilities of our voice bot, the journey through its development and interaction has presented a few challenges that underscore the complexity of human-machine communication.

Slow Response Time

One of the notable hurdles is the slow response time experienced during interactions. The process, from speech recognition through to response generation and back to speech synthesis, involves several steps that can introduce latency. This latency can detract from the user experience, as fluid conversations typically require quick exchanges. Optimizing the interaction flow and exploring more efficient data processing methods may mitigate this issue in the future.

Handling Pauses in Speech

Another challenge lies in the bot’s handling of longer pauses while speaking. The current setup does not always allow users to pause thoughtfully without triggering the end of their input. This may sometimes lead to a situation where the model cuts off speech prematurely. This limitation points to the need for more sophisticated speech recognition algorithms capable of distinguishing between a conversational pause and the end of an utterance.

Summary

This article has shown how you can build a conversational voice bot in Python with the latest pretrained AI models. The project showcases the incredible potential of combining Azure Cognitive Services with OpenAI’s conversational models. I hope by now you understand the technical feasibility of creating voice-based applications and how they open up a world of possibilities for human-machine interaction. As we continue to refine and enhance this technology, the line between talking to a machine and conversing with a human will become ever more blurred, leading us into a future where AI companionship becomes reality.

This exploration of Azure Cognitive Services and OpenAI’s integration within a voice bot is just the beginning. As AI continues to evolve, the ways in which we interact with technology will undoubtedly transform, making our interactions more natural, intuitive, and, most importantly, human.

Also: 9 Business Use Cases of OpenAI’s ChatGPT

Sources and Further Reading

The post Building a Conversational Voice Bot with Azure OpenAI and Python: The Future of Human and Machine Interaction appeared first on relataly.com.

How to Automatize your Twitter News Account with OpenAI ChatGPT and NewsAPI in Python

Florian Follonier — Fri, 05 May 2023 22:15:20 +0000

It’s no secret that Large Language Models (LLMs) are a powerful tool for automating social media tasks. Not only can they be used to curate relevant content that matches your audience’s interests, but also can they create the content and tailor them to the interests of your customers. This article describes how to create a Twitter News Bot such as the one that now automates sharing news updates on the relataly Twitter channel. The bot uses the NewsAPI and OpenAI’s chatGPT 3.5 model to retrieve and share relevant news updates. This tutorial will equip you with the skills to build a similar Twitter News Bot in Python that provides news on your own Twitter account.

We’ll start by exploring NewsAPI and show you how to use it to fetch news articles from various sources. Then we’ll use OpenAI to enhance the bot’s capabilities. By refining the news selection process and generating engaging tweets using ChatGPT, you can ensure your updates are unique, informative, and captivating. We’ll provide clear explanations and code samples to help you succeed. By the end, you’ll have a powerful News Bot that can deliver curated news updates fully automated. But the best thing is that it is so efficient that you can run it in the cloud almost for free. So, let’s dive in and get started on creating an impressive NewsBot using OpenAI and NewsAPI in Python!

This is what the posts from the bot look like:

Tweets by relataly

The Relataly Twitter Account runs the Same News Bot described in this tutorial. It leverages a serverless cloud architecture based on Azure Functions.

Also relevant:

The Architecture of the Twitter NewsBot

Let’s begin with an overview of what the NewsBot looks like. The illustration below shows the architecture.

The architecture of the Relataly News Bot.

The modular architecture of the bot makes it easy to deploy as microservices to the cloud, just like relataly does with Azure functions. This approach also makes it simple to customize the bot by changing topics or defining a different style for the tweet creation process.

Fetching News

The bot is scheduled to run regularly, typically every hour. During each call, it retrieves the latest news articles from the NewsAPI. The NewsAPI allows filtering for specific categories:

business entertainment
general
health
science
sports
technology

The news bot only news related to “technology.” However, this is not specific enough, as we only want to retrieve news related to AI, data science, and Machine Learning. To address this, we rely on OpenAI to determine which news is relevant. ChatGPT provides us with a list that identifies the relevant news with True and irrelevant news with False. In addition, the bot tracks which news articles have already been shared in a CSV file. This ensures that the same news is not shared multiple times.

Creating News Tweets

Once we have the relevant news, the bot hands over the information to the second component responsible for creating the tweets. For each post, the bot makes an API call to OpenAI GPT, which generates a tweet based on a prompt that combines the article title, description, and URL. The resulting tweet is then sent to the Twitter API, which adds it to the account’s feed.

By automating the process of sharing news updates on Twitter, you can save time and focus on more important tasks. Additionally, using OpenAI to generate engaging tweets can help increase social media engagement and attract more followers.

Also:

Customization Options

You can easily customize the bot by changing the relevant topics or by defining another style in which OpenAI creates the tweets. By automating the process of sharing news updates on Twitter, you can save time and focus on more essential tasks. Additionally, using OpenAI to generate engaging tweets can help increase your social media engagement and attract more followers. In the next section, let’s look at the APIs used in this architecture and how you gain access.

Also: ChatGPT Style Guide: Understanding Voice and Tone Prompt Options for Engaging Conversations

APIs used in this Tutorial

To use the functionalities of our application, you will need to obtain the following API keys: NewsAPI, OpenAI API, and Twitter API. Without an API key, the API calls will lead to an error, so there is no way around signing up.

NewsAPI.org API

The NewsAPI provides access to over 30,000 news sources from around the world, including major news organizations such as CNN, BBC, and Reuters. We will walk you through how to specify criteria such as sources, keywords, and categories to retrieve relevant news articles. The free tier allows you to retrieve a certain number of news for free.

The NewsAPI key provides access to news articles and headlines from various sources. To obtain the API key, you can sign up for a NewsAPI account on their website at NewsAPI Registration and generate a unique key specifically for your application.

OpenAI API

The OpenAI API key is required to leverage the power of artificial intelligence and natural language processing provided by OpenAI. You can obtain an API key by signing up for an OpenAI account at OpenAI Registration and following their documentation to generate the key associated with your account.

While OpenAI provides powerful language models through its API, it is important to note that there is a cost associated with using the GPT model that depends on the amount of text and the model type you are using. In this tutorial, we will be using the chatGPT turbo model, which is highly cost-efficient. Because we only process tiny bits of text, each inference typically costs a fraction of a cent. However, the amount might increase if you adjust the code and process more text. Check the OpenAI pricing site for the latest price information.

To give you an idea, my monthly openai costs for running the bot are between less than 2 $:

Twitter Developer API

Finally, you will need to obtain Twitter API keys to interact with the Twitter API and perform actions such as posting tweets. These include the API key, API secret key, access token, and access token secret. You can obtain these keys by creating a Twitter Developer account at Twitter Developer Portal, setting up a new application, and generating the required keys within the Twitter Developer dashboard. It can take some days until your application gets approved, but the good thing is that a basic tier allows you to create up to 1500 tweets for free.

Once you have obtained these API keys, you can securely store them in a secure storage solution like Azure Key Vault, as shown in the code example. This ensures that your keys remain confidential and allows your application to access the necessary services seamlessly.

Also: Accessing Remote Data Sources via REST APIs in Python

The relataly Twitter news bot leverages and integrates the following APIs: NewsAPI, OpenAI, and Twitter.

Creating an OpenAI NewsBot for Twitter using NewsAPI in Python

Now that you are familiar with the architecture of the news bot, it’s time to roll up your sleeves and get into the coding. We will guide you through the necessary steps to create an OpenAI News Bot for Twitter using Python. By leveraging the advanced language model of OpenAI and the vast resources available through NewsAPI, we will develop a sophisticated bot that automatically fetches and tweets the latest news updates on Twitter. Our bot will go beyond basic news retrieval and have the ability to curate relevant news articles within a specific scope and generate compelling tweets that stand out in the noise of social media. Let’s get started!

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Update August 2023: If you are wondering what the exact code is for the automation of the relataly twitter account in Azure functions, I have published it in a separate GitHub repository. In the meantime, i have slightly adjusted the code and added functionalities for retrieveing articles from hackernews and posting fact tweets.

OpenAI, with its language models and natural language processing capabilities, can be a valuable tool in automating Social media tasks.

Setup and APIs

Before diving into the code, it’s essential to ensure that you have the proper setup for your Python 3 environment and have installed all the necessary packages. If you do not have a Python environment, follow the instructions in this tutorial to set up the Anaconda Python environment. This will provide you with a robust and versatile environment well-suited for machine learning and data science tasks.

In this tutorial, we will be working with the OpenAI library. You can install the OpenAI Python library using console commands:

pip install openai
conda install openai (if you are using the Anaconda packet manager)

Step #1 Imports and Authentication

We begin by making the necessary imports and setting up the API keys required to authenticate against the different APIs. In the example, below I have used the Azure key vault to store the API keys in a secure manner in the cloud. You can also load the API keys from a yaml file or store the API keys in the code (not advised for security reasons).

We retrieve and set API keys and authentication credentials for services like NewsAPI, OpenAI, and Twitter. These keys and credentials are essential for authenticating our application and ensuring secure access to the respective APIs. Additionally, we implement logging functionality to record important information about our program’s execution, enabling us to monitor and troubleshoot as needed.

In addition, we assign a variable for the CSV file name we’ll use to store our news log data.

By executing these steps, we lay the foundation for seamless integration with external APIs and services, enabling our application to perform a wide range of functionalities in subsequent parts of our code.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.9.13, 1.2.0, Pandas 1.3.4, OpenAI 0.27.3, Tweepy 4.13.0, Requests 2.26.0, 

import logging
import pandas as pd
import openai
import tweepy
import csv
import requests

# Set API Keys and Authentification
logging.info('Setting NewsAPI API Key')
NEWSAPI_API_KEY =  # replace with own API key

logging.info('Setting OpenAI API Key')
openai.api_key =  # replace with own API key

logging.info('Setting Twitter API Key')
auth=tweepy.OAuthHandler(,
                         )
auth.set_access_token(,
                      )
twitter_api=tweepy.API(auth)

CSV_NAME = 'news_log.csv'

Step #2 Collecting the News from the NewsAPI

Next, we define a function to retrieve technology news from the NewsAPI.

The function below takes an optional parameter number to specify the desired number of news items. Inside the function, we construct a URL using the NewsAPI endpoint and our API key. We send a request to the URL and receive a JSON response containing the news data. Then, we extract the relevant information, such as the title, description, and URL, from the response.

Using the Pandas library, we create a DataFrame to organize the news items neatly. We filter out any rows with missing values, ensuring we have clean and complete data. Finally, we return the top number news items as a DataFrame.

By utilizing the fetch_news function, we can effortlessly access a collection of up-to-date technology news articles. This allows us to stay informed and integrate the latest news seamlessly into our application.

#### NewsAPI
def fetch_news(number=10):
    # Fetch tech news from NewsAPI
    url = f"https://newsapi.org/v2/top-headlines?country=us&category=technology&apiKey={NEWSAPI_API_KEY}"
    response = requests.get(url).json()
    news_items = response["articles"]
    df = pd.DataFrame(news_items)
    df = df[["title", "description", "url"]].dropna()
    return df.head(number)

Step #3 OpenAI Functions for News Relevance and Tweet Generation

We proceed with the OpenAI integration, which allows us to perform important tasks in our application. Firstly, we utilize the OpenAI GPT-3.5 engine to identify the relevance of news articles based on a specific scope of topics. By constructing prompts and leveraging the GPT-3.5 Turbo model, we can effectively determine the relevance of news articles to enhance the accuracy of our information. If you are interested in retrieving other news, you can easily adjust the prompt and include other topics that you deem relevant for your application.

In addition, we define a clause for the case that the bot does not identify any news as relevant. In this case, the bot will create a fact tweet that describes common terms from the domain of machine learning and data science.

Furthermore, we harness the power of OpenAI to create tweets uniquely and engagingly. By generating prompts that provide a title, description, and a tiny URL, we can generate informative tweets within the 280-character limit. This enables us to share compelling news content while utilizing hashtags to expand the reach of our tweets. As you can see in the function “select_relevant_news_prompt” code, we provide a single sample response to OpenAI ChatGPT. This approach is known as 1-shot learning, and it typically significantly improves the results compared to a prompt without any samples (zero-shot learning).

In the second function, “create_tweet_prompt,” we do not provide any sample response to minimize the number of tokens processed and lower our cost for calling the model. This approach is known as zero-shot learning because the model is not given any sample.

#### OpenAI Engine
def openai_request(instructions, task, sample = [], model_engine='gpt-3.5-turbo'):
    prompt = [{"role": "system", "content": instructions }, 
              {"role": "user", "content": task }]
    prompt = sample + prompt
    completion = openai.ChatCompletion.create(model=model_engine, messages=prompt, temperature=0.2, max_tokens=400)
    return completion.choices[0].message.content


#### Define OpenAI Prompt for news Relevance
def select_relevant_news_prompt(news_articles, topics, n):    
    instructions = f'Your task is to examine a list of News and return a list of boolean values that indicate which of the News are in scope of a list of topics. \
    Return a list of True or False values that indicate the relevance of the News.'
    task =  f"{news_articles} /n {topics}?" 
    sample = [
        {"role": "user", "content": f"[new AI model available from Nvidia, We Exploded the AMD Ryzen 7, Release of b2 Game, XGBoost 3.0 improvices Decision Forest Algorithms, New Zelda Game Now Available, Ukraine Uses a New Weapon] /n {topics}?"},
        {"role": "assistant", "content": "[True, False, False, True, False, False]"},
        {"role": "user", "content": f"[Giga Giraff found in Sounth Africa, We Exploded the AMD Ryzen 7, Release of b2 Game, Donald Trump to make a come back, New Zelda Game Now Available, Ukraine Uses a New Weapon] /n {topics}?"}, 
        {"role": "assistant", "content": "[False, False, False, False, False, False]"}]
    return instructions, task, sample


#### Define OpenAI Prompt for news Relevance
def check_previous_posts_prompt(title, old_posts):    
    instructions = f'Your objective is to compare a news title with a list of previous news and determine whether it covers a similar topic that was already covered by a previous title. \
        Rate the overlap on a scale between 1 and 10 with 1 beeing a full overlap and 10 representing an unrelated topic. "'
    task =  f"'{title}.' Previous News: {old_posts}."
    sample = [
        {"role": "user", "content": "'Nvidia launches new AI model.' Previous News: [new AI model available from Nvidia, We Exploded the AMD Ryzen 7 7800X3D, The Lara Croft Collection For Switch Has Been Rated By The ESRB]."},
        {"role": "assistant", "content": "1"},
        {"role": "user", "content": "'Big Explosion of an AMD Ryzen 7.' Previous News: [Improving Mental Wellbeing Through Physical Activity, The Lara Croft Collection For Switch Has Been Rated By The ESRB]."},
        {"role": "assistant", "content": "10"},
        {"role": "user", "content": "'new AI model available from Google.' Previous News : [new AI model available from Nvidia, The Lara Croft Collection For Switch Has Been Rated By The ESRB]."},
        {"role": "assistant", "content": "9"},
        {"role": "user", "content": "'What Really Made Geoffrey Hinton Into an AI Doomer - WIRED.' Previous News : [Why AI's 'godfather' Geoffrey Hinton quit Google, new AI model available from Nvidia, The Lara Croft Collection For Switch Has Been Rated By The ESRB]."},
        {"role": "assistant", "content": "4"}]
    return instructions, task, sample


#### Define OpenAI Prompt for News Tweet
def create_tweet_prompt(title, description, tiny_url):
    instructions = f'You are a twitter user that creates tweets with a maximum length of 280 characters.'
    task = f"Create an informative tweet on twitter based on the following news title and description. \
        The tweet must use a maximum of 280 characters. \
        Include the {tiny_url}. But do not include any other urls.\
        Title: {title}. \
        Description: {description}. \
        Use hashtags to reach a wider audience. \
        Do not include any emojis in the tweet"
    return instructions, task


#### Define OpenAI Prompt for news Relevance
def previous_post_check(title, old_posts):
    instructions, task, sample = check_previous_posts_prompt(title, old_posts)
    response = openai_request(instructions, task, sample)
    return eval(response)


#### Define OpenAI Prompt for News Tweet
def create_fact_tweet_prompt(old_terms):
    instructions = f'You are a twitter user that creates tweets with a length below 280 characters.'
    task = f"Choose a technical term from the field of AI, machine learning or data science. Then create a twitter tweet that describes the term. Just return a python dictionary with the term and the tweet. "
    # if old terms not empty
    if old_terms != []:
        avoid_terms =f'Avoid the following terms, because you have previously tweetet about them: {old_terms}'
        task = task + avoid_terms
    sample = [
        {"role": "user", "content": f"Choose a technical term from the field of AI, machine learning or data science. Then create a twitter tweet that describes the term. Just return a python dictionary with the term and the tweet."},
        {"role": "assistant", "content": "{'GradientDescent': '#GradientDescent is a popular optimization algorithm used to minimize the error of a model by adjusting its parameters. \
         It works by iteratively calculating the gradient of the error with respect to the parameters and updating them accordingly. #ML'}"}]
    return instructions, task, sample

# Load previous information from a csv file
def get_history_from_csv(csv_name):
    try:
        # try loading the csv file
        df = pd.read_csv(csv_name)
    except:
        # create the csv file
        df = pd.DataFrame(columns=['title'])
        df.to_csv(csv_name, index=False)
    return df

Step #4 Functions for Publishing Twitter Tweets

Next, we will create a series of functions that enable us to create and post tweets based on specific news articles. The first function, check_tweet_length, checks the length of a tweet and ensures it does not exceed the 280-character limit. If the tweet is too long, it returns False; otherwise, it returns True.

Given the character limit of Twitter tweets at only 280 characters, including a URL can often take up a significant amount of that limit. To avoid wasting precious characters on lengthy URLs, we can use an URL-shortening service called TinyURL. This service provides a shortened version of any input URL, allowing us to fit more text within the character limit. The good news is that TinyURL offers an API that we can use without the need for an API key.

The create_tweet function takes the title, description, and URL of a news article as inputs. It generates a tiny URL using the create_tiny_url function and constructs a prompt for tweet creation using the create_tweet_prompt function. The prompt is then sent to the OpenAI engine to generate the tweet content. The generated tweet is checked for length using the check_tweet_length function, and if it passes the length check, it is posted using the Twitter API.

By utilizing these functions together, we can effectively create and post tweets based on news articles while ensuring they meet the length requirements and include relevant information for our audience. This streamlines the process of sharing news updates and engaging with our followers in a concise and informative manner.

def check_tweet_length(tweet):
    return False if len(tweet) > 280 else True


# Create the fact tweet
def create_fact_tweet(chance_for_tweet = 0.5):
    df_old_facts = get_history_from_csv(CSV_FACTS_NAME)

    if random.random() < chance_for_tweet:
        # create a fact tweet
        instructions, tasks, sample = create_fact_tweet_prompt(list(df_old_facts.tail(10)['title']))
        tweet = openai_request(instructions, tasks, sample)
        tweet_text = list(eval(tweet).values())[0]

        # tweet creation
        print(f'Creating fact tweet: {tweet_text}')
        
        # check tweet length and post tweet
        if check_tweet_length(tweet):
            twitter_auth().update_status(tweet_text)
            term = list(eval(tweet).keys())[0]
            # save the fact in the csv file
            with open(f'{CSV_FACTS_NAME}', 'a', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow([term])
        else: 
            print('error tweet too long')
    else:
        print('No fact tweet created')
    

def create_news_tweet(title, description, url):
    # create tiny url
    tiny_url = create_tiny_url(url)

    # define prompt for tweet creation
    instructions, task = create_tweet_prompt(title, description, tiny_url)
    tweet_text = openai_request(instructions, task)

    print(f'Creating new tweet: {tweet_text}')
    # check tweet length and post tweet
    if check_tweet_length(tweet_text):
            status = twitter_auth().update_status(tweet_text)
            print(f"Tweeted: {title}")
            with open(f'{CSV_NEWS_NAME}', 'a', newline='') as file:
                writer = csv.writer(file)
                writer.writerow([title])
    else: 
        status = 'error tweet too long'
    return status


def create_tiny_url(url):
    response = requests.get(f'http://tinyurl.com/api-create.php?url={url}')
    return response.text

Step #5 Bringing it All Together – Running the Bot

Finally, we reach the exciting part where we can run our Twitter news bot. The code below utilizes the functions we’ve created earlier to compose a tweet and post it on Twitter.

First, we call the create_tweet(title, description, url) function, which takes the title, description, and URL of a news article. This function generates a shorter URL and prepares a prompt for the tweet. Next, we generate the actual content of the tweet by using the prompt with the help of the openai_request(instructions, task) function.

Once the tweet is generated and assigned to a variable called tweet, we check its length using the check_tweet_length(tweet) function. This step ensures that the tweet does not exceed the maximum character limit allowed by Twitter. If the tweet is within the allowed limit, we post it on Twitter using the twitter_api.update_status(tweet) function. We also log the ID of the posted tweet for reference.

If the tweet exceeds the character limit, we assign an error message to the status variable.

Finally, the function returns the status variable, which can be used to analyze any potential issues or handle errors.

This code segment brings everything together, allowing our Twitter news bot to automatically create and publish tweets based on news articles. You can wrap this function with a timer function to run the code on a regular interval.

#### Main Bot
def main_bot():
    # Read the old CSV data
    # try opening the csv file and creeate it if it does not exist
    df_old_news = get_history_from_csv(CSV_NEWS_NAME)
    df_old_news = df_old_news.tail(16)
    # Fetch news data
    df = fetch_news(12)
    
    # Check the Relevance of the News and Filter those not relevant
    relevant_topics ="[AI, Machine Learning, Data Science, OpenAI, Artificial Intelligence, Data Mining, Data Analytics]"
    instructions, task, sample = select_relevant_news_prompt(list(df['title']), relevant_topics, len(list(df['title'])))
    relevance = openai_request(instructions, task, sample)
    relevance_list = eval(relevance)

    s = 0
    df = df[relevance_list]
    if len(df) > 0:
        for index, row in df.iterrows():
            if s == 1:
                break
            logging.info('info:' + row['title'])
            title = row['title']
            title = title.replace("'", "")
            description = row['description']
            url = row['url']            
                                             
            if (title not in df_old_news.title.values):
                doublicate_check = previous_post_check(title, list(df_old_news.tail(10)['title']))
                if doublicate_check > 3:
                    # Create a tweet
                    response = create_news_tweet(title, description, url)
            
                else: 
                    print(f"Already tweeted: {title}")
            else: 
                print("No news articles found")
                create_fact_tweet(chance_for_tweet=0.5)

main_bot()

No news articles found
No fact tweet created
No news articles found
Creating fact tweet: Looking for a way to cluster your data? Try #KMeans! It is an unsupervised learning algorithm that groups similar data points together based on their distance to a centroid. #MachineLearning #DataScience
Creating tweet: {'KMeans': 'Looking for a way to cluster your data? Try #KMeans! It is an unsupervised learning algorithm that groups similar data points together based on their distance to a centroid. #MachineLearning #DataScience'}

Hosting the Bot Serverless on Azure Functions

If you want to run the bot fully automated, you will probably want to host it somewhere in the cloud. To host the OpenAI newsbot on Azure, you can follow these steps:

Create an Azure Key Vault to store the credentials securely.
Create an Azure Blob Storage account to keep track of past posts.
Create two Azure Functions. The first function fetches news on a regular interval and checks that only relevant news is posted. This function also ensures that there are no duplicates with the help of OpenAI and a CSV file that keeps track of previous posts. The second function is an HTTP trigger and is used to create the post. It reaches out to OpenAI to generate the post and then makes the call to the Twitter API.
Once the functions are created, deploy the code to Azure Functions.
Configure the Azure Functions to use the credentials from the Key Vault and the Blob Storage account to store the CSV file.

The code for the newsbot is only slightly different from what was presented in the blog post. If you’re interested in the Azure-ready code, feel free to message me.

The Relataly News Bot runs on a Serverless Infrastructure based on Azure Functions

Summary

In this article, we have developed a Twitter news bot that automates news updates on the platform. Our focus was on leveraging OpenAI’s advanced language model to create a powerful bot capable of generating and posting engaging tweets based on news articles. Throughout the discussion, we covered essential aspects such as tweet length validation, concise URL generation, crafting captivating tweet prompts, and seamless integration with the Twitter API for efficient posting.

By following the step-by-step guide presented in this blog, readers were able to harness the capabilities of OpenAI technology and successfully build their own Twitter news bots. This enabled them to effortlessly deliver timely updates to their audience, expanding their reach and fostering meaningful engagement within the Twitter community.

Whether you have questions, suggestions, or insights, let us know in the comments below. Your feedback is invaluable to us as we strive to continuously improve.

Sources

Twitter Developer Portal
OpenAI Portal
NewsAPI Portal
Overview Azure Functions
Tinyurl.com
ChatGPT helped to revise this article.
Images created with Midjourney AI

The post How to Automatize your Twitter News Account with OpenAI ChatGPT and NewsAPI in Python appeared first on relataly.com.

Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI

Florian Follonier — Thu, 09 Mar 2023 23:45:16 +0000

As we enter an era where intelligent systems are increasingly relied upon to make key decisions, responsible AI has become more critical than ever before. It’s not enough to simply rely on data-driven decision-making; we must also ensure that these systems are fair and just. But how can we do this when the same bias and unfairness that exists in our society is reflected in these very algorithms?

The consequences of biased algorithms are far-reaching and can be devastating. They can perpetuate discriminatory practices, deny individuals opportunities or services, and lead to a range of negative outcomes. It’s no longer just a moral obligation to avoid perpetuating inequalities; it’s a necessity to prevent negative consequences such as reputational damage and legal repercussions.

This article aims to drive home the importance of fairness in AI and provide practical tips for ensuring that algorithms are unbiased and free from discrimination. We will delve into the fairlearn library, a powerful open-source toolkit, to assess and mitigate bias in machine learning. Using fairlearn, we will implement a classification model in Python to predict income on the adult consensus set and evaluate its fairness.

Think about it: just as we strive to ensure fairness in our daily lives, we must also do the same with AI. We must do our best to eliminate bias and discrimination from our algorithms, just as we work to eliminate these issues from our society. By using fairlearn, we can ensure that our models are as fair and unbiased as possible, minimizing the risk of perpetuating discrimination and inequality. We can take steps to ensure that our algorithms are not influenced by factors such as race, gender, or age, and that they are making predictions based on merit alone.

What is Responsible AI?

Responsible AI is an approach that emphasizes fairness, transparency, accountability, and ethical considerations in the development and deployment of AI systems. It involves identifying potential sources of bias in AI and taking steps to address them proactively. To ensure responsible AI, organizations must develop policies and processes that prioritize fairness, transparency, and accountability throughout the AI development lifecycle.

How Real-World Bias Manifests in Biased Data

As we continue to develop increasingly sophisticated AI models like GPT-4, it is crucial to acknowledge that real-world data is not always objective and unbiased. Training models on biased data can lead to the perpetuation and even amplification of existing biases and disparities. Failing to address these issues can cement and exacerbate existing inequalities, with far-reaching consequences for society and individuals alike.

Bias in AI models often stems from the data they are trained on. Historical data, social norms, and cultural practices can all introduce bias into datasets, which AI models may then unwittingly learn and propagate. This can lead to AI systems making biased decisions, producing biased content, or perpetuating stereotypes and prejudices.

Ways in which AI systems can be biased

There are many ways in which AI systems can be biased and thus behave unfairly, including the following:

Gender Bias in Hiring: The data used by companies to train their hiring algorithms may be biased against certain genders or demographics. As a result, these algorithms may perpetuate existing gender inequalities by disproportionately rejecting female candidates.
Racial Bias in Healthcare: Medical data may be biased against certain racial or ethnic groups, which can lead to misdiagnosis, mistreatment, or under-treatment of certain populations. For instance, studies have shown that Black patients are less likely to receive appropriate pain management than white patients.
Economic Bias in Credit Scoring: Financial data used to determine credit scores may be biased against low-income individuals or marginalized communities, making it harder for them to access credit or loans. The result is a cycle of poverty, where individuals cannot access credit and are thus unable to build wealth.
Age Bias in Predictive Policing: Predictive policing algorithms may be biased against certain age groups, such as younger people, who may be more likely to be falsely flagged as potential criminals. This can lead to increased surveillance and potential harassment of certain populations.

These examples highlight the importance of identifying and addressing bias in real-world data. By doing so, we can ensure that our models and algorithms are fair, just, and equitable for all individuals, regardless of their background or demographic.

One of many biases is that in many industries, Women still earn less than their male colleagues. Image created with Midjourney.

Model Fairness and Performance

In machine learning models, there is often a delicate balance between fairness and performance. For instance, optimizing a model for high accuracy might inadvertently lead to discrimination against certain groups of people. To avoid these adverse effects, we must actively work to reduce the influence of unfairness and bias in our models.

Achieving this balance involves multiple strategies, such as gathering diverse and representative data, identifying and addressing sources of bias within the data, and devising techniques to counteract the effects of bias. Moreover, it’s essential to consistently monitor and evaluate our models to ensure they maintain fairness and impartiality over time.

Conversely, when prioritizing fairness, some degree of accuracy may be sacrificed. Striking the right balance between fairness and performance depends on the specific use case and consideration of the potential consequences of any decisions made. The ultimate goal is to create a model that is both fair and effective in performing the task it was designed for.

In summary, to create a well-balanced machine learning model, it’s crucial to acknowledge the trade-off between fairness and performance. By employing various strategies, such as using diverse data, addressing bias, and continuously monitoring the models, we can work towards developing models that are not only accurate but also fair and equitable for all users.

Organizations must take the necessary steps to ensure their algorithms are unbiased and not perpetuate the same inequalities still plaguing our society. Image created with Midjourney.

Assessing Model Fairness with the Fairlearn Library

Fairlearn is an open-source Python package, developed by Microsoft, that offers tools for assessing and mitigating unfairness in machine learning models. The library empowers developers and data scientists to pinpoint and address potential sources of bias in their data and models, ensuring that predictions and decisions are made fairly and transparently.

Fairlearn encompasses a variety of algorithms and metrics for evaluating and mitigating unfairness, including:

Group Fairness: This metric measures the difference in a model’s performance across various subgroups of the population, such as race or gender.
Demographic Parity: This metric works to ensure that the model’s predictions are distributed fairly across different subgroups of the population.
Equalized Odds: This metric guarantees that the model’s predictions maintain equal accuracy across different subgroups of the population.

In addition to these metrics, Fairlearn also offers various techniques for mitigating unfairness. These include reweighing data to balance different subgroups, adjusting the model’s predictions to meet specific fairness criteria, and enforcing fairness constraints during the training process. By utilizing Fairlearn, developers and data scientists can be confident that their machine learning models are more equitable and unbiased.

Assessing Model Fairness with the Fairlearn Library. Image created with Midjourney.

Mitigating Unfairness in Machine Learning

There are two options to address unfairness in machine learning models: consider fairness during training or add a post-processing step to mitigate unfairness as an additional layer. Both methods have their merits and drawbacks. The ideal approach depends on the context and objectives of the problem you’re tackling.

Considering fairness during training involves incorporating fairness constraints into the learning process. This approach can lead to more interpretable models, as fairness is embedded from the start. However, it may require specialized algorithms and could result in reduced performance.

On the other hand, adding a post-processing step to mitigate unfairness involves adjusting the model’s predictions after training. This method offers flexibility, as it can be applied to any pre-trained model. It may also preserve performance better. However, it might not provide as much interpretability, as fairness is addressed separately from the original model.

Ultimately, the choice depends on your specific goals, the model’s purpose, and the importance of interpretability versus performance.

Considering Fairness during Training with Fairness Constraints

Considering fairness during training involves incorporating fairness constraints or objectives into the model’s training process. This can be done in different ways:

by balancing the training dataset to ensure equal representation of different groups.
by adjusting the model’s loss function to penalize unfair predictions,
or by adding fairness constraints to the optimization problem.

The main advantage of this approach is that it can lead to models that are inherently fair without requiring any additional post-processing steps. However, it can be challenging to define and operationalize fairness. In addition, the fairness objectives may conflict with other objectives, such as accuracy or generalization.

There are several fairness constraints that can be used to ensure that machine learning models and decision-making processes do not discriminate against certain groups. Some of the most commonly used constraints include:

Demographic Parity: This constraint ensures that the proportion of positive outcomes is equal across different demographic groups.
Equalized Odds: This constraint ensures that the true positive rate and false positive rate are equal across different demographic groups.
Disparate Impact: This constraint ensures that the ratio of positive outcomes to the total number of outcomes is the same across different demographic groups.

For a full list of constraints, please view the library documentation.

The choice of fairness constraint depends on the specific context and goals of the decision-making process, and the constraints may need to be customized or combined to achieve the desired level of fairness.

Fairness Mitigation as a Postprocessing Layer

Another approach is to add a post-processing step to mitigate the unfairness of existing models. This method involves applying a fairness algorithm to the model’s outputs after training. Examples include reweighting predictions to ensure equal treatment of different groups, calibrating the model’s scores to remove bias, or using a fairness metric to adjust predictions.

The primary advantage of this approach is its applicability to existing models without the need for retraining. However, it might not tackle the root causes of unfairness in the model, and the fairness algorithm could introduce its own biases or trade-offs.

We can consider fairness towards sensitive features during the training of a machine learning model by using the fairness constraints. Image created with Midjourney.

Building A More Inclusive Machine Learning Model using FairLearn and Python

Let’s explore building a fair machine learning model using the census dataset, which includes demographic attributes to predict if a person earns more or less than $50k per year. As income prediction is sensitive, we’ll ensure fairness in our model.

We’ll create an initial fairness-unaware model, evaluate its fairness, and then use grid search to identify alternative models with varying performance and unfairness. This process consists of:

Load the data: Load the census dataset containing demographic attributes like age, education level, race, and gender.
Initial preprocessing and visualization: Preprocess the data by removing missing values, encoding categorical variables, scaling numerical features, and visualizing the data for insights.
Splitting and scaling the data: Split the data into training and testing sets and scale the features to make them comparable.
Training a fairness-unaware model and assessing fairness: Train a baseline model unaware of unfairness and evaluate its fairness using metrics like disparate impact and statistical parity difference.
Mitigating bias by training fairness-aware models with fairness constraints: Train models using demographic parity as a fairness constraint to mitigate bias and improve fairness.
Comparing the performance of the models: Compare the models’ performance in accuracy, fairness, and other metrics like false positives and false negatives.

By following these steps, we’ll build a fair machine learning model that is accurate, equitable, and just.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Responsible AI, and fairness in machine learning will gain further importance as the role of this exciting technology continues to become more influential in society.

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages. This will ensure a seamless learning experience and prevent any potential roadblocks or issues that may arise due to an improperly configured environment.

If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn
Plotly
Fairlearn

You can install the Fairlearn library and other packages using console commands:

pip install 
conda install  (if you are using the anaconda packet manager)

About the Adult Consensus Dataset

The Adult Consensus Dataset is a highly regarded and extensively used resource in machine learning and statistics, offering a comprehensive view of over 32,000 individuals across the United States. This invaluable dataset features 14 unique attributes, such as age, education, occupation, work class, marital status, and gender, among others. It also includes a binary target variable, identifying individuals earning over $50,000 annually.

However, the dataset presents challenges, as it contains sensitive demographic features like race and gender. If not handled properly, these features may lead to biased predictions and unfair outcomes. As a result, models trained on this dataset require careful design and evaluation to prevent perpetuating or amplifying existing biases and discrimination.

Despite its complexities, the Adult Consensus Dataset remains a popular choice for training and testing binary classification models. Its value extends further, serving as an excellent benchmark for testing and showcasing fair AI models. The dataset raises important questions about fairness, bias, and discrimination in machine learning, making it an essential resource for creating ethical and unbiased AI solutions.

Step #1 Load Preprocess the Data

We begin by loading the data using the fetch_openml function, making the necessary imports, and performing some initial preprocessing. The dataset is loaded into two data frames, X_df and y_df, where X_df contains the features and y_df contains the binary target variable, which is income with classes “>50k” and “<=50k”. The code below encodes these classes with 0 and 1.

Assessing model fairness is always done with respect to one or several sensitive variables. Although there are several features in the dataset that could be considered sensitive, optimizing the model to consider all of these features simultaneously would lead to significant complexity and increase training times. Therefore, in this tutorial, we will focus specifically on assessing model fairness with respect to sex and race. The sex feature contains two classes, “male” and “female,” while the race feature contains classes for “Black,” “White,” and several further classes.

# A tutorial for this file will be shortly available at www.relataly.com
# Tested with python 3.9.13, matplotlib 3.6.2, numpy 1.23.3, seaborn=0.12.1, fairlearn=0.8.0, plotly 5.11.0

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics as skm
from fairlearn.metrics import MetricFrame, selection_rate, selection_rate_difference, count, plot_model_comparison
from fairlearn.reductions import DemographicParity, ErrorRate, GridSearch
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

# read the adult data
# predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.
# https://archive.ics.uci.edu/ml/datasets/adult

from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, parser='auto', as_frame=True)

X_df = data.data.copy()
data.target = data.target.replace({' <=50K': 0, ' >50K': 1})
y_df = data.target

# Declare sensitive features
sensitive_variables = ["sex", 'race']

# join X_df and y_df and name y_df as 'income'
X_df['salary >50k'] = y_df

# show df that contains the target column
X_df.head()

	age	workclass	fnlwgt	education	race	sex		...		hours-per-week	salary >50k
0	25	4			226802	1			Black	Male	...		40				0
1	38	4			89814	11			White	Male	...		50				0
2	28	2			336951	7			White	Male	...		40				1
3	44	4			160323	15			Black	Male	...		40				1
4	18	0			103497	15			White	Female	...		30				0

Step #2 Initial Preprocessing and Visualization

Let’s continue with some initial preprocessing and data visualization. First, we encode the categorical features in the X_df data frame using the LabelEncoder function from scikit-learn. This is important, as the fairlearn gridsearch technique that we will use in this tutorial only works with numeric features.

Next, to simplify the tutorial and reduce model training time, we will filter the dataset to only include people of races “Black” and “White.” This will result in four sensitive feature permutations:

Black, Woman
Black, Man
White, Woman
White, Man

Finally, we create two data frames. The X_df data frame contains the features we will be using to train our model, but it does not contain the target label anymore. The other data frame, df_visu, does contain the target label and will be used for visualization purposes. Finally, we use df_visu to visualize how the target variable is distributed with respect to our sensitive features.

# encode the categorical features
X_df['workclass'] = X_df['workclass'].cat.codes
X_df['education'] = X_df['education'].cat.codes
X_df['marital-status'] = X_df['marital-status'].cat.codes
X_df['occupation'] = X_df['occupation'].cat.codes
X_df['relationship'] = X_df['relationship'].cat.codes

# filter some values to reduce tutorial complexity and make it easier to visualize
X_df = X_df.loc[X_df["race"].astype(str).isin([" White", " Black"])]
X_df.race = X_df.race.cat.remove_unused_categories()

# ensure that the target column has the same filter applied
y_df = X_df['salary >50k']

# create a copy of the df for visualization
df_visu = X_df.copy()

# drop the target column from X_df
X_df.drop(['salary >50k'], axis=1, inplace=True)

# visualize the distributions of the cateogical variables
fig, ax = plt.subplots(1, 3, figsize=(16, 5))
sns.countplot(x='sex', hue='salary >50k', data=df_visu, ax=ax[0])
sns.countplot(x='race', hue='salary >50k', data=df_visu, ax=ax[1])
sns.kdeplot(x='age', hue='salary >50k', data=df_visu, ax=ax[2])

# rotate x_labels
plt.sca(ax[1])
plt.xticks(rotation=30)

All three plots reflect inequality in our society:

The percentage of women with a yearly income above 50k is much lower than that of men.
With respect to race, this bias is even more extreme, with way more white people earning more than 50k than black people.
Although we won’t use age as a sensitive label, it is noteworthy that the percentage of people earning more than 50k declines with growing age.

Step #3 Splitting and Scaling the Data

The code performs several preprocessing steps. First, the data is split into the features (X_df) and the target variable (y_df). Next, the categorical features in X_df are encoded using the LabelEncoder function from scikit-learn. The data is then split into training and test sets. Records are stratified based on the target variable to ensure that the proportion of each class is maintained in both sets.

It is important to note that the sensitive variables, sex and race, are included in the splitting process to ensure that the data is divided in a fair manner. Specifically, we create a new data frame A containing only the sensitive variables. We then split the data into training and test sets using the stratify parameter based on Y_encoded (the encoded version of y_df) and A.

Finally, the features in X_df are scaled using the StandardScaler function from scikit-learn to ensure that all features are on the same scale. The resulting scaled features are then converted back into a pandas data frame, X_scaled.

# create a dataframe with sensitive features
A = X_df[sensitive_variables]
X = X_df.drop(labels=sensitive_variables, axis=1)
X = pd.get_dummies(X)

# Scale the data
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

# Encode the target variable
le = LabelEncoder()
Y_encoded = le.fit_transform(y_df)

# We split the data into training and test sets:
X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(
    X_scaled, Y_encoded, A, test_size=0.4, random_state=0, stratify=Y_encoded
)

# Work around indexing bug
X_train = X_train.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)

Step #4 Train a Fairness-Unaware Model and Assess its Fairness

Now it’s time to train our first model. This initial model will be entirely fairness unaware, meaning its predictions will reflect any patterns of unfairness present in the data.

Model training

We’ll begin by training a baseline fairness unaware decision tree classifier with default settings. The model will be trained on the training set (X_train and Y_train) and then tested on the test set (X_test and Y_test).

# Define evaluation metrics for performance and fairness
performance_metric = skm.accuracy_score
fairness_metric = selection_rate_difference

unmitigated_predictor = DecisionTreeClassifier(random_state=0)
unmitigated_predictor.fit(X_train, Y_train)

Evaluation and Visualization

To evaluate the model’s performance, we’ll use two crucial metrics: the accuracy score and the “selection rate difference”. The latter is a vital fairness metric that measures the difference between selection rates of two distinct groups for a specific outcome or prediction.

In machine learning, this metric often helps determine the fairness of a model, especially with sensitive features like race, gender, or age. A selection rate difference of zero indicates that the model is making predictions fairly and without bias, as both groups have equal selection rates.

However, we must pay attention to a positive or negative selection rate difference. It implies that the model is favoring one group over the other, which could signal bias or discrimination. Carefully monitoring this metric is critical to ensure the model is performing fairly and justly.

Lastly, we’ll visualize the results in a bar chart using the accuracy and selection rate by group. This visualization will provide a comprehensive view of the model’s performance and fairness, allowing us to make informed decisions about potential improvements or adjustments needed to strike the right balance between accuracy and fairness.

metric_frame = MetricFrame(
    metrics={
        "accuracy": performance_metric,
        "selection_rate": fairness_metric,
        "count": count,
    },
    sensitive_features=A_test,
    y_true=Y_test,
    y_pred=unmitigated_predictor.predict(X_test),
)

# plot the metrics
print(metric_frame.overall)
print(metric_frame.by_group)
metric_frame.by_group.plot.bar(
    subplots=True,
    layout=[1, 3],
    legend=False,
    figsize=[12, 4],
    title="Accuracy and selection rate by group",
)

accuracy              0.854136
selection_rate        0.197373
count             18579.000000
dtype: float64
                accuracy  selection_rate    count
sex     race                                     
 Female  Black  0.948665        0.053388    974.0
         White  0.921083        0.083873   5246.0
 Male    Black  0.877378        0.161734    946.0
         White  0.813371        0.264786  11413.0
array([[,
        ,
        ]],
      dtype=object)

When we look at the selection rate, we can see that the model is intrinsically biased. To be explicit, if this model would go into production, a Black Woman would have a significantly lower chance (5%) to be attributed a +50k income than a White Man (above 25%), even if all other attributes (age, education, and so on) are the same.

Step #5 Train Fairness-Aware Models with Gridsearch

We perform a grid search with demographic parity as a fairness constraint to find the best decision tree classifier model that is both accurate and fair. We evaluate different models with different hyperparameters for the decision tree classifier. The best predictors with the lowest error rates and lowest disparities (in terms of demographic parity) are selected as the dominant models.

We call these models “dominant” because they stand out as superior solutions in multi-objective optimization. “Dominant” implies that these models outshine others across multiple evaluation metrics without falling short in any of the metrics. Dominant models are preferred for their optimal balance of performance and fairness.

During model training, we’ll incorporate demographic parity as a fairness constraint. Demographic parity seeks to ensure equal proportions of positive outcomes (e.g., being hired, approved for a loan, etc.) across various demographic groups (e.g., gender, race, or age) in a population. In simpler terms, it requires the decision-making process not to discriminate against any group based on their demographic characteristics. Mathematically, it’s expressed as P(prediction=1|sensitive_feature=0) = P(prediction=1|sensitive_feature=1), where sensitive_feature is a binary variable indicating membership in a specific demographic group. Achieving demographic parity helps reduce discrimination and promote fairness in decision-making processes. However, depending on the context and goals of the decision-making process, demographic parity might not always be the most appropriate fairness constraint.

Depending on the hardware you are using, training can take several minutes.

# Define and run GridSearch with DemographicParity 
sweep = GridSearch(
    DecisionTreeClassifier(random_state=0),
    constraints=DemographicParity(), # DemographicParity() is the default
    grid_size=100, # number of models to evaluate
)
sweep.fit(X_train, Y_train, sensitive_features=A_train)

# The best predictor is the one with the lowest error rate
predictors = sweep.predictors_

errors, disparities = [], []
for p in predictors:
    
    def classifier(X):
        return p.predict(X)

    # Calculate the error rate of the classifier
    error = ErrorRate()
    error.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train)
    errors.append(error.gamma(classifier)[0])

    # Calculate the disparity in the predictions
    disparity = DemographicParity()
    disparity.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train)
    disparities.append(disparity.gamma(classifier).max())

# Consolidate the results into a single dataframe
grid_results = pd.DataFrame(
    {"predictor": predictors, 
     "error": errors, 
     "disparity": disparities}
)

# Loop through the grid and select the relevant models 
non_dominated_models = []
for model in grid_results.itertuples():
    # only select models with lower disparity compared to the other models
    if model.error < grid_results["error"][grid_results["disparity"] < model.disparity].min():
        non_dominated_models.append(model.predictor)

That’s it; now we have trained several fairness-aware models. Next, let’s look at the performance and fairness of these models.

Step #6 Comparing Dominant and Unmitigated Models

Finally, we’ll evaluate and compare our models’ performance. We’ll assess the dominant models and the unmitigated model using evaluation metrics for performance and fairness. The performance metric we’ll use is the accuracy score, and for fairness, we’ll use the selection rate difference. To do this, we’ll predict on the test data using the unmitigated model, then add predictions for all relevant dominant models.

Next, we’ll create a scatterplot for the models based on their accuracy and selection rate. We’ll use the x-axis for performance metrics and the y-axis for fairness metrics. We’ll also include point labels and display the plot. This visualization will help us understand the trade-offs between performance and fairness in our models.

# We can evaluate the dominant models along with the unmitigated model.

# Use the unmitigated model to predict on the test data
predictions = {"unmitigated_model": unmitigated_predictor.predict(X_test)}

# Add predictions for all relevant dominant models
metric_frames = {"unmitigated_model": metric_frame}
for i in range(len(non_dominated_models)):
    key = "dominant_model_{0}".format(i)
    predictions[key] = non_dominated_models[i].predict(X_test)

    metric_frames[key] = MetricFrame(
        metrics={
            "accuracy": performance_metric,
            "selection_rate": selection_rate,
            "count": count,
        },
        sensitive_features=A_test,
        y_true=Y_test,
        y_pred=predictions[key],
    )

# scatterplot for the models along their accuracy and selection rate
plot_model_comparison(
    x_axis_metric=performance_metric,
    y_axis_metric=fairness_metric,
    y_true=Y_test,
    y_preds=predictions,
    sensitive_features=A_test,
    point_labels=True,
    show_plot=True,
)

Upon examination, it becomes apparent that we are presented with multiple dominant models that have a marginal difference in their selection rates as well as accuracy levels. These models, while exhibiting slightly lower accuracy, are less influenced by race or gender than the fairness-unaware model. As such, these models may be considered viable candidates for deployment in the model selection process.

It is worth noting, however, that while these models appear to be less prone to bias based on sensitive features, they may still exhibit unfairness towards other features that have not been declared sensitive.

Summary

The significance of responsible AI and fairness in machine learning cannot be overstated. With AI systems becoming increasingly integrated into our lives, it’s essential that we proactively address issues of bias and discrimination.

In this article, we’ve taken a small but vital step towards a more equitable world by developing machine learning models that are fair with respect to sensitive features. Our efforts have been focused on ensuring that our models are less biased and more equitable, using techniques such as reweighting and threshold optimization.

The negative consequences of biased models cannot be ignored. They can lead to discrimination and reputational damage, making it even more important to ensure that our AI systems are fair and just. As AI technology continues to advance, there is an increased need for responsible AI and fairness in AI systems, particularly with the development of more powerful generative models that have the potential to revolutionize various industries.

By taking steps to address issues of bias and discrimination in AI systems, we can ensure that these systems promote positive outcomes and benefit everyone equally. We must strive towards a more equitable and just world through responsible AI and fairness in machine learning.

When used wisely, AI and machine learning can help create a fairer and more equitable world. Image created with Midjourney.

Sources and Further Reading

Books on Responsible AI and Applied Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Using Fairlearn to Build Fair Machine Machine Learning Models with Python: Step-by-Step Towards More Responsible AI appeared first on relataly.com.

Generating Detailed Images with OpenAI DALL-E and ChatGPT in Python: A Step-By-Step API Tutorial

Florian Follonier — Sat, 21 Jan 2023 23:20:41 +0000

In this article, we will explore how to automate the creation of AI-generated art by integrating DALL-E with ChatGPT using the respective APIs in Python. ChatGPT, the state-of-the-art language model developed by OpenAI, has recently made waves in the tech community for its exceptional language abilities, such as code generation, prompt answering, and text completion. DALL-E, another powerful language model developed by OpenAI, specializes in generating images from text prompts. This tutorial will utilize the OpenAI GPT3-API to generate a detailed and specific prompt for DALL-E. We will then use the prompt in a request to the DALL-E API to generate images. The generated images will be displayed and saved for future use.

Throughout the tutorial, we will provide clear explanations and code snippets to guide you through the process. By the end of this tutorial, you will have a comprehensive understanding of how to automate prompt generation for DALL-E with Python. So, let’s get started!

If you’re new to the OpenAI API, my recent API tutorial on ChatGPT and other OpenAI language models might be helpful to check out.

Also: Mastering Prompt Engineering for ChatGPT for Business Use

Images created with automated prompt generation for OpenAI DALL-E using ChatGPT in Python

Generating AI-Art using DALL-E: How it Works

DALL-E is a language model developed by OpenAI that is capable of generating images from text prompts. It uses deep learning techniques to understand the input text and generate images that are related to the text. DALL-E is trained on a massive dataset of images and texts, allowing it to generate a wide range of images. This allows the model to take text prompts as input and generate images that are related to the prompt.

The input to a language model is what we call a prompt. A prompt to GPT typically includes a general instruction, a specific topic, and additional keywords (for example, digital art, oil painting, etc.). The DALL-E model then uses this information to generate the images. Usually, four different images are generated per request. DALL-E can do various other things, such as completing or altering existing images. However, this article will focus on image generation.

The images generated by DALL-E can be of various types, such as illustrations, drawings, photographs, etc. While the quality of the images is generally surprisingly good, the model can do some things, certainly better than others. For example, you will find that human faces and bodies sometimes lead to odd results. But, in general, images are often surprisingly coherent with the provided text prompts.

AI-generated images can be useful for various applications. These include creating AI-generated art, creating illustrations for books, creating images for social media, creating product images for e-commerce, and more. In many cases, generating images from text prompts can save time and resources, as it eliminates the need for manual image creation.

UI of the OpenAI DALL-E2 service for AI-generated images. The AI services generate images based on a manual prompt defined by the user.

Automated DALL-E Prompts using ChatGPT

he usual way to generate images with DALL-E is via a manual prompt on the OpenAI website. However, DALL-E also offers an API, which allows for automating image generation. Of course, you could send a manually written prompt to the API and automatically process the response. However, if you want to optimize the quality of the generated images or consistently automate the whole process, there is a better way of doing this using ChatGPT.

You may have heard of ChatGPTs abilities to complete and generate high-quality text. However, few people know that ChatGPT can also generate prompts for DALL-E. This works by simply telling the model to generate a prompt for DALL-E and then specifying the topic for which you want to create the prompt. An example prompt could be:

"generate a prompt for DALL-E on a robot on the beach".

And the response from ChatGPT:

The gleeful robot lounged on the sun-drenched beach, soaking up the warm rays and listening to the soothing crash of the waves. It wore a bright, multicolored swimsuit and a wide-brimmed hat to protect its circuitry from the intense heat. It smiled contentedly as it watched the seagulls soar through the azure sky and the cheerful children playing in the foamy surf. The salty-sweet scent of the sea filled its senses, and it felt truly relaxed and rejuvenated.

As you can see, the response is very detailed and uses a lot of adjectives. As a result, the images generated with these prompts are often highly creative and detailed. Below is the result from DALL-E for this specific prompt:

Example DALL-E creation for a ChatGPT-generated prompt using the command: “generate a prompt for DALL-E on a robot on the beach.”

Why You May Want to Automate Prompt Generation for DALL-E

Automating prompt generation for DALL-E has several benefits. Some of the reasons why you may want to automate prompt generation include the following:

Efficiency: Automating the prompt generation process can save time and resources as it eliminates the need for manual input.
More Details: The prompts generated by ChatGPT are typically more detailed than what humans typically use to generate images.
Consistency: By using a language model like ChatGPT to generate prompts, you can ensure that the prompts are grammatically correct and well-formed, which can improve the quality of the generated images.
Variety: By using ChatGPT to generate prompts, you can include additional keywords, making the generation more diverse and less repetitive.
Automation: When incorporating AI-generated images into an integrated process, utilizing APIs to automate the process is essential. For instance, an integration with Twitter can be implemented where ChatGPT automatically picks up keywords from tweets and generates images based on those keywords, which can be published on Twitter.

Using ChatGPT to generate DALL-E prompts allows for more efficient and accurate image generation and the ability to generate more detailed images. It also reduces the need for manual input and allows you to integrate image generation models into complex processes.

The images generated based on the ChatGPT prompts are often superior in detail and creativity.

Automated Dall-E Prompt Generation using ChatGPT in Python

In the following, we will generate a Python script that integrates DALL-E with ChatGPT to create AI-generated images from keywords or short descriptions.

Here are the general steps involved in generating DALL-E prompts using ChatGPT in Python:

To use the OpenAI models, we will first need to authenticate with the OpenAI API by providing our API key. In the next section, we will look at how you can register for a key. We will also briefly discuss the costs of using OpenAI models.
Define a ChatGPT Prompt: An image prompt is the text input that the DALL-E model uses to generate a response. We will use ChatGPT to generate this prompt
Generate a Prompt Design with ChatGPT: Generate a response: Once we have a prompt, we can use the GPT-3 model to generate a prompt for DALL-E.
Send the prompt to DALL-E API: The response obtained from the above step is sent to DALL-E API to generate the images.
Process the Image Response from DALL-E: Once we have the image from the DALL-E API, we print the images and save them to a local folder using.

Let’s get started!

View on GitHub Relataly GitHub Repo

Register for an OpenAI API Key

To use the OpenAI API, you will first need to register for an API key by visiting the OpenAI website and creating an account. During the registration process, you will be required to provide some basic information about yourself and the project you are working on. In addition, you will need to add a payment method. The cost per request is a couple of cents, depending on which model you use. In this tutorial, we will use the Davinci model, which is 0.02$ per 1000 tokens.

It’s important to note that while GPT-3 is currently available in a free test version, the OpenAI API itself is not free. If you only plan to send a few test requests, the costs will be minimal, but if you integrate the API with a successful application that runs in production, the costs can quickly accumulate.

Each language model offered by OpenAI has a different price tag, and charges depend on various factors. For language models, charges are based on the number of tokens sent to the model and the type of the model.

Prices for image models depend on the resolution at which you generate the images.

I recommend monitoring usage and keeping track of costs to avoid unexpected charges. To manage costs, you can set up a quota on the costs in the OpenAI portal under your profile. This will help you to keep an eye on the costs and keep them within your budget.

Overview of Prices for OpenAI language models (as of 2023-22-01).

Overview of prices for OpenAI image models (as of 2023-22-01).

Technical Setup

Before diving into the code, it’s essential to ensure that you have the proper setup for your Python 3 environment and have installed all the necessary packages. If you do not have a Python environment, you can follow the instructions in this tutorial to set up the Anaconda Python environment. This will provide you with a robust and versatile environment that is well-suited for machine learning and data science tasks.

Before you can start generating DALL-E prompts, you’ll need to install the OpenAI library for Python, which provides access to the GPT-3 model. You can do this by running “pip install openai” in your command line.

In addition, this tutorial will work with matplotlib and PIL and standard libraries such as yaml, os, and datetime. You can install the OpenAI Python library using console commands:

pip install
conda install (if you are using the anaconda packet manager)

Step #1 Imports and Autenthication at the OpenAI API

To begin, we import the required libraries and provide our API key for authorization. I recommend to store the API key in a separate file, such as api_config_openai.yml. The code below will read the API key from this file and make it accsessible in the code. You can also place the API key directly in the code, but be mindful not to make it publicly accessible.

With the API key set up, you can use the OpenAI API’s “Model.list()” function to retrieve a list of available models. For this you have to uncomment the last four lines in the following code snippet.

import openai
import yaml
import urllib.request
from PIL import Image
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import os
from datetime import datetime
# set the API Key 
yaml_file = open('API Keys/api_config_openai.yml', 'r')  
p = yaml.load(yaml_file, Loader=yaml.FullLoader)
openai.api_key = p['api_key']
# show available openai language models (this tutorial uses davinci003)
# modellist = openai.Model.list()
# for i in modellist.data:
#     print(i.id)

Step #2 Define a ChatGPT Prompt

Next, we define the prompt for ChatGPT in which we request a prompt for DALL-E.

Defining Prompts

When we define prompts for ChatGPT there are a few things to keep in mind. The primary focus should be to provide a clear and specific task or question. Although the AI can still function with incomplete or incorrect information, providing it with detailed instructions will improve its performance. In addition, keyword relevance is essential.

To ensure the AI tool produces the desired results, it’s vital to use relevant keywords in your input. The tool must first understand the input accurately before it can generate the expected output. A well-crafted prompt can improve the tool’s performance and accelerate your progress.

DALL-E also knows certain keywords, that will send the model in the one or the other direction. For example, you can define the type of image you want to be generated by adding keywords such as oil paining, aqurael painting, digital art, or Van-Gogh style. This article provides a good overview of these keywords.

Send the Request to the OpenAI Language Model

We encapsulate this request in a function called “send_openai_request” that takes in three parameters: “engine,” “prompt,” and “max_tokens”. The function uses the OpenAI API’s “Completion.create()” method to send a request to the specified engine with the provided prompt and maximum token limit.

In the code below, we have created a simple start phrase called “prompt_base”: “Generate a detailed Dall-E prompt with several adjectives for”. You can then simply add the topic for which you want to generate the images as “prompt_details.” Alternatively, you can append “additional_keywords” to the prompt that will be added after the language model has generated the prompt for DALL-E.

There are different language models available but the one that creates the most detailed results and is closest to ChatGPT is “text-davinci-003”. This is the version used in this tutorial. Finally, we send the request and print out the response with the generated prompt.

The prompt is constructed using a base prompt, which is a general instruction to the model, a specific topic “sugar castle” and additional keywords “digital art”. The function “send_openai_request” is defined to handle the request to the language model, it takes three parameters “engine” (model version), “prompt” (the instruction for the model) and “max_tokens” (maximum number of tokens in the response). We are sending a request to the OpenAI API to generate a response, which is returned and stored in the variable “response”.

# define the request
def send_openai_request(engine, prompt, max_tokens=1024):
    response = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        max_tokens=max_tokens,
        n=1,
        stop=None,
        temperature=0.7
    )
    return response
# define the prompt to the language model
prompt_base = "Generate a detailed Dall-E prompt with several adjectives for " # an introduction text telling the language model what to do
prompt_details = "sugar castle" # the topic for which you wish to generate the images 
additional_keywords = ",digital art" # these keywords will be added after the language model generated the prompt. Example: "digital art", "oil painting", "water color painting", "high quality"
model="text-davinci-003" # the version of the openai language model
# generate a response
response = send_openai_request(model, prompt_base + prompt_details)
# print the response from the language model
generated_prompt = response["choices"][0]["text"]
print(generated_prompt)

Make me a picture of a majestic, shimmering, sparkling sugar castle with a dazzling crystal spire and enchanting turrets, surrounded by an emerald green moat and a towering rainbow-hued wall.

Step #3 Generate a Prompt Design with ChatGPT

Next, we specify the parameters for generating images using the OpenAI DALL-E API. We have set the variable “number_of_images” to 2, which means it will generate 2 images. In addition, we set the “image_size” to “512×512,” which is the size of the images to be generated.

We create the final prompt to DALL-E called “image_generation_prompt” by combining the “generated_prompt” variable obtained from the previous text prompt and the “additional_keywords” variable.

Finally, we are printing out a message indicating that DALL-E will generate the specified number of images at the specified size, using the provided prompt.

# image parameters
number_of_images = 2 # how many images you want to generate
image_size = "512x512" # the size of the images
image_generation_prompt = f"{generated_prompt} {additional_keywords}"
print(f"Dall-e will generate {number_of_images} images {image_size} using the following prompt: {image_generation_prompt}")

Dall-e will generate 2 images 512x512 using the following prompt: 
Make me a picture of a majestic, shimmering, sparkling sugar castle with a dazzling crystal spire and enchanting turrets, surrounded by an emerald green moat and a towering rainbow-hued wall. digital art

Step #4 Send the Generated Prompt to the DALL-E API

Next, we send the request to the OpenAI DALL-E API using the “image_generation_prompt” variable created earlier. We are sending the request using OpenAI’s “Image.create()” method, which takes the prompt, number of images, and size as parameters. The response from the API is stored in the “response” variable.

We are looping through the response data and appending the URLs of each image in the list. Finally, we call the function get_images on the response, storing the resulting image URLs in the “image_list” variable and displaying the image URLs in the output.

# define and send the request to dall-e with the generated prompt
response = openai.Image.create(
    prompt=image_generation_prompt,
    n=number_of_images,
    size=image_size,
)
# set the timestamp for data processing
timestamp_string = response.created
datetime_string = datetime.fromtimestamp(timestamp_string).strftime("%Y%m%d%H%M%S")
# get the image(s) from the response
def get_images(response):
    # generate an empty list for the image urls
    image_list = []
    # store the image urls in the list
    for imgurl in response.data:
        image_list.append(imgurl.url)
    return image_list
image_list = get_images(response)
#display image urls
print(image_list)

['https://oaidalleapiprodscus.blob.core.windows.net/private/org-eO72e4aFm9XJBw4sb91Z8XEX/user-9ZRwLxYMDBxvw6gLEywP44xa/img-BdkqLm65g8QLIpLbgHhDkxbf.png?st=2023-01-22T19%3A03%3A51Z&se=2023-01-22T21%3A03%3A51Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-01-22T17%3A19%3A13Z&ske=2023-01-23T17%3A19%3A13Z&sks=b&skv=2021-08-06&sig=vQrKcGSKmxXKDikYpx9xvmHeBRcdQoxRH%2B8%2BjYogjz8%3D', 'https://oaidalleapiprodscus.blob.core.windows.net/private/org-eO72e4aFm9XJBw4sb91Z8XEX/user-9ZRwLxYMDBxvw6gLEywP44xa/img-YnicWoD32S1DBBYzyhMnEHjo.png?st=2023-01-22T19%3A03%3A51Z&se=2023-01-22T21%3A03%3A51Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-01-22T17%3A19%3A13Z&ske=2023-01-23T17%3A19%3A13Z&sks=b&skv=2021-08-06&sig=o2kK5GIKX7lIqUaFQs8e1Sa3JWRyEZr6cVfKYPAjH%2BY%3D']

Step #5 Process the Image Response from DALL-E

Now that we have the image URLs, let’s see what DALL-E has generated. We use the Matplotlib library to display the generated images. Then we define a save path for the images using the timestamp and the details of the prompt. As the following code iterates over the URLs of the images, it stores the images with a unique filename for each image and saves them in the created directory.

# display the images
fig, axs = plt.subplots(nrows=1, ncols=len(image_list), figsize=(10, 10))
for i, imgurl in enumerate(image_list):
    ax = axs[i]
    img = mpimg.imread(imgurl)
    imgplot = ax.imshow(img)
    ax.set_xticks([]); ax.set_yticks([])
# define and create the save path
save_path = f"dall-e_images/{datetime_string}_{prompt_details.replace(' ', '-')}"
os.makedirs(save_path)
print(f"images stored under the following path: {save_path}")
# store the images
for i, imgurl in enumerate(image_list):
    # set the file name
    filename = f"{datetime_string}_dall-e{i}.png"
    # save the image
    img = mpimg.imread(imgurl)
    mpimg.imsave(f'{save_path}/{filename}', img)

images stored under the following path: dall-e_images/20230122210351_sugar-castle

Wow, what a beautiful sugar castle! That’s it, now the images are stored on your local computer, and you can process them further.

Summary

Generating DALL-E prompts using ChatGPT in Python is a powerful and flexible way to create unique images from text prompts. By following the steps outlined in this tutorial, you can use the OpenAI library and GPT-3.5 model to create AI-generated images for various topics. This process can be repeated with different prompts to generate a wide variety of images.

You can now experiment with the latest advancements in natural language processing and image generation to create your own unique and captivating images.

I hope this article has provided a useful introduction to working with OpenAI’s language models in Python and that you will continue to explore the full range of capabilities offered by the API.

If you have any questions, let me know in the comments.

Sources and Further Readings

OpenAI Chat
OpenAI Guides
OpenAI API Reference
Dall-E API Reference
TechCrunch OpenAI News
Using OpenAI language models via APIs in Python
Relataly.com – Whats the Business Value of GPT-3?
Relataly.com Business Use Cases for OpenAI GPT-3
DALL·E 2: The Ultimate Guide
OpenAI ChatGPT was used to revise this article
Images generated using Dall-E and Midjourney AI for Image Generation from Text

The post Generating Detailed Images with OpenAI DALL-E and ChatGPT in Python: A Step-By-Step API Tutorial appeared first on relataly.com.

Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python

Florian Follonier — Sun, 08 Jan 2023 20:34:44 +0000

Predictive maintenance is a game-changer for the modern industry. Still, it is based on a simple idea: By using machine learning algorithms, businesses can predict equipment failures before they happen. This approach can help businesses improve their operations by reducing the need for reactive, unplanned maintenance and by enabling them to schedule maintenance activities during planned downtime. In this article, we’ll explore the use of machine learning algorithms to predict machine failures using the robust XGBoost algorithm in Python. By the end of this tutorial, you’ll have the knowledge and skills to start implementing predictive maintenance in your organization. So, let’s get started!

We begin by discussing the concept of predictive maintenance and show different ways to implement it. Then we will turn to the coding part in python and implement the prediction model based on machine sensor data. We train a classification model that predicts different types of machine failure using XGBoost.

Predictive maintenance is a game-changer for the modern industry. Image generated with Midjourney.

What is Predictive Maintenance?

Predictive maintenance is a data-driven approach that uses predictive modeling to assess the state of equipment and determine the optimal timing for maintenance activities. This technique is particularly beneficial in industries that heavily rely on equipment for their operations, such as manufacturing, transportation, energy, and healthcare. Depending on the requirements and challenges of an organization, predictive maintenance may contribute to one or several of the following goals:

Improve equipment reliability: By proactively identifying and addressing potential problems with equipment, predictive maintenance can help improve the reliability of the equipment, reducing the risk of unexpected downtime or failure.
Increase efficiency: Predictive maintenance can help improve the efficiency of equipment by identifying and fixing problems before they cause equipment failure or downtime. This can help reduce maintenance costs and increase productivity.
Improve safety: Predictive maintenance can help improve safety by identifying and addressing potential problems with equipment before they occur. This can help prevent accidents and injuries caused by equipment failure.
Reduce maintenance costs: By proactively identifying and fixing potential problems with equipment, predictive maintenance can help reduce the overall cost of maintenance by minimizing the need for unscheduled downtime.
Improve asset management: Predictive maintenance can help improve asset management by providing data and insights into the condition and performance of equipment. This can help organizations decide when to replace or upgrade equipment.

Next, we look at the different ways organizations can implement predictive maintenance.

Utilities and manufacturing are only two of the many industries that use predictive maintenance. Image generated with Midjourney.

Approaches to Predictive Maintenance

There are several approaches to implementing a predictive maintenance solution, depending on the type of equipment being monitored and the resources available. These approaches include:

Condition-based monitoring: This involves continuously monitoring the condition of the equipment using sensors. When certain thresholds or conditions are met, an alert is triggered, or corrective measures are launched. The goal is to reduce the risk of failure. For example, if the temperature of a motor exceeds a certain level, this may indicate that the motor is about to fail.
Predictive modeling: This approach involves using machine learning algorithms to analyze historical lifetime data about the equipment to identify patterns that may indicate an impending failure. This can be done using data from sensors, as well as operational data and maintenance records. When historical or failure data is not available, a degradation model can be created to estimate failure times based on a threshold value. This approach is often used when there is limited data available.
Prognostic algorithms: By using data from sensors and other sources, prognostic algorithms can predict the remaining useful life of a piece of equipment. This information can help organizations determine the likelihood of a breakdown and plan for replacements or maintenance activities. By understanding the equipment better, organizations can potentially extend maintenance cycles, which can reduce costs for replacements and maintenance.

It is important to choose an approach that is appropriate for the specific equipment and maintenance challenges faced by the organization.

Data Requirements

When implementing predictive maintenance, it is important to consider that each approach comes with its own set of data requirements. Types of data include the following:

Current condition data includes information about the state of the equipment, such as its temperature, pressure, vibration, and other physical parameters.
Operating data includes information about how the equipment is being used, such as its load, speed, and other operating parameters.
Maintenance history data includes information about past maintenance activities that have been performed on the equipment.
Failure history data includes information about past equipment failures, such as the date of the failure, the cause of the failure, and the impact on operations.

Collecting these data requires investing in sensors and other data collection infrastructure and ensuring that data collection is accurate and storage is proper. By combining various data types, organizations can create a comprehensive view of equipment condition and performance and use it to predict maintenance requirements.

The specific types of data needed will depend on the implementation approach. Organizations must ensure they have access to the necessary data to implement the selected approach effectively. Some specific data requirements for each approach include the following:

Approach	Data Requirements
Condition-based monitoring	Sensor data from the equipment being monitored.
Predictive modeling	A combination of sensor data, operational data, and maintenance records.
Prognostic algorithms	Sensor data, as well as data about past failures and maintenance events.

Data requirements per implementation approach

Predictive maintenance – Machine learning can make maintenance cycles more cost-efficient. Image generated using Midjourney

Predicting Failures in Milling Machines using XGBoost in Python

Now that we have a basic understanding of predictive maintenance, it’s time to get hands-on with Python. We will use sensor data and machine learning to predict failures in milling machines. But why do these machines break down in the first place? Milling machines have many moving parts that can suffer from wear and tear over time, leading to failures. Additionally, improper maintenance can cause issues with machine operation and lead to costly damage. Efficient maintenance can be challenging due to the varying loads that milling machines are subjected to. However, by implementing a predictive maintenance solution with Python, we can proactively identify and address issues to prevent costly downtime and ensure the smooth operation of our milling machines. Our goal is to predict one of five failure types, which corresponds to a predictive modeling approach. Let’s get started on building our predictive maintenance solution.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

Image of a CNC milling machine. Image created with Midjourney

Prerequisites

Before starting the coding part, make sure that you have set up your Python 3 environment and required packages.

Python Environment

Before diving into the FairLearn Python tutorial, it is important to take the necessary steps to ensure that your Python environment is properly set up and that you have all the required packages installed. This will ensure a seamless learning experience and prevent any potential roadblocks or issues that may arise due to an improperly configured environment.

If you don’t have an environment, follow this tutorial to set up the Anaconda environment.

Python Packages

Make sure you install all required packages. In this tutorial, we will be working with the following packages:

Pandas
NumPy
Matplotlib
Seaborn
Plotly

In addition, we will be using the machine learning library Scikit-learn and the XGBoost library, which is a popular library for training gradient-boosting models.

You can install packages using console commands:

pip install 
conda install  (if you are using the anaconda packet manager)

About the Sensor Dataset

In this tutorial, we will work with a synthetic sensor dataset from the UCL ML archives that simulates the typical life cycle of a milling machine. The dataset contains the following fields:

The dataset consists of 10 000 data points stored as rows with 14 features in columns:

UID: unique identifier ranging from 1 to 10000
productID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number
air temperature [K]
process temperature [K]
rotational speed [rpm]
torque [Nm]
tool wear [min]
machine failure. A label that indicates whether the machine has failed or not
Failure type (prediction label). The label contains five failure types: tool wear failure (TWF), heat dissipation failure (HDF), power failure (PWF), overstrain failure (OSF), random failures (RNF)

Source: UCL ML Repository

You can download the dataset from Kaggle.com. Unzip the file predictive_maintenance.csv and save it under the following file path: “/data/iot/classification/”

Step #1 Load the Data

We begin by importing the required libraries. This also includes the XGBoost library, which is a popular library for training gradient-boosting models. In addition, we will load the dataset using the pandas library. Then we define our target variable as Failure Type. The dataset contains a second target column, which only contains the binary information of machine failures. We will drop this column, as our goal is to predict the specific type of failure. Then we print the first three rows of the loaded dataset.

# A tutorial for this file is available at www.relataly.com
# Tested with Python 3.9.13, Matplotlib 3.6.2, Scikit-learn 1.2, Seaborn 0.12.1, numpy 1.21.5, xgboost 1.7.2

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns
import plotly.express as px
sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support as score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.utils import compute_sample_weight
from xgboost import XGBClassifier

# load the train data
path = '/data/iot/classification/'
df = pd.read_csv(path + "predictive_maintenance.csv") 

# define the target
target_name='Failure Type'

# drop a redundant columns
df.drop(columns=['Target'], inplace=True)

# print a summary of the train data
print(df.shape[0])
df.head(3)

	UDI	Product ID	Type	Air temperature [K]	Process temperature [K]	Rotational speed [rpm]	Torque [Nm]	Tool wear [min]	Failure Type
0	1	M14860		M		298.1				308.6				1551						42.8		0				No Failure
1	2	L47181		L		298.2				308.7				1408						46.3		3				No Failure
2	3	L47182		L		298.1				308.5				1498						49.4		5				No Failure

Step #2 Clean the Data

Next, we quickly check the data quality of our dataset. The following code block checks if there are any missing values in our dataset. If there are missing values, it creates a barplot showing the number of missing values for each column, along with the percentage of missing values. If there are no missing values, it prints a message saying “no missing values.”

The function then drops any columns with more than 5% missing values from the DataFrame. Finally, it prints the names of the remaining columns in the DataFrame. This function can be used to identify and handle missing values in a dataset before applying machine learning algorithms to it.

# check for missing values
def print_missing_values(df):
    null_df = pd.DataFrame(df.isna().sum(), columns=['null_values']).sort_values(['null_values'], ascending=False)
    fig = plt.subplots(figsize=(16, 6))
    ax = sns.barplot(data=null_df, x='null_values', y=null_df.index, color='royalblue')
    pct_values = [' {:g}'.format(elm) + ' ({:.1%})'.format(elm/len(df)) for elm in list(null_df['null_values'])]
    ax.set_title('Overview of missing values')
    ax.bar_label(container=ax.containers[0], labels=pct_values, size=12)

if df.isna().sum().sum() > 0:
    print_missing_values(df)
else:
    print('no missing values')

# drop all columns with more than 5% missing values
for col_name in df.columns:
    if df[col_name].isna().sum()/df.shape[0] > 0.05:
        df.drop(columns=[col_name], inplace=True) 

df.columns

no missing values
Index(['UDI', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Failure Type'],
      dtype='object')

Next, we will drop two unnecessary columns and rename the remaining ones to make them easier to work with. The original column names are quite long and contain special characters that could cause errors during the training process. Once the columns are renamed, we will print the updated DataFrame to verify the changes.

# drop id columns
df_base = df.drop(columns=['Product ID', 'UDI'])

# adjust column names
df_base.rename(columns={'Air temperature [K]': 'air_temperature', 
                        'Process temperature [K]': 'process_temperature', 
                        'Rotational speed [rpm]':'rotational_speed', 
                        'Torque [Nm]': 'torque', 
                        'Tool wear [min]': 'tool_wear'}, inplace=True)
df_base.head()

	Type	air_temperature	process_temperature	rotational_speed	torque	tool_wear	Failure Type
0	M		298.1			308.6				1551				42.8	0			No Failure
1	L		298.2			308.7				1408				46.3	3			No Failure
2	L		298.1			308.5				1498				49.4	5			No Failure
3	L		298.2			308.6				1433				39.5	7			No Failure
4	L		298.2			308.7				1408				40.0	9			No Failure

Everything looks as expected: Our dataset contains six features and the target column with the five failure types.

Step #3 Explore the Data

Next, let’s explore the dataset.

Target Class Distribution

The following code uses the plotly express library to create a histogram showing the class distribution of the “Failure Type” column in a DataFrame called “df_base.” The histogram will have one bar for each unique value in the “Failure Type” column, and the height of each bar will represent the number of occurrences of that value in the column. This can be useful for understanding the imbalance in the distribution of classes in a classification problem.

# display class distribution of the target variable
px.histogram(df_base, y="Failure Type", color="Failure Type")

Our dataset is highly imbalanced, with the vast majority of cases having a “No Failure” label. If the dataset is highly imbalanced, with a disproportionate number of cases in one class compared to the others, it can impact the performance of machine learning models. This is because imbalanced datasets can lead to models that are biased towards the majority class, and may not perform well on the minority class. In order to improve model performance on imbalanced datasets, we will later adjust the model hyperparameters accordingly.

Feature Pairplots

Next, let’s construct pair plots to explore feature relations with the target variable. Pair plots, also known as scatter plots, are a type of plot that shows the relationship between two variables. In the context of a predictive maintenance dataset, pair plots can be useful for exploring the relationships between different features and the target variable (e.g., the likelihood of a machine failure). By creating pair plots and visualizing the relationships between different features and the target variable, you can gain insights into which features might be most useful for building a predictive model.

# pairplots on failure type
sns.pairplot(df_base, height=2.5, hue='Failure Type')

The pair plots reveal valuable patterns in our features that can inform the predictions of our model. For instance, we see that Power Failures tend to be correlated with torque values that are either close to the maximum or minimum. Such patterns should allow our predictive model to make solid predictions.

Feature Correlation

Next, we will look at feature correlation. The following code block creates a heatmap using the seaborn library that shows the correlation between all pairs of columns in a DataFrame called “df_base”. The heatmap is plotted using a color scale, with warmer colors indicating stronger correlations and cooler colors indicating weaker correlations. The correlation values are also displayed in the cells of the heatmap, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). By creating a heatmap, you can quickly see which variables are positively or negatively correlated with each other, and to what degree. This can be helpful for identifying which features might be most useful for building a predictive model.

# correlation plot
plt.figure(figsize=(6,4))
sns.heatmap(df_base.corr(), cbar=True, fmt='.1f', vmax=0.8, annot=True, cmap='Blues')

From the table, it looks like there is a strong positive correlation between “air_temperature” and “process_temperature” (0.87). This makes sense since a high process temperature will naturally also heat up the air around the machine. In addition, there is a strong negative correlation between “rotational_speed” and “torque” (-0.87). The other correlations are weaker and closer to 0, indicating weaker relationships.

Understanding the correlations between different variables in a dataset can be helpful for building predictive models, as it can give you an idea of which features might be most important for predicting a given target. It can also help you identify any redundant features that might not add much value to your model. Since our dataset only contains six features, we will keep all of them.

Feature Boxplots

Box plots are a useful visualization tool for understanding the distribution of values in a dataset. They show the minimum, first quartile, median, third quartile, and maximum values for each group, as well as any outliers. By creating box plots separated by a categorical variable, you can compare the distributions of values between different groups and see if there are any significant differences. This can be useful for identifying trends or patterns in the data that might be useful for building a predictive model.

If there are significant differences between the boxplots for different categories, it could be a good sign for building a predictive model. For example, if the boxplots for one category tend to have higher values for a particular feature than the boxplots for another category, it could indicate that the feature is related to the target variable and could be useful for making predictions.

# create histograms for feature columns separated by target column
def create_histogram(column_name):
    plt.figure(figsize=(16,6))
    return px.box(data_frame=df_base, y=column_name, color='Failure Type', points="all", width=1200)

create_histogram('air_temperature')

Feature boxplot for process_temperature.

create_histogram('process_temperature')

Feature boxplot for rotational speed.

create_histogram('rotational_speed')

Feature boxplot for torque.

create_histogram('torque')

Feature boxplot for tool wear.

create_histogram('tool_wear')

Now that we have a good understanding of our dataset, we can prepare the data for model training.

Step #4 Data Preparation

To prepare the data for model training, we will need to split our dataset and make additional modifications.

The following code block contains a reusable function called data_preparation. The purpose of this function is to prepare the data in a way that is suitable for building and evaluating machine learning models. It performs several preprocessing steps, such as encoding categorical variables and splitting the data into training and test sets.

def data_preparation(df_base, target_name):
    df = df_base.dropna()

    df['target_name_encoded'] = df[target_name].replace({'No Failure': 0, 'Power Failure': 1, 'Tool Wear Failure': 2, 'Overstrain Failure': 3, 'Random Failures': 4, 'Heat Dissipation Failure': 5})
    df['Type'].replace({'L': 0, 'M': 1, 'H': 2}, inplace=True)
    X = df.drop(columns=[target_name, 'target_name_encoded'])
    y = df['target_name_encoded'] #Prediction label

    # split the data into x_train and y_train data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

    # print the shapes: the result is: (rows, training_sequence, features) (prediction value, )
    print('train: ', X_train.shape, y_train.shape)
    print('test: ', X_test.shape, y_test.shape)
    return X, y, X_train, X_test, y_train, y_test

# remove target from training data
X, y, X_train, X_test, y_train, y_test = data_preparation(df_base, target_name)

Step #5 Model Training

Now that we have prepared the dataset, we can train the XGBoost classification model. The basic idea behind XGBoost is to train a series of weak models, such as decision trees, and then combine their predictions using gradient boosting. During training, XGBoost uses an optimization algorithm to adjust the weight of each model in the ensemble in order to improve the overall prediction accuracy. XGBoost also includes a number of additional features and techniques that help to improve the performance of the model, such as regularization, feature selection, and handling missing values.

XGboost provides several configuration options that we can use to finetune performance and adjust the training process to our dataset. For a complete list of hyperparameters, please see the library documentation.

Remember that our class labels are imbalanced. Therefore, we will provide the model with sample weights. The following code creates a weight array for the training and test sets using the “compute_sample_weight” function from scikit-learn. We calculate the weight array based on the “balanced” mode. This means that the weights are calculated such that the class distribution in the sample is balanced. This can be useful when working with imbalanced datasets, as it helps to mitigate the effects of class imbalance on the model.

weight_train = compute_sample_weight('balanced', y_train)
weight_test = compute_sample_weight('balanced', y_test)

xgb_clf = XGBClassifier(booster='gbtree', 
                        tree_method='gpu_hist', 
                        sampling_method='gradient_based', 
                        eval_metric='aucpr', 
                        objective='multi:softmax', 
                        num_class=6)
# fit the model to the data
xgb_clf.fit(X_train, y_train.ravel(), sample_weight=weight_train)

We can see that the blue box summarizes the configuration of our model and indicates that the training process has been successful. Now that we have the classifier, we can use it to make predictions on new data.

Step #6 Model Evaluation

Finally, we will evaluate the model’s performance. This will involve three steps:

Model scoring
Cross-validation
Confusion matrix

Model Scoring

First, we calculate the accuracy of the classifier on the test set using the “score” method. To account for the imbalance of class labels, we pass in the weight array for the test set as an additional parameter. This returns the fraction of correct predictions made by the classifier. Next, the code uses the classifier to make predictions on the test set using the “predict” method. It then generates a classification report using the “classification_report” function from scikit-learn. The report displays a summary of the model’s performance in terms of various evaluation metrics such as precision, recall, and f1-score.

# score the model with the test dataset
score = xgb_clf.score(X_test, y_test.ravel(), sample_weight=weight_test)

# predict on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a classification report
results_log = classification_report(y_test, y_pred)
print(results_log)

precision    recall  f1-score   support

           0       0.99      0.98      0.99      2903
           1       0.64      0.88      0.74        24
           2       0.04      0.08      0.06        12
           3       0.77      0.89      0.83        27
           4       0.00      0.00      0.00         4
           5       0.76      0.97      0.85        30

    accuracy                           0.98      3000
   macro avg       0.53      0.63      0.58      3000
weighted avg       0.98      0.98      0.98      3000

The classification report shows the performance of our XGBoost classifier on the test dataset. The model appears to perform well, with a high accuracy of 0.98 and a high weighted average f1-score of 0.98.

However, there are a few classes where the model’s performance is not as strong. Class 1 has a relatively low precision of 0.64 and a low f1-score of 0.74, while class 2 has a very low precision of 0.04 and a low f1-score of 0.06. Class 4 has a precision and f1-score of 0.00, which suggests that the model is not making any correct predictions for this class.

It is also worth noting that the support for some classes is much lower than for others. Class 1 has a support of 24, while class 0 has a support of 2903. This is due to the fact that there are relatively few instances of class 1 in the test dataset compared to class 0, which affects the model’s performance on class 1.

Confusion Matrix

Next, we create a confusion matrix. We input the true labels of the test set (y_test) and the predicted labels produced by the model (y_pred) to generate the matrix. The matrix shows us the number of correct and incorrect predictions made by the model for each class.

We then create a DataFrame from the confusion matrix and use the seaborn library to visualize the matrix as a heatmap. The heatmap allows us to easily see which classes are being predicted correctly and which are being misclassified.

# create predictions on the test dataset
y_pred = xgb_clf.predict(X_test)

# print a multi-Class Confusion Matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cnf_matrix, columns=np.unique(y_test), index=np.unique(y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
plt.figure(figsize = (8, 5))
sns.set(font_scale=1.1) #for label size
sns.heatmap(df_cm, cbar=True, cmap= "inferno", annot=True, fmt='.0f')

The color scale of the heatmap indicates the magnitude of the values in the matrix. In this case, the darker the color, the higher the number of predictions. This visualization helps us to understand the performance of the model and identify areas for improvement.

Here are a few things that we can learn from this matrix:

The model made a total of 2902 correct predictions and 67 incorrect predictions.
For the “No Failure” class, the model made 2854 correct predictions and 29 incorrect predictions. The majority of the incorrect predictions were false negatives.
For the “Power Failure” class, the model made 21 correct predictions and three incorrect predictions.
For the “Tool Wear Failure” class, the model made 1 correct prediction and 1 incorrect prediction.
For the “Overstrain Failure” class, the model made 24 correct predictions and 2 incorrect predictions.
For the “Random Failures” class, the model made 29 correct predictions and 4 incorrect predictions.
For the “Heat Dissipation Failure” class, the model made 29 correct predictions and 1 incorrect prediction.

Overall, the model seems to be performing relatively well, but it is making a lot of false negatives for some classes.

Cross Validation

Finally, we perform cross-validation on the training set using the “cross_validate” function from scikit-learn. Cross-validation is a technique for evaluating the performance of a machine learning model by training it on different subsets of the data and evaluating it on the remaining data.

In this case, we will train and evaluate our model 10 times using different splits of the data (specified by the “cv” parameter). We also specify that the evaluation metric should be the weighted f1-score (specified by the “scoring” parameter). We then pass the weight array for the training set to the classifier.

The “cross_validate” function returns a dictionary containing various evaluation metrics for each fold of the cross-validation. We will convert the dictionary to a DataFrame and create a bar plot using the plotly express library to visualize the results. This helps us to understand the consistency and stability of the model’s performance.

# cross validation
scores  = cross_validate(xgb_clf, X_train, y_train, cv=10, scoring="f1_weighted", fit_params={ "sample_weight" :weight_train})
scores_df = pd.DataFrame(scores)
px.bar(x=scores_df.index, y=scores_df.test_score, width=800)

The model performance remains consistent across all folds.

Summary

In this article, we have presented the concept of predictive maintenance and demonstrated how organizations can use this approach to improve their maintenance cycles. The second part of the article provided a hands-on tutorial showing how to implement a predictive maintenance solution for predicting different failure types of a milling machine. We trained a classification model using the XGBoost algorithm and sensor data from the machine.

While the model demonstrated good performance overall, we observed that it was not able to predict all classes with the same level of accuracy. This suggests that there may be opportunities to improve the model’s performance. One potential approach is to balance the dataset by up or down-sampling the data to achieve a more even distribution of classes. By doing so, we can mitigate the effects of class imbalance and potentially improve the model’s predictions for all classes.

By implementing such a predictive maintenance approach, organizations can improve their operational efficiency and ensure the smooth running of their machinery.

I hope this article was helpful. If you have any questions or feedback, let me know in the comments.

Predictive maintenance also plays an essential role in a smart factory. Image created with Midjourney.

Sources and Further Reading

There are many books available on the topics of IoT and predictive maintenance. Here are a few recommendations:

An Introduction to Predictive Maintenance by R Keith Mobley
Predictive Analytics: The Secret to Predicting Future Events Using Big Data and Data Science Techniques Such as Data Mining, Predictive Modelling, Statistics, Data Analysis, and Machine by Richard Hurley
Stephan Matzka, Explainable Artificial Intelligence for Predictive Maintenance Applications, Third International Conference on Artificial Intelligence for Industries (AI4I 2020)
David Forsyth (2019) Applied Machine Learning Springer
ChatGPT was used to revise certain parts of this article
Images created using Midjourney and OpenAI Dall-E

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

The post Predictive Maintenance: Predicting Machine Failure using Sensor Data with XGBoost and Python appeared first on relataly.com.

How to Use Hierarchical Clustering For Customer Segmentation in Python

Florian Follonier — Thu, 22 Dec 2022 18:50:14 +0000

Have you ever found yourself wondering how you can better understand your customer base and target your marketing efforts more effectively? One solution is to use hierarchical clustering, a method of grouping customers into clusters based on their characteristics and behaviors. By dividing your customers into distinct groups, you can tailor your marketing campaigns and personalize your marketing efforts to meet the specific needs of each group. This can be especially useful for businesses with large customer bases, as it allows them to target their marketing efforts to specific segments rather than trying to appeal to everyone at once. Additionally, hierarchical clustering can help businesses identify common patterns and trends among their customers, which can be useful for targeting future marketing efforts and improving the overall customer experience. In this tutorial, we will use Python and the scikit-learn library to apply hierarchical (agglomerative) clustering to a dataset of customer data.

The rest of this tutorial proceeds in two parts. The first part will discuss hierarchical clustering and how we can use it to identify clusters in a set of customer data. The second part is a hands-on Python tutorial. We will explore customer health insurance data and apply an agglomerative clustering approach to group the customers into meaningful segments. Finally, we will use a tree-like diagram called a dendrogram, which is helpful for visualizing the structure of the data. The resulting segments could inform our marketing strategies and help us better understand our customers. So let’s get started!

Customer segmentation is a typical use case for clustering. Image generated with Midjourney.

What is Hierarchical Clustering?

So what is hierarchical clustering? Hierarchical clustering is a method of cluster analysis that aims to build a hierarchy of clusters. It creates a tree-like diagram called a dendrogram, which shows the relationships between clusters. There are two main types of hierarchical clustering: agglomerative and divisive.

Agglomerative hierarchical clustering: This is a bottom-up approach in which each data point is treated as a single cluster at the outset. The algorithm iteratively merges the most similar pairs of clusters until all data points are in a single cluster.
Divisive hierarchical clustering: This is a top-down approach in which all data points are treated as a single cluster at the outset. The algorithm iteratively splits the cluster into smaller and smaller subclusters until each data point is in its own cluster.

Agglomerative Clustering

In this article, we will apply the agglomerative clustering approach, which is a bottom-up approach to clustering. The idea is to initially treat each data point in a dataset as its own cluster and then combine the points with other clusters as the algorithm progresses. The process of agglomerative clustering can be broken down into the following steps:

Start with each data point in its own cluster.
Calculate the similarity between all pairs of clusters.
Merge the two most similar clusters.
Repeat steps 2 and 3 until all the data points are in a single cluster or until a predetermined number of clusters is reached.

There are several ways to calculate the similarity between clusters, including using measures such as the Euclidean distance, cosine similarity, or the Jaccard index. The specific measure used can impact the results of the clustering algorithm.

For details on how the clustering approach works, see the Wikipedia page.

Hierarchical clustering is an unsupversied way to classify things.

" data-image-caption="

Hierarchical clustering is an unsupversied way to classify things.

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png" src="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min-430x512.png" alt="Hierarchical clustering is an unsupversied way to classify things. " class="wp-image-13027" srcset="https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 430w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 252w, https://www.relataly.com/wp-content/uploads/2023/03/mushrooms_and_fruits_pattern-min.png 506w" sizes="(max-width: 430px) 100vw, 430px" />

Hierarchical clustering is an unsupervised technique to classify things based on patterns in their data. Image created with Midjourney.

Hierarchical Clustering vs. K-means

In a previous article, we have already discussed the popular clustering approach k-means. So how are k-means and hierarchical clustering different? Hierarchical clustering and k-means are both clustering algorithms that can be used to group similar data points together. However, there are several key differences between these two approaches:

The number of clusters: In k-means, the number of clusters must be specified in advance, whereas in hierarchical clustering, the number of clusters is not specified. Instead, hierarchical clustering creates a hierarchy of clusters, starting with each data point as its own cluster and then merging the most similar clusters until all data points are in a single cluster.
Cluster shape: K-means produces clusters that are spherical, while hierarchical clustering produces clusters that can have any shape. This means that k-means is better suited for data that is well-separated into distinct, spherical clusters, while hierarchical clustering is more flexible and can handle more complex cluster shapes.
Distance measure: K-means uses a distance measure, such as the Euclidean distance, to calculate the similarity between data points, while hierarchical clustering can use a variety of distance measures. This means that k-means is more sensitive to the scale of the features, while hierarchical clustering is less sensitive to the feature scale.
Computational complexity: K-means is generally faster than hierarchical clustering, especially for large datasets. This is because k-means only requires a single pass through the data to assign data points to clusters, while hierarchical clustering requires multiple passes to merge clusters.
Visualization: Hierarchical clustering produces a tree-like diagram called a “dendrogram.” The dendrogram shows the relationships between clusters. This can be useful for visualizing the structure of the data and understanding how clusters are related.

Next, let’s look at how we can implement a hierarchical clustering model in Python.

Customer Segmentation using Hierarchical Clustering in Python

In this comprehensive guide, we explore the application of hierarchical clustering for effective customer segmentation using a customer dataset. This data-driven segmentation method enables businesses to identify distinct customer clusters based on various factors, including demographics, behaviors, and preferences.

Customer segmentation is a strategic approach that splits a customer base into smaller, more manageable groups with similar characteristics. It aims to better understand the diverse needs and wants of different customer segments to enhance marketing strategies and product development.

Applying customer segmentation through hierarchical clustering allows businesses to personalize their marketing messages, design targeted campaigns, and tailor products to meet the unique needs of each segment. This proactive approach can stimulate increased customer loyalty and sales.

We begin by loading the customer data and selecting the relevant features we want to use for clustering. We then standardize the data using the StandardScaler from scikit-learn. Next, we apply hierarchical clustering using the AgglomerativeClustering method, specifying the number of clusters we want to create. Finally, we add the predictions to the original data as a new column and view the resulting segments by calculating the mean of each feature for each segment.

The code is available on the GitHub repository.

View on GitHub Relataly GitHub Repo

The future of healthcare will see a tight collaboration between humans and AI. Image generated using Midjourney

About the Customer Health Insurance Dataset

In this tutorial, we will work with a public dataset on health_insurance_customer_data from kaggle.com. Download the CSV file from Kaggle and copy it into the following path, starting from the folder with your python notebook: data/customer/

The dataset is relatively simple and contains 1338 rows of insured customers. It includes the insurance charges, as well as demographic and personal information such as Age, Sex, BMI, Number of Children, Smoker, and Region. The dataset does not have any undefined or missing values.

Prerequisites

Before we start the coding part, ensure that you have set up your Python 3 environment and the required packages. If you don’t have an environment, follow this tutorial to set up the Anaconda environment. Also, make sure you install all required packages. In this tutorial, we will be working with the following standard packages:

pandas
NumPy
matplotlib
scikit-learn

You can install packages using console commands:

pip install  
conda install  (if you are using the anaconda packet manager)

Step #1 Load the Data

To begin, we need to load the required packages and the data we want to cluster. We will load the data by reading the CSV file via the pandas library.

# import necessary libraries
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from pandas.api.types import is_string_dtype
import pandas as pd
import math
import seaborn as sns

# load customer data
customer_df = pd.read_csv("data/customer/customer_health_insurance.csv")
customer_df.head(3)

	age	sex		bmi		children	smoker	region		charges
0	19	female	27.90	0			yes		southwest	16884.9240
1	18	male	33.77	1			no		southeast	1725.5523
2	28	male	33.00	3			no		southeast	4449.4620

Step #2 Explore the Data

Next, it is a good idea to explore the data and get a sense of its structure and content. This can be done using a variety of methods, such as examining the shape of the dataframe, checking for missing values, and plotting some basic statistics. For example, the following plots will explore the relationships between some of the variables. We won’t go into too much detail here.

def make_kdeplot(df, column_name, target_name):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.kdeplot(data=df, hue=column_name, x=target_name, ax = ax, linewidth=2,)
    ax.tick_params(axis="x", rotation=90, labelsize=10, length=0)
    ax.set_title(column_name)
    ax.set_xlim(0, df[target_name].quantile(0.99))
    plt.show()

# make kde plot for ext_color 
make_kdeplot(customer_df, 'smoker', 'charges')

# make kde plot for ext_color 
make_kdeplot(customer_df, 'sex', 'charges')

sns.lmplot(x="charges", y="age", hue="smoker", data=customer_df, aspect=2)
plt.show()

def make_boxplot(customer_df, x,y,h):
    fig, ax = plt.subplots(figsize=(10,4))
    box = sns.boxplot(x=x, y=y, hue=h, data=customer_df)
    box.set_xticklabels(box.get_xticklabels())
    fig.subplots_adjust(bottom=0.2)
    plt.tight_layout()

make_boxplot(customer_df, "smoker", "charges", "sex")

make_boxplot(customer_df, "region", "charges", "sex")

make_boxplot(customer_df, "children", "bmi", "sex")

Next, let’s prepare the data for model training.

Step #3 Prepare the Data

Before we can train a model on the data, we must prepare it for modeling. This typically involves selecting the relevant features, handling missing values, and scaling the data. However, we are using a very simple dataset that already has good data quality. Therefore we can limit our data preparation activities to encoding the labels and scaling the data.

To encode the categorical values, we will use label encoder from the scikit-learn library.

# encode categorical features
label_encoder = LabelEncoder()

for col_name in customer_df.columns:
    if (is_string_dtype(customer_df[col_name])):
        customer_df[col_name] = label_encoder.fit_transform(customer_df[col_name])
customer_df.head(3)

Next, we will scale the numeric variables. While scaling the data is an essential preprocessing step for many machine learning algorithms to work effectively, it is generally not necessary for hierarchical clustering. This is because hierarchical clustering is not sensitive to the scale of the features. However, when you use certain distance measures, such as Euclidean distance, scaling the data might still be useful when performing hierarchical clustering. Scaling the data can help to ensure that all of the features are given equal weight. This can be useful if you want to avoid giving more weight to features with larger scales.

# select features
X = customer_df # we will select all features

# standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled.head(3)

array([[-1.43876426, -1.0105187 , -0.45332   , ...,  1.34390459,
         0.2985838 ,  1.97058663],
       [-1.50996545,  0.98959079,  0.5096211 , ...,  0.43849455,
        -0.95368917, -0.5074631 ],
       [-0.79795355,  0.98959079,  0.38330685, ...,  0.43849455,
        -0.72867467, -0.5074631 ],
       ...,
       [-1.50996545, -1.0105187 ,  1.0148781 , ...,  0.43849455,
        -0.96159623, -0.5074631 ],
       [-1.29636188, -1.0105187 , -0.79781341, ...,  1.34390459,
        -0.93036151, -0.5074631 ],
       [ 1.55168573, -1.0105187 , -0.26138796, ..., -0.46691549,
         1.31105347,  1.97058663]])

Step #4 Train the Hierarchical Clustering Algorithm

To train a hierarchical clustering model using scikit-learn, we can use the AgglomerativeClustering or Ward class. The main parameters for these classes are:

n_clusters: The number of clusters to form. This parameter is required for AgglomerativeClustering but is not used for Ward.
affinity: The distance measure used to calculate the similarity between pairs of samples. This can be any of the distance measures implemented in scikit-learn, such as the Euclidean distance or the cosine similarity.
linkage: The method used to calculate the distance between clusters. This can be one of “ward,” “complete,” “average,” or “single.”
distance_threshold: The maximum distance between two clusters that allows them to be merged. This parameter is only used in the AgglomerativeClustering class.

To train the model, we specify the desired parameters and fit the model to the data using the fit_predict method. This method will fit the model to the data and generate predictions in one step.

# apply hierarchical clustering 
model = AgglomerativeClustering(affinity='euclidean')
predicted_segments = model.fit_predict(X_scaled)

Now we have a trained clustering model also predicted the segments for our data.

Step #5 Visualize the Results

After the model is trained, we can visualize the results to get a better understanding of the clusters that were formed. There is a wide range of plots and tools to visualize clusters. In this tutorial, we will use a scatterplot and a dendrogram.

5.1 Scatterplot

For this, we can use the lmplot function in Seaborn. The lmplot creates a 2D scatterplot with an optional overlay of a linear regression model. The plot visualizes the relationship between two variables and fits a linear regression model to the data that can highlight differences. In the following, we use this linear regression model to highlight the differences between our two cluster segments and the age of the customers.

# add predictions to data as a new column
customer_df['segment'] = predicted_segments

# create a scatter plot of the first two features, colored by segment
sns.lmplot(x="charges", y="age", hue="segment", data=customer_df, aspect=2)
plt.show()

We can see that our model has determined two clusters in our data. The clusters seem to correspond well with the smoker category, which indicates that this attribute is decisive in forming relevant groups.

5.2 Dendrogram

The hierarchical clustering approach lets us visualize relationships between different groups in our dataset in a dendrogram. A dendrogram is a graphical representation of a hierarchical structure, such as the relationships between different groups of objects or organisms. It is typically used in biology to show the relationships between different species or taxonomic groups, but it can also be used in other fields to represent the hierarchical structure of any set of data. In a dendrogram, the objects or groups being studied are represented as branches on a tree-like diagram. The branches are usually labeled with the names of the objects or groups, and the lengths of the branches represent the distances or dissimilarities between the objects or groups. The branches are also arranged in a hierarchical manner, with the most closely related objects or groups being placed closer together and the more distantly related ones being placed farther apart.

# Visualize data similarity in a dendogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, orientation='right',**kwargs)


plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(cluster_model, truncate_mode="level", p=4)
plt.xlabel("Euclidean Distance")
plt.ylabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

Source: This code block is based on code from the scikit-learn page

Summary

In conclusion, hierarchical clustering is a powerful tool for customer segmentation that can help businesses better understand their customer base and target their marketing efforts more effectively. By grouping customers into clusters based on their characteristics and behaviors, companies can create targeted campaigns and personalize their marketing efforts to better meet the needs of each group. Using Python and the scikit-learn library, we were able to apply an agglomerative clustering approach to a dataset of customer data and identify two distinct segments. We can then use these segments to inform our marketing strategies and get a better understanding of our customers.

By the way, customer segmentation is an area where real-world data can be prone to bias and unfairness. If you’re concerned about this, check out our latest article on addressing fairness in machine learning with fairlearn.

I hope this article was useful. If you have any feedback, please write your thoughts in the comments.

Sources and Further Reading

Articles

https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html
Images generated with OpenAI Dall-E and Midjourney.

Books on Clustering

“Data Clustering: Algorithms and Applications” by Charu C. Aggarwal: This book covers a wide range of clustering algorithms, including hierarchical clustering, and discusses their applications in various fields.
“Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank: This book is a comprehensive introduction to data mining and machine learning, including a chapter on hierarchical clustering.

Books on Machine Learning

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Relataly articles on clustering and machine learning

Simple Clustering using K-means in Python: This article gives an overview of cluster analysis with k-means.
Clustering crypto markets using affinity propagation in Python: This article applies cluster analysis to crypto markets and creates a market map for various cryptocurrencies.
Addressing fairness in machine learning with the fairlearn library

The post How to Use Hierarchical Clustering For Customer Segmentation in Python appeared first on relataly.com.

Univariate Stock Market Forecasting using Facebook Prophet in Python

Florian Follonier — Thu, 15 Dec 2022 22:54:34 +0000

Have you ever wondered how Facebook predicts the future? Meet Facebook Prophet, the open-source time series forecasting tool developed by Facebook’s Core Data Science team. Built on top of the PyStan library, Facebook Prophet offers a simple and intuitive interface for creating forecasts using historical data. What sets Facebook Prophet apart is its highly modular design, allowing for a range of customizable components that can be combined to create a wide variety of forecasting models. This makes it perfect for modeling data with strong seasonal effects, like daily or weekly patterns, and it can handle missing data and outliers with ease. In this tutorial, we will take a closer look at the capabilities of Facebook Prophet and see how it can be used to make accurate predictions.

We begin with a brief discussion of how the Facebook Prophet decomposes a time series into different components. Then we turn to the hands-on part. you can use its model in Python to generate a stock market forecast. We will train our Facebook Prophet model using the historical price of the Coca-Cola stock. We will also cover different options to customize the model settings.

Disclaimer: This article does not constitute financial advice. Stock markets can be very volatile and are generally difficult to predict. Predictive models and other forms of analytics applied in this article only illustrate machine learning use cases.

Facebook Prophet – an open-source tool for time series forecasting

What is Facebook Prophet?

Facebook Prophet is a tool that can be used to make predictions about future events based on historical data. It was developed by Taylor and Letham, 2017, who later made it available as an open-source project. The authors developed Facebook Prophet to solve various business forecasting problems without requiring much prior knowledge. In this way, the framework addresses a significant problem many companies face today. They have various prediction problems (e.g., capacity and demand forecasting) but face a skill gap when it comes to generating reliable forecasts with techniques such as ARIMA or neural networks. Compared to that, Facebook Prophet requires minimal fine-tuning and can deal with various challenges, including seasonality, outliers, and changing trend lines. This allows Facebook Prophet to handle a wide range of forecasting problems flexibly. Before we dive into the hands-on part, let’s gain a quick overview of how Facebook Prophet works.

Also: Stock Market Prediction using Multivariate Time Series

time series forecasting with facebook prophet python tutorial

" data-image-caption="

time series forecasting with facebook prophet python tutorial

" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png" src="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball-1024x1024.png" alt="time series forecasting with facebook prophet python tutorial" class="wp-image-12356" srcset="https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 1024w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 140w, https://www.relataly.com/wp-content/uploads/2023/02/an_ancient_prophet_looking_into_a_crystal_ball.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" />

Time-series forecasting with Facebook Prophet. Image generated with Midjourney.

How Facebook Prophet Works

Facebook Prophet uses a technique called additive regression to model time series data. This involves breaking the time series into a series of components:

Trends
Seasonality
Holiday

Traditional time series methods such as (S)ARIMA base their prediction on a model that weights the linear sum of past observations or lags. Facebook’s Prophet is similar in that it uses a decreasing weight for past observations. This means current observations have a higher significance for the model than those that date back a long time. It then models each component separately using a combination of linear and non-linear functions. Finally, Facebook Prophet combines these components to form the complete forecast model. Let’s take a closer look at these components and how Facebook Prophet handles them.

A) Dealing with Trends

Time series often have a trendline. However, even more often, a time series will not follow a single trend, but it has several trend components that are separated by breakpoints. Facebook Prophet tries to handle these trends in several ways. First, the model tries to identify the breakpoints (knots) in a time series that divide different periods. Each breakpoint separates two periods with different trendlines. Facebook Prophet then uses these inflection points between periods to fit the model to the data and create the forecast. In addition, trendlines do not have to be linear but can also be logarithmic. This is all done automatically, but it is also possible to specify breakpoints manually.

B) Seasonality

Facebook Prophet works very well when the data shows a strong seasonal pattern. It uses Fourier transformations (adding different sine and cosine frequencies) to account for daily, weekly and yearly seasonality. The Facebook Prophet model is flexible on the type of data you have by allowing you to adjust the seasonal components of your data. By default, Facebook Prophet assumes daily data with weekly and yearly seasonal effects. If your data differentiates from this standard, for example, you have weekly data with monthly seasonality, then you need to adjust the number of terms accordingly.

C) Holiday

Every year, public holidays can lead to strong deviations in a time series; for example, thinking of computing power, demand more people will visit the Facebook website. The Facebook Prophet model also accounts for such special events by allowing us to specify binary indicators that mark whether a certain day is a public holiday. If you have other non-holiday events that occur yearly, you can use this indicator for the same purpose. Usually, Facebook Prophet will automatically remove outliers from the data. But if an outlier occurs on a day highlighted as a public holiday, Facebook Prophet will adjust its model accordingly.

Hyperparameter Tuning and Customization

Facebook Prophet includes additional optimization techniques, such as Bayesian optimization, to automatically tune the model’s hyperparameters, such as the length of the seasonal period, to improve its accuracy. Once the model is trained, it can be used to predict future values in the time series. However, users with a strong domain knowledge may prefer to tweak these parameters themselves, and Facebook Prophet provides several functions for this purpose. It also includes a range of tools for model evaluation and diagnostics, as well as for visualizing the model and the input data.

Also: Using Random Search to Tune the Hyperparameters of a Random Decision Forest with Python

Application Domains

Facebook Prophet is a powerful forecasting tool that has been specifically designed to make forecasting easy. As mentioned, Prophet is easy to use and can flexibly handle various forecasting problems. In addition, it requires very little preprocessing to generate accurate forecasts. As a result of these advantages, Facebook Prophet has been adopted by various application domains. Some possible application domains for Facebook Prophet include:

Sales forecasting: Facebook Prophet can be used to predict future sales of a product or service, based on historical sales data. This can be useful for businesses to plan their inventory and staffing, and to make informed decisions about future investments and growth.
Financial forecasting: Facebook Prophet can be used to predict future stock prices, currency exchange rates, or other financial metrics. This can be useful for investors and financial analysts to make informed decisions about the market.
Traffic forecasting: Facebook Prophet can be used to predict future traffic on a website or mobile app based on historical data. This can be useful for businesses to plan for capacity and optimize their servers and infrastructure.
Energy consumption forecasting: Facebook Prophet can be used to predict future energy consumption based on historical data. This can be useful for utilities and energy companies to plan for demand and optimize their generation and distribution.

When to Use Facebook Prophet?

Although Facebook Prophet is applicable in any domain where time series data is available, it is most effective when certain conditions are met. These include univariate time series data with prominent seasonal effects and an extensive historical record spanning multiple seasons. Facebook Prophet is especially beneficial when dealing with large quantities of historical data that require efficient analysis and quick, accurate predictions of future trends.

Also: Rolling Time Series Forecasting: Creating a Multi-Step Prediction

Using Facebook Prophet to Forecast the Coca-Cola Stock Price in Python

In this hands-on tutorial, we’ll use Facebook Prophet and Python to create a forecast for the Coca-Cola stock price. We have chosen Coca-Cola as an example because the Coca-Cola share is known to be a cyclical stock. As such, its chart reflects a seasonal pattern, different periods, and varying trend lines. We train our model on historical price data and then predict the next data points half-year in advance. In addition, we will discuss how we could finetune our model to improve the accuracy of the predictions further. This involves the following steps:

Collect historical stock data for CocaCola and familiarize ourselves with the data.
Use Facebook Prophet to fit a model to the data.
Use the model to make predictions about the future stock price of Coca-Cola.
Visualize model components and predictions.
Manually adjust the model to improve the model fit.

By following these steps, we will try to gain insights into the future performance of Coca-Cola stock. Let’s get started!

As always, you can find the code of this tutorial on the GitHub repository.

View on GitHub Relataly GitHub Repo

Prerequisites

Before you proceed, ensure that you have set up your Python environment (3.8 or higher) and the required packages. If you don’t have an environment, consider following this tutorial to set up the Anaconda environment.

Also, make sure you install all required Python packages. We will be working with the following standard Python packages:

pandas
seaborn
matplotlib

In addition, we will use the Facebook Prophet library that goes by the library name “prophet.” You can install these packages using the following commands:

pip install 
conda install  (if you are using the anaconda packet manager)

Step #1 Loading Packages and API Key

Let’s begin by loading the required Python packages and historical price quotes for the Coca-Cola stock. We will obtain the data from the yahoo finance API. Note that the API will return several columns of data, including, opening, average, and closing prices. We will only use the closing price.

# Tested with Python 3.8.8, Matplotlib 3.5, Seaborn 0.11.1, numpy 1.19.5, plotly 4.1.1, cufflinks 0.17.3, prophet 1.1.1, CmdStan 2.31.0
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from math import log, exp 
from datetime import date, timedelta, datetime
import seaborn as sns
sns.set_style('white', {'axes.spines.right': False, 'axes.spines.top': False})
from scipy.stats import norm
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot
import cmdstanpy
cmdstanpy.install_cmdstan()
cmdstanpy.install_cmdstan(compiler=True)
# Setting the timeframe for the data extraction
end_date =  date.today().strftime("%Y-%m-%d")
start_date = '2010-01-01'
# Getting quotes
stockname = 'Coca Cola'
symbol = 'KO'
# You can either use webreader or yfinance to load the data from yahoo finance
# import pandas_datareader as webreader
# df = webreader.DataReader(symbol, start=start_date, end=end_date, data_source="yahoo")
import yfinance as yf #Alternative package if webreader does not work: pip install yfinance
df = yf.download(symbol, start=start_date, end=end_date)
# Quick overview of dataset
print(df.head())

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close    Volume
Date                                                                       
2010-01-04  28.580000  28.610001  28.450001  28.520000  19.081614  13870400
2010-01-05  28.424999  28.495001  28.070000  28.174999  18.850786  23172400
2010-01-06  28.174999  28.219999  27.990000  28.165001  18.844103  19264600
2010-01-07  28.165001  28.184999  27.875000  28.094999  18.797268  13234600
2010-01-08  27.730000  27.820000  27.375000  27.575001  18.449350  28712400

Once we have downloaded the data, we create a line plot of the closing price to familiarize ourselves with the time series data. Note that Facebook Prophet works on a single input signal only (univariate data). This input will be the closing price. For illustration purposes, we add a moving average to the chart. However, the moving average makes it easier to spot trends and seasonal patterns, it will not be used to fit the model.

# Visualize the original time series
rolling_window=25
y_a_add_ma = df['Close'].rolling(window=rolling_window).mean() 
fig, ax = plt.subplots(figsize=(20,5))
sns.lineplot(data=df, x=df.index, y='Close', color='skyblue', linewidth=0.5, label='Close')
sns.lineplot(data=df, x=df.index, y=y_a_add_ma, 
    linewidth=1.0, color='royalblue', linestyle='--', label=f'{rolling_window}-Day MA')

The chart shows a long-term upward trend interrupted by phases of downturns. In addition, between 2010 and 2018, we can see some cyclical movements. At some points, we can spot clear breakpoints, for example, in 2019 and mid-2020.

Step #2 Preparing the Data

Next, we prepare our data for model training. Propjet has a strict condition on how the input columns must be named. In order to use Facebook Prophet, your data needs to be in a time series format with the time as the index and the value as the first column. In addition, column names need to adhere to the following naming convention:

ds for the timestamp
y for the metric columns, which in our case is the closing price

So before we proceed, we must rename the columns in our dataframe. In addition, we will remove the index and drop NA values.

df_x = df[['Close']].copy()
df_x['ds'] = df.index.copy()
df_x.rename(columns={'Close': 'y'}, inplace=True)
df_x.reset_index(inplace=True, drop=True)
df_x.dropna(inplace=True)
df_x.tail(9)

		y			ds
3257	63.139999	2022-12-09
3258	63.970001	2022-12-12
3259	63.990002	2022-12-13

Now we have a simple dataframe with ds and y as the only variables.

Step #3 Model Fitting and Forecasting

Next, let’s fit our forecasting model to the time series data. Afterward, we can make predictions about future values in the series. However, before we do this, we need to define our prediction interval.

3.1 Setting the Prediction Interval

The prediction interval is a measure of uncertainty in a forecast made with Facebook Prophet. It indicates the range within which the true value of the forecasted quantity is expected to fall a certain percentage of the time. For example, a 95% prediction interval means that the true value of the forecasted quantity is expected to fall within the given range 95% of the time.

In Facebook Prophet, the prediction interval is controlled by the interval_width parameter, which can be set when calling the predict method. The default value for interval_width is 0.80. This means that the true value of the forecasted quantity is expected to fall within the prediction interval 80% of the time. We can adjust the value of interval_width to change the width of the prediction interval as desired. In the example below, we use a prediction interval of 0.85.

3.2 Fit the Model

Next, let’s fit our model and generate a one-year forecast. First, we need to instantiate our model with by calling Prophet(). Then we use model.fit(df) to fit this model to the historical price quotes of the Coca-Cola stock. Once, we have done that, we use the model instance model.make_future_dataframe() to create an extended dataframe (future_df). This dataframe has been extended with records for a one-year period. The records are empty dummy values ready to be filled with the real forecast. We then pass this dummy dataframe to the model.predict(df) function, Facebook Prophet creates the forecast and fills up the dummy dataframe with the forecast values.

For the sake of reusability, I have encapsulated the entire process into a wrapper function. This will allow us to run quick experiments with different parameter values.

# This function fits the prophet model to the input data and generates a forecast
def fit_and_forecast(df, periods, interval_width, changepoint_range=0.8):
    # set the uncertainty interval
    Prophet(interval_width=interval_width)
    # Instantiate the model
    model = Prophet(changepoint_range=changepoint_range)
    # Fit the model
    model.fit(df)
    # Create a dataframe with a given number of dates
    future_df = model.make_future_dataframe(periods=periods)
    # Generate a forecast for the given dates
    forecast_df = model.predict(future_df)
    #print(forecast_df.head())
    return forecast_df, model, future_df
# Forecast for 365 days with full data
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 0.95)
print(forecast_df.columns)
forecast_df[['yhat_lower', 'yhat_upper', 'yhat']].head(5)

Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
       'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
       'weekly', 'weekly_lower', 'weekly_upper', 'yearly', 'yearly_lower',
       'yearly_upper', 'multiplicative_terms', 'multiplicative_terms_lower',
       'multiplicative_terms_upper', 'yhat'],
      dtype='object')
	yhat_lower	yhat_upper	yhat
0	24.468273	28.944286	26.691615
1	24.496074	29.146425	26.706924
2	24.513424	28.829159	26.682213
3	24.358048	28.767209	26.667476
4	24.487963	28.839966	26.666242

Voila, we have generated a one-year forecast.

Step #4 Analyzing the Forecast

Next, let’s visualize our forecast and discuss what we see. The most simple way is to create the plot with a standard Facebook Prophet function.

Also: Measuring Regression Errors with Python

model.plot(forecast_df, uncertainty=True)

So what do we see? The forecast shows that our model does not simply predict a straight line and instead has generated a more sophisticated forecast that displays an upward cyclical trend with higher highs and higher lows.

The black dots are the data points from the historical data to which we have fit our model.
The dark blue line is the most likely path.
The light blue lines are the upper and lower boundaries of the prediction interval. We have set the prediction interval to 0.85, which means there is a probability of 85% the actual values will fall into this range.
In total, the model seems confident that the price of Coca-Cola stock will rise within the next year (no financial advice). However, as we will see later, the forecast depends on where the model sees the breakpoints.

In case, you want to create a custom plot, you can use the function below.

# Visualize the Forecast
def visualize_the_forecast(df_f, df_o):
    rolling_window = 20
    yhat_mean = df_f['yhat'].rolling(window=rolling_window).mean() 
    # Thin out the ground truth data for illustration purposes
    df_lim = df_o
    # Print the Forecast
    fig, ax = plt.subplots(figsize=[20,7])
    sns.lineplot(data=df_f, x=df_f.ds, y=yhat_mean, ax=ax, label='predicted path', color='blue')
    sns.lineplot(data=df_lim, x=df_lim.ds, y='y', ax=ax, label='ground_truth', color='orange')
    #sns.lineplot(data=df_f, x=df_f.ds, y='yhat_lower', ax=ax, label='yhat_lower', color='skyblue', linewidth=1.0)
    #sns.lineplot(data=df_f, x=df_f.ds, y='yhat_upper', ax=ax, label='yhat_upper', color='coral', linewidth=1.0)
    plt.fill_between(df_f.ds, df_f.yhat_lower, df_f.yhat_upper, color='lightgreen')
    plt.legend(framealpha=0)
    ax.set(ylabel=stockname + " stock price")
    ax.set(xlabel=None)
visualize_the_forecast(forecast_df, df_x)

Step #5 Analyzing Model Components

We can gain a better understanding of different model components by using the plot_components function. This method creates a plot showing the trend, weekly and yearly seasonality, and any additional user-defined seasonalities of the forecast. This can be useful for understanding the underlying patterns in the data and for diagnosing potential issues with the model.

model.plot_components(forecast_df)

The first chart shows the trendlines that the model sees within different periods. The trendlines are separated by breakpoints about, which we will talk in the next section. When we look at the second plot, we can see no price changes during the weekend. This is plausible, considering that the stock markets are closed over the weekend. The third chart is most interesting, as it shows that the model has recognized some yearly seasonality with two peaks in April and August, as well as lows in March and October.

Step #6 Adjusting the Changepoints of our Facebook Prophet Model

Let’s take a closer look at the changepoints in our model. Changepoints are the points in time where the trend of the time series is expected to change, and Facebook Prophet’s algorithm automatically detects these points and adapts the model accordingly. Changepoints are important to Facebook Prophet because they allow the model to capture gradual changes or shifts in the data. By identifying and incorporating changepoints into the forecasting model, Facebook Prophet can make more accurate predictions. Changepoints can also help to identify potential outliers in the data.

6.1 Checking Current Changepoints

We can illustrate the changepoints in our model with the add_changepoints_to_plot method. The method adds vertical lines to a plot to indicate the locations of the changepoints in the data. By plotting the changepoints on a graph, we can visually identify when these changes in trend occur and potentially diagnose any issues with our model.

# Printing the ChangePoints of our Model
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 1.0)
axislist = add_changepoints_to_plot(model.plot(forecast_df).gca(), model, forecast_df)

The chart above shows that our model has identified several changepoints in the historical data. However, it has only searched for changepoints within 80% of the time series. As a result, the algorithm hasn’t identified any change points in the most recent years after 2020. We can adjust the changepoints with the changepoint_range (default = 80%) variable. This is what we will do in the next section.

6.2 Adjusting Changepoints

We can adjust the range within which Facebook Prophet looks for changepoints with the “changepoint_range”. It is specified as a fraction of the total duration of the time series. For example, if changepoint_range is set to 0.8 and the time series spans 10 years, the algorithm will look for changepoints within the last 8 years of the series.

By default, changepoint_range is set to 0.8, which means that the algorithm will look for changepoints within the last 80% of the time series. We can adjust this value depending on the characteristics of our data and our desired level of flexibility in the model.

Increasing the value of changepoint_range will allow the algorithm to identify more changepoints and potentially improve the fit of the model, but it may also increase the risk of overfitting. Conversely, decreasing the value of changepoint_range will reduce the number of changepoints detected and may improve the model’s ability to generalize to new data, but it may also reduce the accuracy of the forecast.

Let’s fit our model again, but this time we let Facebook Prophet search for changepoints within the entire time series (changepoint_range=1.0).

# Adjusting ChangePoints of our Model
forecast_df, model, future_df = fit_and_forecast(df_x, 365, 1.0, 1.0)
axislist = add_changepoints_to_plot(model.plot(forecast_df).gca(), model, forecast_df)

The plot above shows that Facebook Prophet has now identified several additional breakpoints in the time series. As a result, the forecast has become rather pessimistic, as Facebook Prophet gave more weight to recent changes.

Finally, it is worth mentioning that it is possible to add changepoints for specific dates manually. You can try this out using “model.changepoints(series)”. The function takes a series of timestamps as the parameter value.

Summary

Get ready to dive into the world of stock market prediction with Facebook Prophet! In this article, we’ll show you how to leverage the power of this amazing tool to forecast time series data, using Coca-Cola’s stock as an example. We’ll guide you through the process of fitting a curve to univariate time series data and fine-tuning the initial breakpoints and trendlines to enhance model performance. With Facebook Prophet’s automatic trend identification algorithm, you’ll be able to easily adapt to changes in the data over time.

Also: Mastering Multivariate Stock Market Prediction with Python

As a data scientist, you’ll appreciate how easy it is to use Facebook Prophet and how it consistently outperforms other models. With its straightforward interface and impressive accuracy, this tool is a must-have for your forecasting toolkit. And we’re always looking for feedback from our audience, so let us know what you think! We’re committed to improving our content to provide the best learning experience possible.

Sources and Further Reading

The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.

Other Methods for Time Series Forecasting

The post Univariate Stock Market Forecasting using Facebook Prophet in Python appeared first on relataly.com.