Building a Conversational Voice Bot with Azure OpenAI and Python: The Future of Human and Machine Interaction

OpenAI and Microsoft have just released a new generation of text-to-speech models that take synthetic speech to a new level. In my latest project I have combined these new models with Azure OpenAI’s ingenuine conversation capacity. The result is a conversational voice bot that uses Generative AI to converse with users in natural spoken language.

This article describes the Python implementation of this project. The bot is designed to understand spoken language and process it through OpenAI GPT-4. It responds with contextually aware dialogue, all in natural-sounding speech. This seamless integration facilitates a conversational flow that feels intuitive and engaging. The voice processing capacities enable users to have meaningful exchanges with the bot as if they were conversing with another person. Testing the bot was a lot of fun. It felt a bit like the iconic scene from Iron Man where the hereo converses with an AI assistant.

Here is an example of the audio response quality:

Also:

Understanding the Voice Bot

The magic begins with the user speaking to the bot. Azure Cognitive Services transcribes the spoken words into text, which is then fed into Azure’s OpenAI service. Here, the text is processed, and a response is generated based on the conversation’s context and history. Finally, the text-to-speech model transforms the response back into speech, completing the cycle of interaction. This process showcases the potential of AI in understanding and participating in human-like conversations.

Prerequisites & Azure Service Integration

Our conversational voice bot is built upon two pivotal Azure services: Cognitive Speech Services and OpenAI. Billing of these services is pay-per-use. Unless you process large numbers of requests, the costs for experimenting with these services is relatively low.

Azure Cognitive Speech Services

Azure AI Speech Services (previously Cognitive Speech Services) provide the tools necessary for speech-to-text conversion, enabling our voice bot to understand spoken language. This service boasts advanced speech recognition capabilities, ensuring accurate transcription of user speech into text. Furthermore, it powers the text-to-speech synthesis that transforms generated text responses back into natural-sounding voice. This allows for a truly conversational experience.

The newest generation of OpenAI text-to-speech models is now also availbale in Azure AI Speech. These models can synthesize voices in unknown level of quality. I am most impressed by its capability to alter intonation dynamically to express emotions.

Azure OpenAI Service

At the heart of our project lies Azure’s OpenAI service, which uses the power of models like GPT-4 context-aware responses. Once Azure Cognitive Speech Services transcribe the user’s speech into text, this text is sent to OpenAI. The OpenAI model then processes the input and generates a completion. The service’s ability to understand context and generate engaging responses is what makes our voice bot remarkably human-like.

Implementation: Detailed Code Walkthrough

Let’s start with the implementation! We kick things off with Azure Service Authentication, where we set up our conversational voice bot to communicate with Azure and OpenAI’s advanced services. Then, Speech Recognition steps in, acting as our bot’s ears, converting spoken words into text. Next up, Processing and Response Generation uses OpenAI’s GPT-4 to turn text into context-aware responses. Speech Synthesis then gives our bot a voice, transforming text responses back into spoken words for a natural conversational flow. Finally, Managing the Conversation keeps the dialogue coherent and engaging. Through these steps, we create a voice bot that offers an intuitive and engaging conversational experience. Let’s discuss these sections one by one.

As always, you can find the code on github:

Step #1 Azure Service Authentication

First off, we kick things off by getting our ducks in a row with Azure Service Authentication. This is where the magic starts, setting the stage for our conversational voice bot to interact with Azure’s brainy suite of Cognitive Services and the fantastic OpenAI models. By fetching API keys and setting up our service regions, we’re essentially giving our project the keys to the kingdom.

For using dotenv, you need to create an .env file in your root folder. Here is more information on how this works.

import os
from dotenv import load_dotenv
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

# Load environment variables from .env file
load_dotenv()

# Constants from .env file
SPEECH_KEY = os.getenv('SPEECH_KEY')
SERVICE_REGION = os.getenv('SERVICE_REGION')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ENDPOINT = os.getenv('OPENAI_ENDPOINT')

# Azure Speech Configuration
speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SERVICE_REGION)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
speech_config.speech_recognition_language="en-US"

# OpenAI Configuration
openai_client = AzureOpenAI(
    api_key=OPENAI_API_KEY,
    api_version="2023-12-01-preview",
    azure_endpoint=OPENAI_ENDPOINT
)

Step #2 Speech Recognition

The user’s spoken input is captured and converted into text using Azure’s Speech-to-Text service. This involves initializing the speech recognition service with Azure credentials and configuring it to listen for and transcribe spoken language in real-time.

def recognize_from_microphone():
    # Configure the recognizer to use the default microphone.
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    # Create a speech recognizer with the specified audio and speech configuration.
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    print("Speak into your microphone.")
    # Perform speech recognition and wait for a single utterance.
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    # Process the recognition result based on its reason.
    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
        # Return the recognized text if speech was recognized.
        return speech_recognition_result.text
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
    # Return 'error' if recognition failed or was canceled.
    return 'error'

Step #3 Processing and Response Generation

Once we’ve got your words neatly transcribed, it’s time for the Processing and Response Generation phase. This is where OpenAI steps in, acting like the brain behind the operation. It takes your spoken words, now in text form, and churns out responses that are nothing short of conversational gold. We nudge OpenAI’s GPT-4 into generating replies that feel as natural as chatting with a close friend over coffee.

def openai_request(conversation, sample = [], temperature=0.9, model_engine='gpt-4'):
    # Initialize AzureOpenAI client with keys and endpoints from Key Vault.
    
    
    # Send a request to Azure OpenAI with the conversation context and get a response.
    response = openai_client.chat.completions.create(model=model_engine, messages=conversation, temperature=temperature, max_tokens=500)
    return response.choices[0].message.content

Step #4 Speech Synthesis

Next up, we tackle Speech Synthesis. If the previous step was the brain, consider this the voice of our operation. Taking the AI-generated text, we transform it back into speech—like turning lead into gold, but for conversations. Through Azure’s Text-to-Speech service, we’re able to give our bot a voice that’s not only clear but also surprisingly human-like.

def synthesize_audio(input_text):
    # Define SSML (Speech Synthesis Markup Language) for input text.
    ssml = f"""
        <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
            <voice name='en-US-OnyxMultilingualNeuralHD'>
                <p>
                    {input_text}
                </p>
            </voice>
        </speak>
        """
    
    audio_filename_path = "audio/ssml_output.wav"  # Define the output audio file path.
    print(ssml)
    # Synthesize speech from the SSML and wait for completion.
    result = speech_synthesizer.speak_ssml_async(ssml).get()

    # Save the synthesized audio to a file if synthesis was successful.
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        with open(audio_filename_path, "wb") as audio_file:
            audio_file.write(result.audio_data)
        print(f"Speech synthesized and saved to {audio_filename_path}")
    elif result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = result.cancellation_details
        print(f"Speech synthesis canceled: {cancellation_details.reason}")
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print(f"Error details: {cancellation_details.error_details}")


# Create the audio directory if it doesn't exist.
if not os.path.exists('audio'):
    os.makedirs('audio')

Step #5 Managing the Conversation

Finally, we bring it all together in the Managing the Conversation step. This is where we ensure the chat keeps rolling, looping through listening, thinking, and speaking. We keep track of what’s been said to keep the conversation relevant and engaging.

The system message below makes the bot talk like a pirate. But you can easily adjust the system message and this way customize the bots behavior.

conversation=[{"role": "system", "content": "You are a helpful assistant that talks like pirate. If you encounter any issues, just tell a pirate joke or a story."}]

while True:
    user_input = recognize_from_microphone()  # Recognize user input from the microphone.
    conversation.append({"role": "user", "content": user_input})  # Add user input to the conversation context.

    assistant_response = openai_request(conversation)  # Get the assistant's response based on the conversation.

    conversation.append({"role": "assistant", "content": assistant_response})  # Add the assistant's response to the context.
    
    print(assistant_response)
    synthesize_audio(assistant_response)  # Synthesize the assistant's response into audio.

Throughout these steps, the conversation’s context is managed meticulously to ensure continuity and relevance in the bot’s responses, making the interaction feel more like a dialogue between humans.

Current Challenges

Despite the promising capabilities of our voice bot, the journey through its development and interaction has presented a few challenges that underscore the complexity of human-machine communication.

Slow Response Time

One of the notable hurdles is the slow response time experienced during interactions. The process, from speech recognition through to response generation and back to speech synthesis, involves several steps that can introduce latency. This latency can detract from the user experience, as fluid conversations typically require quick exchanges. Optimizing the interaction flow and exploring more efficient data processing methods may mitigate this issue in the future.

Handling Pauses in Speech

Another challenge lies in the bot’s handling of longer pauses while speaking. The current setup does not always allow users to pause thoughtfully without triggering the end of their input. This may sometimes lead to a situation where the model cuts off speech prematurely. This limitation points to the need for more sophisticated speech recognition algorithms capable of distinguishing between a conversational pause and the end of an utterance.

Summary

This article has shown how you can build a conversational voice bot in Python with the latest pretrained AI models. The project showcases the incredible potential of combining Azure Cognitive Services with OpenAI’s conversational models. I hope by now you understand the technical feasibility of creating voice-based applications and how they open up a world of possibilities for human-machine interaction. As we continue to refine and enhance this technology, the line between talking to a machine and conversing with a human will become ever more blurred, leading us into a future where AI companionship becomes reality.

This exploration of Azure Cognitive Services and OpenAI’s integration within a voice bot is just the beginning. As AI continues to evolve, the ways in which we interact with technology will undoubtedly transform, making our interactions more natural, intuitive, and, most importantly, human.

Also: 9 Business Use Cases of OpenAI’s ChatGPT

Sources and Further Reading

Author

  • Florian Follonier

    Hi, I am Florian, a Zurich-based Cloud Solution Architect for AI and Data. Since the completion of my Ph.D. in 2017, I have been working on the design and implementation of ML use cases in the Swiss financial sector. I started this blog in 2020 with the goal in mind to share my experiences and create a place where you can find key concepts of machine learning and materials that will allow you to kick-start your own Python projects.

    View all posts
0 0 votes
Article Rating
Subscribe
Notify of

0 Comments
Newest
Oldest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x