TL;DR
Voice is the most natural way humans interact with each other daily. We speak to communicate, give instructions, and express emotions, and we effortlessly understand one another’s speech.
For as long as humanity has existed, voice has been our primary means of communication. Now, this mode of interaction is shaping the way we interact with computers. Over the years, advances in speech understanding and generation have transformed human-computer interaction, and the latest voice models from OpenAI are a testament to how far speech AI has come.
But beyond simple interactions, can we push the boundaries of these powerful voice capabilities even further?
The answer is yes. By combining advanced voice technology with the agentic capabilities of Large Language Models (LLMs), we can create something more: Voice Agents.
Voice Agents are AI models that combine agentic reasoning with advanced voice capabilities. These agents can understand speech just like we do, perform actions based on spoken instructions, and respond naturally to users.
In this article, we will build a Voice Agent using the OpenAI Agent SDK. By the end, you will learn:
The key components of a Voice Agent
How to use the OpenAI Agent SDK and extend it with voice capabilities
The concepts of Voice Workflows and Voice Pipelines within the SDK
A quick primer on Voice Agents
A typical AI agent operates by taking instructions in the form of a prompt. The LLM then decides on the appropriate tool or action to perform. Once the action is completed, the agent responds to the user via text.

A Voice Agent follows the same process but with additional speech-processing components. First, a speech-to-text (STT) model converts spoken input into text. The transcribed text is then passed to an LLM, which processes the instructions and generates a response.

Instead of simply returning text, however, the output is sent to a text-to-speech (TTS) model, which converts it back into audio for the user.
This modular, chained approach to building Voice Agents offers flexibility and scalability. Developers can easily swap out individual components when updating or improving the system.
For example, if a Voice Agent was initially built using OpenAI’s Whisper for speech-to-text conversion, but OpenAI later releases a more advanced model like gpt-4o-transcribe, you can seamlessly replace Whisper with the new model.
The same modularity applies to both the LLM and the text-to-speech components, making it easy to upgrade and refine the Voice Agent over time.
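To make the chain concrete, here is a minimal sketch of the STT → LLM → TTS loop using the standalone OpenAI Python client rather than the Agent SDK. The file names and model choices are illustrative, and the SDK-based approach we build later automates these steps:
Python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) Speech-to-text: transcribe a recorded question (file name is illustrative).
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=audio_file
    )

# 2) LLM: generate a text reply from the transcription.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# 3) Text-to-speech: synthesize the reply and save it to an audio file.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts", voice="alloy", input=reply_text
) as response:
    response.stream_to_file("reply.mp3")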
The OpenAI Agent SDK
The OpenAI Agent SDK is the production-ready successor to OpenAI's earlier experimental project, OpenAI Swarm. Initially designed for text-based agents, the SDK has since been extended with voice capabilities, enabling the development of powerful Voice Agents. It gives developers a robust set of tools to create agents that can understand instructions, leverage various tools, and collaborate with other agents to accomplish complex tasks efficiently.
Components of the OpenAI Agent SDK
The SDK comprises three primary components:
Agents: These are instances of large language models (LLMs) configured with specific instructions and equipped with tools to perform designated tasks. Agents serve as the central entities that process inputs and generate appropriate actions or responses.
Tools: Tools extend the functionality of agents by enabling them to perform actions such as fetching data, executing code, calling external APIs, and interacting with external systems. The SDK supports various types of tools, including hosted tools, function tools, and even other agents acting as tools.
Runner: The Runner is responsible for executing the agent loop. It has access to the agents and the user input. Within this loop, agents can use tools and perform handoffs (delegating tasks to other agents) until a final response is produced.
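To see how these three pieces fit together, here is a minimal, self-contained sketch (not part of the Voice Agent we build below; it assumes the SDK is installed and OPENAI_API_KEY is set):
Python
from agents import Agent, Runner, function_tool


@function_tool
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b


# An Agent is an LLM configured with instructions and tools.
calculator = Agent(
    name="Calculator",
    instructions="Use the add tool to answer arithmetic questions.",
    tools=[add],
)

# The Runner executes the agent loop: model call -> tool call -> final answer.
result = Runner.run_sync(calculator, "What is 2 + 3?")
print(result.final_output)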
The Voice AI Workflow
To extend the capabilities of the Agent SDK to support voice, OpenAI introduced two key concepts: the Voice Pipeline and Voice Workflow.

The Workflow
The voice workflow consists of the agent and the tools used during its execution. It primarily expects text input, and the workflow processes this input to generate a text output. In this regard, the voice workflow operates similarly to a standard agent workflow. However, its true power is unlocked when combined with the Voice Pipeline.
The Pipeline
The pipeline encompasses the entire Voice Agent: the Speech-to-Text model, the voice workflow, and the Text-to-Speech model. As the name suggests, the pipeline first converts the user's speech to text using the Speech-to-Text model, passes the text through the workflow, and finally converts the resulting text back into speech using the Text-to-Speech model. The pipeline coordinates all of these components so the Voice Agent runs smoothly from speech in to speech out.
Building the Voice Workflow
In this section, we'll walk through the process of building a voice-enabled workflow using the OpenAI Agent SDK. This involves setting up the necessary dependencies, creating agents and their associated tools, defining the agents, and implementing the workflow that integrates speech-to-text and text-to-speech functionalities.
Installing the Voice Dependencies for Agent SDK
To begin, ensure that you have the OpenAI Agent SDK installed along with the voice dependencies. This can be achieved by running the following command:
pip install 'openai-agents[voice]'
This command installs the Agent SDK along with the necessary components to handle voice functionalities, including speech-to-text and text-to-speech capabilities. For more details, refer to the OpenAI Agents SDK Quickstart Guide.
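You can quickly confirm that the voice extras were installed by importing the voice classes used later in this article; if the import below succeeds, you're ready to go:
Python
# A quick check: these classes come from the [voice] extra and are used later in this article.
from agents.voice import VoicePipeline, StreamedAudioInput, VoiceWorkflowBase

print("Voice support is available.")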
Creating the Agent's Tool
Agents in the OpenAI Agent SDK can be equipped with tools that extend their capabilities. In this example, we'll create a tool that retrieves the current time in a specified time zone. Here's how you can define this tool:
Python
from datetime import datetime
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

from agents import function_tool


@function_tool
def get_time_in_timezone(timezone: str) -> str:
    """
    Given a timezone, this tool returns the current time in that time zone.
    """
    try:
        tz = ZoneInfo(timezone)
        current_time = datetime.now(tz)
        return current_time.strftime("%Y-%m-%d %H:%M:%S %Z%z")
    except (ZoneInfoNotFoundError, ValueError):
        # ZoneInfo raises ZoneInfoNotFoundError for unknown keys and ValueError for malformed ones.
        return "Invalid time zone. Please provide a valid time zone name."

In this code:
We import the necessary modules: datetime for handling date and time, and ZoneInfo (along with ZoneInfoNotFoundError) for time zone handling.
We define the get_time_in_timezone function and decorate it with @function_tool, making it a tool that agents can use.
The function attempts to create a ZoneInfo object for the given time zone and retrieves the current time in that zone. If the time zone is invalid, it returns an error message.
This tool allows agents to access current time information based on user-specified time zones. For more information on creating tools, see the OpenAI Agents SDK documentation on Tools.
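The same pattern applies to any plain Python function. As a further illustration only (this hypothetical tool is not used in the agent we build below), here is a companion tool that reports a time zone's current UTC offset:
Python
from datetime import datetime
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

from agents import function_tool


@function_tool
def get_utc_offset(timezone: str) -> str:
    """Return the current UTC offset of the given time zone."""
    try:
        offset = datetime.now(ZoneInfo(timezone)).strftime("%z")
        return f"The current UTC offset for {timezone} is {offset}."
    except (ZoneInfoNotFoundError, ValueError):
        return "Invalid time zone. Please provide a valid time zone name."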
Defining the Agents
With the tool created, we can now define the agents that will use it. In this example, we create two agents: one for general assistance and another for handling Spanish language interactions.
Python
from agents import Agent
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions

# Define the Spanish-speaking agent
spanish_agent = Agent(
    name="Spanish",
    handoff_description="A Spanish-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Spanish."
    ),
    model="gpt-4o-mini",
)

# Define the main assistant agent
agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. If the user speaks in Spanish, hand off to the Spanish agent."
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent],
    tools=[get_time_in_timezone],
)

In this setup:
We import the Agent class and the prompt_with_handoff_instructions function.
We create spanish_agent with specific instructions to interact politely in Spanish.
We create the main agent with instructions to handle general interactions and delegate Spanish-language interactions to spanish_agent.
The agent is equipped with the get_time_in_timezone tool, enabling it to provide time-related information.
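Before adding voice, you can optionally sanity-check this setup with a plain text run. The snippet below is not part of the tutorial's files; it assumes the agent and tool above are defined in the same module and that OPENAI_API_KEY is set:
Python
from agents import Runner

# Run the assistant once on a text prompt; it should call the get_time_in_timezone tool.
result = Runner.run_sync(agent, "What time is it right now in Asia/Tokyo?")
print(result.final_output)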
The Actual Workflow
The workflow orchestrates the interaction between speech input, agent processing, and speech output. We define a VoiceAgentWorkflow class that inherits from VoiceWorkflowBase and implements the run method.
Python
from collections.abc import AsyncIterator
from typing import Callable

from agents import Runner
from agents.voice import VoiceWorkflowBase, VoiceWorkflowHelper


class VoiceAgentWorkflow(VoiceWorkflowBase):
    def __init__(self, on_start: Callable[[str], None]):
        """
        Args:
            on_start: A callback that is called when the workflow starts.
                The transcription is passed in as an argument.
        """
        self._input_history: list = []
        self._current_agent = agent
        self._on_start = on_start

    async def run(self, transcription: str) -> AsyncIterator[str]:
        self._on_start(transcription)

        # Add the transcription to the input history
        self._input_history.append(
            {
                "role": "user",
                "content": transcription,
            }
        )

        # Run the agent and stream its text output
        result = Runner.run_streamed(self._current_agent, self._input_history)
        async for chunk in VoiceWorkflowHelper.stream_text_from(result):
            yield chunk

        # Update the input history and current agent for the next turn
        self._input_history = result.to_input_list()
        self._current_agent = result.last_agent

In this workflow:
The __init__ method initializes the workflow with an on_start callback, which is invoked at the beginning of the workflow with the transcription text.
The run method processes the transcription:
It calls the on_start callback with the transcription.
Adds the transcription to the input_history.
Executes the agent using Runner.run_streamed, passing the input_history.
Streams the agent's response text using VoiceWorkflowHelper.stream_text_from.
Updates the input_history and sets the current_agent for subsequent interactions.
This workflow integrates the agent's processing capabilities with speech input and output, enabling a seamless voice interaction experience. For more information on workflows, see the OpenAI Agents SDK Workflow Reference.
By following these steps, you've created a voice-enabled workflow that leverages the OpenAI Agent SDK's capabilities to process speech input, interact through agents, and produce speech output.
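If you want to exercise the workflow before wiring up any audio, you can drive it directly with a typed "transcription". This optional snippet assumes the workflow above is saved in workflow.py (the project layout used later in this article) and that OPENAI_API_KEY is set:
Python
import asyncio

from workflow import VoiceAgentWorkflow


async def main():
    # Pretend this string came from the speech-to-text model.
    workflow = VoiceAgentWorkflow(on_start=lambda t: print(f"[transcription] {t}"))
    async for chunk in workflow.run("What time is it in Europe/Paris?"):
        print(chunk, end="", flush=True)
    print()


asyncio.run(main())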
Building a Real-Time Voice Agent
Now that we have built out the workflow, let's put everything together into a pipeline that will serve as our voice agent.
In this section, we'll guide you through the process of building a real-time voice agent using the OpenAI Agent SDK and the sounddevice library. This agent will capture audio from the microphone, process it through a voice pipeline, and play the response back to the user.
Defining the Attributes
First, we define the necessary attributes for our application, including audio settings and components for audio input and output.
Python
import asyncio

import numpy as np
import sounddevice as sd

from agents.voice import StreamedAudioInput, VoicePipeline
from workflow import VoiceAgentWorkflow

# Constants for audio settings
CHUNK_LENGTH_S = 0.05  # 50ms
SAMPLE_RATE = 24000
FORMAT = np.int16
CHANNELS = 1


class RealtimeCLIApp:
    def __init__(self):
        self.should_send_audio = asyncio.Event()
        self.pipeline = VoicePipeline(
            workflow=VoiceAgentWorkflow(on_start=self._on_transcription),
            stt_model="gpt-4o-transcribe",
            tts_model="gpt-4o-mini-tts",
        )
        self._audio_input = StreamedAudioInput()
        self.audio_player = sd.OutputStream(
            samplerate=SAMPLE_RATE,
            channels=CHANNELS,
            dtype=FORMAT,
        )

In this setup:
CHUNK_LENGTH_S: Duration of each audio chunk in seconds (0.05 s, i.e. 1,200 samples at 24 kHz).
SAMPLE_RATE: Sampling rate for audio recording.
FORMAT: Data format for audio samples.
CHANNELS: Number of audio channels (1 for mono).
RealtimeCLIApp: Main application class that initializes the audio pipeline and input/output streams.
Defining the Methods
Next, we define the methods that handle various aspects of the voice agent's operation. For now, these methods will contain pass statements, indicating that their implementations will be provided later.
Python
class RealtimeCLIApp:
    # ... [Previous code]

    def _on_transcription(self, transcription: str):
        pass

    async def start_voice_pipeline(self):
        pass

    async def send_mic_audio(self):
        pass

    async def run(self):
        pass

_on_transcription: Callback method invoked when transcription is received.
start_voice_pipeline: Initializes and manages the voice pipeline.
send_mic_audio: Captures audio from the microphone.
run: Main method to run the application.
Working with sounddevice
The sounddevice library allows for real-time audio input and output in Python. It's cross-platform and provides a simple interface for audio streaming.
To install sounddevice, use:
pip install sounddevice
For more details, refer to the sounddevice documentation.
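Before running the full agent, it can help to confirm that sounddevice can see your microphone and speakers. A quick, optional check:
Python
import sounddevice as sd

# List all audio devices and show the default input/output device indices.
print(sd.query_devices())
print("Default devices (input, output):", sd.default.device)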
Implementing the Methods
Now, we implement the methods defined earlier to handle audio processing and application logic.
The start_voice_pipeline Method
This method starts the audio player and initiates the voice pipeline, processing audio input and output.
Python
async def start_voice_pipeline(self):
    try:
        self.audio_player.start()
        self.result = await self.pipeline.run(self._audio_input)
        async for event in self.result.stream():
            if event.type == "voice_stream_event_audio":
                self.audio_player.write(event.data)
            elif event.type == "voice_stream_event_lifecycle":
                print(f"Lifecycle event: {event.event}")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        self.audio_player.close()

In this method:
Starts the audio player.
Runs the voice pipeline with the audio input.
Processes incoming audio events, playing audio data or handling lifecycle events.
Ensures the audio player is closed after processing.
The send_mic_audio Method
This method captures audio from the microphone and feeds it into the audio input stream.
Python
async def send_mic_audio(self):
    stream = sd.InputStream(channels=CHANNELS, samplerate=SAMPLE_RATE, dtype="int16")
    stream.start()
    read_size = int(SAMPLE_RATE * 0.02)  # 20ms chunks

    try:
        while True:
            if stream.read_available < read_size:
                await asyncio.sleep(0)
                continue

            await self.should_send_audio.wait()

            data, _ = stream.read(read_size)
            await self._audio_input.add_audio(data)
            await asyncio.sleep(0)
    except KeyboardInterrupt:
        pass
    finally:
        stream.stop()
        stream.close()

In this method:
Initializes an input stream to capture audio.
Reads audio data in 20ms chunks.
Waits for the signal to send audio data.
Adds the captured audio to the audio input stream.
Handles cleanup on interruption.
The run Method
The run method manages user input to start/stop recording and to quit the application.
Python
async def run(self):
    print("Press 'K' to start/stop recording, 'Q' to quit.")
    asyncio.create_task(self.start_voice_pipeline())
    asyncio.create_task(self.send_mic_audio())

    while True:
        cmd = await asyncio.to_thread(input)  # Run input() in a separate thread
        cmd = cmd.strip().lower()
        if cmd == 'q':
            break
        elif cmd == 'k':
            if self.should_send_audio.is_set():
                self.should_send_audio.clear()
                print("Recording stopped.")
            else:
                self.should_send_audio.set()
                print("Recording started.")

In this method:
Informs the user of available commands.
Starts the voice pipeline and audio capture tasks.
Monitors user input to control recording and exit the application.
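To tie the class together, main.py also needs a small entry point that instantiates the app and starts its event loop. A minimal sketch (the complete code linked below contains the full version) could look like this:
Python
# Entry point for main.py (a minimal sketch).
if __name__ == "__main__":
    app = RealtimeCLIApp()
    asyncio.run(app.run())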
Testing the Voice Agent
Now that our Voice Agent is fully set up with both the pipeline and workflow in place, it’s time to test it out.
You can find the complete code here. To run the agent, you’ll need to grab your OpenAI API key and set it as an environment variable:
export OPENAI_API_KEY=sk-...
Our workflow and pipeline will be stored in two files:
workflow.py: contains the workflow logic
main.py: serves as the entry point for running the agent
To launch the program, use the following command:
python main.py
Once the agent starts, you'll see the following prompt:
Press 'K' to start/stop recording, 'Q' to quit.
When you press 'K' (followed by Enter, since the app reads commands with input()), the agent will begin recording. As you speak, your voice will be transcribed using OpenAI's Speech-to-Text model. When the lifecycle event detects that you've finished speaking, the agent will generate a response and play it back as synthesized speech.
Here’s a screenshot of an interaction between a user and the Voice Agent:
With that, we now have a fully functional Voice Agent!
Conclusion
Armed with the knowledge and starter code from this guide, you can take things even further by building more advanced Voice Agents tailored to specific use cases. For example, you could create:
Customer Service Agents – Handle inquiries, provide support, and resolve customer issues in real time.
Delivery Management Agents – Assist with order tracking, delivery updates, and logistics coordination.
Booking Assistants – Help users schedule appointments, book reservations, or manage travel plans.
The possibilities are endless. By customizing the workflow, integrating additional tools, and leveraging OpenAI’s powerful voice models, you can create highly interactive and intelligent Voice AI solutions for various applications.
Now it’s your turn to build something amazing! 🚀




