Please Listen Carefully, As Our Menu Options Have Changed
You know the nightmare. You call your bank. A cheerful, soulless voice says, “Welcome to MegaCorp Bank! For English, press one. Para Español, oprima dos.”
You press one.
“Please listen carefully, as our menu options have recently changed to be even more confusing than before! For account balances, say ‘Balance.’ For transactions, say ‘Transactions.’ To speak to a human who may or may not be able to help you, say ‘Agent’ or sacrifice a goat under the full moon.”
You sigh and say “Agent.” The robot replies, “I’m sorry, I didn’t get that. Did you say… ‘egg plant’?”
This has been the state of voice automation for the last 20 years. It’s slow, it’s stupid, and it makes customers want to throw their phones into a volcano. Today, we’re going to build the system that puts this entire industry out of its misery.
Why This Matters
In the last lesson, we built a lightning-fast AI brain with Groq. It could think in milliseconds. But a brain in a jar is a party trick. To be useful, it needs ears and a mouth. That’s what we’re building today.
A real-time voice agent isn’t just a gimmick. It’s a fundamental shift in business operations:
- 24/7 Tier-1 Support: Imagine a support agent that never sleeps, never gets tired, and can handle 80% of common customer questions instantly, at any time of day. This frees up your human team for the truly complex problems.
- Scalable Sales Development: What if you could have an AI make initial qualification calls to 1,000 new leads in an hour, asking 3-4 key questions, and then scheduling a demo with a human salesperson for only the hottest prospects?
- Perfect Customer Experience: No more pressing buttons. No more awkward pauses. Just a natural, fluid conversation that solves the customer’s problem and gets them off the phone happy.
We are replacing the clunky, hated IVR (Interactive Voice Response) system with a CVR (Conversational Voice Response) system. This isn’t an upgrade; it’s a revolution.
What This Tool / Workflow Actually Is
We’re building an AI voice agent pipeline. Think of it like a factory assembly line for conversation:
- Ears (Speech-to-Text): A library listens to the microphone, captures what you say, and turns your voice into plain text.
- Brain (LLM – Groq): The text is sent to our super-fast Groq API. The Llama 3 model thinks and generates a text response in a fraction of a second.
- Mouth (Text-to-Speech – ElevenLabs): Groq’s text response is sent to an API called ElevenLabs, which is famous for its realistic, low-latency AI voices. It turns the text into audio.
- Speaker: The audio from ElevenLabs is played back through your speakers.
This entire loop — from you finishing your sentence to the AI starting its reply — happens in under a second. That’s the threshold where conversation feels natural and not robotic.
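In code terms, that whole assembly line boils down to three function calls in a loop. Here’s a conceptual preview; the real implementations of these functions are exactly what we’ll write in the full script below:

# The conversational loop, conceptually. These functions are
# defined for real in the Step 3 script.
while True:
    text = listen_for_audio()        # Ears: microphone -> text
    reply = get_ai_response(text)    # Brain: text -> Groq -> text
    speak_text(reply)                # Mouth: text -> ElevenLabs -> speakers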
What it is NOT: This is NOT a complete, enterprise-grade call center in a box. It doesn’t have memory (yet), and it can’t browse the web (yet). It’s the foundational engine of a conversational system, upon which we will build everything else.
Prerequisites
This is a step up from last time, but you can do this. I promise.
- Everything from the last lesson: A Groq account and API key, with Python installed.
- An ElevenLabs Account: Go to ElevenLabs and sign up for a free account. Their free tier is more than enough for this project.
- Your ElevenLabs API Key: Find it in your profile settings, copy it, and keep it safe next to your Groq key.
- A Microphone: Any basic microphone will do. The one built into your laptop is fine for testing.
That’s it. Let’s start building our robot’s face.
Step-by-Step Tutorial
We’re going to build this piece by piece, so you understand how the assembly line works.
Step 1: Install All the Necessary Libraries
We need a few more tools than last time. Open your terminal and run these commands one by one.
First, the AI services:
pip install groq elevenlabs
Next, the tools for handling audio. This part can be tricky. The `SpeechRecognition` library needs another library called `PyAudio` to access the microphone.
pip install SpeechRecognition
pip install pyaudio
Troubleshooting Note: If pip install pyaudio fails (it sometimes does on Windows or Mac), don’t panic. The two most common fixes are installing it with Homebrew on Mac (brew install portaudio then pip install pyaudio) or using `pipwin` on Windows. A quick Google of “install pyaudio windows/mac” will solve it 99% of the time.
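Once PyAudio installs, it’s worth confirming that Python can actually see your microphone before going any further. SpeechRecognition has a built-in device listing that makes this a quick sanity check:

import speech_recognition as sr

# Print every audio input device PyAudio can see. If this list is empty
# (or the import itself fails), PyAudio isn't installed correctly.
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f"{index}: {name}")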
Step 2: Set Up Your API Keys
Just like last time, we’ll use environment variables. It’s the professional way to do it. In your terminal:
On Mac/Linux:
export GROQ_API_KEY='YOUR_GROQ_KEY_HERE'
export ELEVEN_API_KEY='YOUR_ELEVENLABS_KEY_HERE'
On Windows Command Prompt (no quotes here; `set` stores them as part of the value):
set GROQ_API_KEY=YOUR_GROQ_KEY_HERE
set ELEVEN_API_KEY=YOUR_ELEVENLABS_KEY_HERE
Now your script can access both keys securely.
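If you want to be certain before running the full script, a quick check confirms both keys are visible to Python. Remember that these variables only live in the current terminal session, so run this in the same window you’ll use later:

import os

# Both lines should print True. If you see False, re-run the
# export/set commands in this same terminal session.
print("Groq key set:", os.environ.get("GROQ_API_KEY") is not None)
print("ElevenLabs key set:", os.environ.get("ELEVEN_API_KEY") is not None)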
Step 3: Write the Full Python Script
This is the whole machine. Create a file named voice_agent.py, then copy and paste the code below. I’ve added comments explaining each step, so read through it carefully.
import os
import speech_recognition as sr
from groq import Groq
from elevenlabs.client import ElevenLabs
from elevenlabs import play

# Initialize the clients with API keys from environment variables
client_groq = Groq()
client_eleven = ElevenLabs()

# Initialize the recognizer
r = sr.Recognizer()

def listen_for_audio():
    """Captures audio from the microphone and transcribes it to text."""
    with sr.Microphone() as source:
        print("Listening... Speak now.")
        r.pause_threshold = 1  # seconds of non-speaking audio before a phrase is considered complete
        r.adjust_for_ambient_noise(source, duration=0.5)
        try:
            audio = r.listen(source, timeout=5)  # wait up to 5 seconds for speech to start
        except sr.WaitTimeoutError:
            print("Timeout reached. No speech detected.")
            return ""
    try:
        print("Transcribing...")
        text = r.recognize_google(audio)
        print(f"You said: {text}")
        return text
    except sr.UnknownValueError:
        print("Sorry, I could not understand the audio.")
    except sr.RequestError as e:
        print(f"Could not request results; {e}")
    return ""

def get_ai_response(prompt):
    """Sends the user's prompt to Groq and gets a response."""
    print("Getting AI response...")
    chat_completion = client_groq.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Keep your answers concise and conversational."
            },
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="llama3-8b-8192",
    )
    return chat_completion.choices[0].message.content

def speak_text(text):
    """Converts text to speech using ElevenLabs and plays it."""
    print("Speaking response...")
    # Use a fast, conversational voice. You can find voice IDs on the ElevenLabs website.
    audio = client_eleven.generate(
        text=text,
        voice="Rachel",
        model="eleven_turbo_v2"
    )
    play(audio)  # note: play() relies on ffmpeg being installed on your system

def main():
    """The main loop of the voice agent."""
    while True:
        user_input = listen_for_audio()
        if user_input.lower() in ["quit", "exit", "stop"]:
            print("Exiting program.")
            speak_text("Goodbye!")
            break
        if user_input:
            ai_response = get_ai_response(user_input)
            print(f"AI says: {ai_response}")
            speak_text(ai_response)

if __name__ == "__main__":
    main()
Step 4: Run the Agent!
Go to your terminal, navigate to where you saved the file, and run:
python voice_agent.py
You’ll see “Listening… Speak now.” in the terminal. Ask it a question, like “What is the capital of France?” or “Explain quantum computing in one sentence.” Wait a moment, and you will hear a clear, fast voice answer you.
To stop the program, just say “quit” or “exit”. You’ve done it. You’ve built a real-time conversational AI.
Real Business Use Cases
This exact pipeline can be adapted for countless real-world scenarios (see the prompt sketch after this list):
- Restaurant: A voice agent that answers the phone, checks an API for table availability, and takes a reservation. “I see we have a table for two available at 7:30 PM. Should I book that for you?”
- Medical Clinic: An automated appointment reminder system that calls patients, confirms their appointment (“Please say ‘yes’ to confirm or ‘no’ to reschedule”), and updates a scheduling system.
- E-commerce Store: A customer calls to check their order status. The agent asks for the order number, looks it up in a database (we’ll learn this later), and provides a real-time update: “I see your order, number 12345. It shipped this morning and is scheduled for delivery on Friday.”
- Real Estate Agency: A 24/7 information line for property listings. A potential buyer calls a number on a “For Sale” sign. The agent answers, provides details about the property (bedrooms, price, etc.), and offers to connect them to a human agent.
- SaaS Onboarding Assistant: A voice-driven helper for new users. The user can ask, “How do I create a new project?” and the agent provides a quick, verbal walkthrough, freeing up the support team from repetitive setup questions.
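For several of these scenarios, the first adaptation step is simply rewriting the system prompt inside get_ai_response. Here’s a hedged sketch for the restaurant case; the business name and wording are illustrative placeholders, not a production-tested prompt:

# Hypothetical system prompt for the restaurant scenario. Swap this string
# into get_ai_response() in place of the generic assistant prompt.
RESTAURANT_PROMPT = (
    "You are the phone host for Luigi's Trattoria. "  # placeholder business name
    "Greet callers warmly, answer questions about hours and the menu, and "
    "collect a name, party size, and time for reservations. "
    "Keep every reply to one or two short sentences."
)

The real-world integrations (availability APIs, databases, scheduling systems) come later; the persona and conversational behavior come almost for free.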
Common Mistakes & Gotchas
- Bad Microphone Input: The whole system falls apart if the “ears” don’t work. A noisy room or a poor-quality microphone will lead to bad transcriptions, which leads to nonsensical AI responses. Test in a quiet space first.
- Ignoring Latency Stacking: Each step adds a tiny delay, and your internet speed matters. As rough ballpark figures, transcription might take ~300ms, Groq ~200ms, and ElevenLabs ~400ms. If any one piece is slow, the whole experience feels clunky. This is why using fast providers like Groq and ElevenLabs is critical.
- Not Having an Exit Word: Our script has a simple “quit” command. A real application needs a more robust way to handle the end of a conversation, or the loop will run forever.
- Choosing the Wrong Voice: ElevenLabs has many voices. Some are fast and conversational (“Rachel”), others are high-quality but slower. For real-time chat, always pick a voice optimized for speed.
- No Error Handling: What happens if the Groq API is down for a moment? Or ElevenLabs? Our simple script would crash. A production system needs `try…except` blocks around every API call to handle failures gracefully (“I’m sorry, I’m having a little trouble connecting right now. Please try again in a moment.”). There’s a minimal sketch of this pattern right after this list.
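As promised, here’s a minimal sketch of that graceful-failure pattern, wrapped around the speak_text function from our script. The retry count, the one-second pause, and the fallback message are all arbitrary choices for illustration:

import time

def speak_text_safely(text, retries=2):
    """Retry speak_text() a couple of times, then fall back to printing."""
    for attempt in range(retries + 1):
        try:
            speak_text(text)  # the function defined in voice_agent.py
            return
        except Exception as e:  # production code should catch the SDK's specific errors
            print(f"Speech attempt {attempt + 1} failed: {e}")
            time.sleep(1)  # brief pause before retrying
    # Last resort: at least show the user the response as text
    print(f"(Audio unavailable) AI says: {text}")

In main(), you’d call speak_text_safely instead of speak_text, and apply the same treatment to the Groq call.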
How This Fits Into a Bigger Automation System
Our voice agent is cool, but it’s still just a brain with a mouth and ears. It has no hands. It can’t *do* anything in the real world. The next step is to connect it to other systems:
- CRM Integration: When a known customer calls, the agent can first perform a lookup in your CRM (like Salesforce or HubSpot) using their phone number. This allows it to greet them by name (“Hi Sarah! How can I help you today?”) and access their order history.
- Tool Use: We can give the agent access to tools, like a calendar API. When a user asks to book a meeting, the agent can check the calendar for open slots, offer them to the user, and book the meeting directly.
- RAG Systems: For answering questions about specific documents, we can connect the agent to a RAG (Retrieval-Augmented Generation) system. This allows it to answer hyper-specific questions from your company’s knowledge base, instead of relying on its general knowledge.
This voice interface becomes the front door to a massive, interconnected automation backend.
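To make the tool-use idea concrete, here’s a hedged sketch of the basic pattern: intercept certain requests before they reach the LLM and route them to a function instead. The keyword matching and the check_calendar function are purely illustrative placeholders; real systems usually let the model itself decide when to call a tool via function calling:

def check_calendar():
    """Hypothetical placeholder; a real version would query a calendar API."""
    return "You have openings at 2 PM and 4 PM tomorrow."

def handle_input(user_input):
    """Route tool-worthy requests to functions; send everything else to the LLM."""
    if "meeting" in user_input.lower() or "schedule" in user_input.lower():
        return check_calendar()  # tool call instead of an LLM call
    return get_ai_response(user_input)  # fall back to the Groq brain from our script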
What to Learn Next
You’ve built something that feels alive. You combined a fast brain with a fast voice and created a genuinely interactive AI. Take a moment to appreciate that. This was science fiction five years ago.
But our agent has a critical flaw: it has the memory of a goldfish. Each time you talk to it, it’s a brand new conversation. It doesn’t remember what you said 30 seconds ago.
In our next lesson, we’re going to fix that. We’re going to give our agent a memory. We’ll explore how to manage conversation history so it can have a coherent, multi-turn dialogue. We’ll upgrade it from a simple Q&A bot to a true conversational partner.
The foundation is built. Now, we start making it smart. See you in the next lesson.