Voice AI Prompting Guide
Overview
This guide helps you write effective prompts for Voice AI assistants. Learn how to design, test, and refine prompts to get the best results from your agents. Use these strategies to improve your agent’s reliability, success rate, and user experience.
Why prompt engineering matters
Prompt engineering is the art of crafting clear, actionable instructions for AI agents. Well-designed prompts:
- Guide the AI to produce accurate, relevant, and context-sensitive outputs
- Improve the agent’s ability to handle requests without human intervention
- Increase your overall success rate
Poor prompts lead to ambiguous or incorrect results, limiting the agent’s utility.
Voice prompting also has constraints text prompting doesn’t. A system prompt written for a text chatbot will fail in a voice conversation, for three reasons:
- Every token costs latency. The system prompt loads into the model’s context on every turn. A bloated prompt increases time to first token, which the caller experiences as dead air.
- Spoken responses must be concise. LLMs trained on text are verbose by default. A multi-paragraph response that works in chat becomes a monologue the caller forgets.
- Turn-taking replaces scrolling. Information is fleeting. The prompt must define when to speak, when to listen, and when to ask for confirmation.
The prompt is the agent’s operating system, re-executed on every turn. It needs to be structured, unambiguous, and optimized for spoken interaction.
How to measure success
Your success rate is the percentage of requests your agent handles from start to finish without human intervention. The more complex your use case, the more you’ll need to experiment and iterate to improve this rate.
Validate prompt changes against a representative test set, not single calls. Probabilistic regressions don’t show up in one-off testing — they only become visible across many iterations.
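The arithmetic behind this is simple enough to automate. The following is a minimal sketch, assuming each test call is recorded as a pass or fail for "handled without human intervention". The `CallResult` shape, the function names, and the 2% tolerance are illustrative assumptions, not part of any platform API.

```python
from dataclasses import dataclass


@dataclass
class CallResult:
    """Outcome of one test call (hypothetical record shape)."""
    scenario: str
    completed_without_human: bool


def success_rate(results: list[CallResult]) -> float:
    """Fraction of calls handled end to end with no human intervention."""
    if not results:
        raise ValueError("need at least one call result")
    handled = sum(1 for r in results if r.completed_without_human)
    return handled / len(results)


def regressed(baseline: list[CallResult], candidate: list[CallResult],
              tolerance: float = 0.02) -> bool:
    """True if the candidate prompt scores below the baseline by more than tolerance."""
    return success_rate(candidate) < success_rate(baseline) - tolerance


# 18 of 20 calls succeed with the old prompt, 16 of 20 with the new one.
old_runs = [CallResult("booking", i < 18) for i in range(20)]
new_runs = [CallResult("booking", i < 16) for i in range(20)]
print(success_rate(old_runs), success_rate(new_runs), regressed(old_runs, new_runs))
```

Running the same scenarios many times per prompt version is what makes the comparison meaningful; a single call on each side would say very little.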
The process
Follow a structured approach to prompt engineering:
Design
Craft your initial prompt, considering the specific task, context, and desired outcome. Use the six-section structure described under Principles. Clear and detailed prompts help guide the AI in understanding your needs.
Test
Run the prompt through real calls. Evaluate whether the response aligns with your expectations and meets the intended goal. Listen end-to-end — TTS and turn-taking matter as much as content.
Principles of effective prompts
Organize prompts into sections
Break system prompts into clear sections, each focused on a specific aspect. A production voice prompt has six required sections:
Each section is covered below. A complete template is provided in the Example section at the end.
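If you assemble prompts from parts, the six-section layout can be checked in code. The sketch below is one possible approach; the section names are inferred from the headings that follow in this guide, and `build_system_prompt` is a hypothetical helper.

```python
# Section names are an assumption inferred from this guide's headings.
REQUIRED_SECTIONS = (
    "Identity",
    "Response Guidelines",
    "Guardrails",
    "Context",
    "Workflow",
    "Examples",
)


def build_system_prompt(sections: dict[str, str]) -> str:
    """Join the six sections under markdown headers, in a fixed order."""
    missing = [name for name in REQUIRED_SECTIONS
               if not sections.get(name, "").strip()]
    if missing:
        raise ValueError(f"missing prompt sections: {', '.join(missing)}")
    parts = [f"# {name}\n{sections[name].strip()}" for name in REQUIRED_SECTIONS]
    return "\n\n".join(parts)


draft = {name: f"[{name} content]" for name in REQUIRED_SECTIONS}
print(build_system_prompt(draft))
```

Failing loudly on an empty section catches a common mistake early: shipping a prompt with no guardrails or no examples.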
Define identity and personality
The identity section defines who the agent is. In voice, persona is not cosmetic — it directly influences word choice, sentence length, and emotional tone.
Include:
- Name — gives the agent presence
- Role — what the agent does in one sentence
- Tone — professional, friendly, calm, energetic
- Communication style — concise, warm, direct
Bad (text-centric):
“You are a helpful assistant that schedules appointments.”
Good (voice-centric):
“You are ‘Alex,’ a calm and efficient scheduling assistant for a dental clinic. Your tone is professional and reassuring. You speak in clear, complete sentences.”
Always include an identity lock to prevent persona manipulation:
When mentioning a tool in prompt prose, describe what the tool does (“end the call,” “transfer to a specialist,” “look up the customer”) rather than naming it by its resource ID. Long alphanumeric tool slugs in prompt prose can leak into spoken output. If the model is reluctant to call a tool, fix the tool’s description field instead.
Set response guidelines
Response guidelines control how the agent communicates. These rules prevent the most common voice issues: verbosity, unnatural formatting, and confusing speech.
Enforce conversational brevity. “Keep your responses to a maximum of two sentences. Never list more than three options at a time.” This is flow control implemented in the prompt.
Provide explicit turn-taking rules. “After providing an answer, always end your turn with a clarifying question.” This prevents the conversation from stalling.
Define a clear fallback for uncertainty. “If you do not know the answer, say: ‘I’m not able to help with that.’ Do not apologize or attempt to guess.” This gives the model a safe alternative to guessing, which reduces hallucinated answers.
One question at a time. Asking multiple questions in one turn confuses callers. Collect one piece of information, confirm it, then move to the next.
Format for voice, not text. Voice agents must handle formatting differently from text agents. Content is heard, not read.
Use spoken-form rules for all numbers, dates, currency, and other text where the written form would sound unnatural:
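As one illustration of such rules, here is a minimal Python sketch that turns written values into spoken forms. You could use it to generate prompt examples or to check transcripts. The function names and the specific conventions (digit-by-digit phone numbers, "oh" for a single-digit minute) are assumptions chosen for the example, not rules prescribed by this guide.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def speak_digits(text: str) -> str:
    """Read a number one digit at a time, skipping punctuation and spaces."""
    return " ".join(ONES[int(ch)] for ch in text if ch in "0123456789")


def small_number(n: int) -> str:
    """Words for 0 through 59, which is enough for clock times."""
    if not 0 <= n <= 59:
        raise ValueError("small_number only covers 0 to 59")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def speak_time(hhmm: str) -> str:
    """Turn 'H:MM' into a spoken form such as 'two thirty'."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    if minute == 0:
        return f"{small_number(hour)} o'clock"
    if minute < 10:
        return f"{small_number(hour)} oh {small_number(minute)}"
    return f"{small_number(hour)} {small_number(minute)}"


print(speak_digits("(555) 123-4567"))  # five five five one two three four five six seven
print(speak_time("2:30"))              # two thirty
print(speak_time("9:05"))              # nine oh five
```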
Voice agents must never output formatting that only works visually — no bold, italics, or headers; no numbered or bulleted lists (use natural connectors like “first… then… finally…”); and no links or URLs unless explicitly spoken character by character.
For more control over how your agent formats spoken output, see Voice formatting plan.
For brand names, provider names, and acronyms, include a pronunciation guide in your prompt. This can help the model output text in a form that the TTS engine is more likely to pronounce correctly — though results vary by voice provider. For more reliable control, use prompt-level hints alongside your voice provider’s pronunciation dictionary.
For pacing, use commas, semicolons, and periods in your prompt examples. These translate consistently to natural prosody across TTS providers. Heavier markup like em-dashes and SSML break tags can behave inconsistently — verify on your specific voice before depending on them.
Add guardrails
Guardrails override all other instructions. If any step in a workflow would violate a guardrail, the agent must not perform that step. Place this section prominently.
Add a silent verification step that runs before every response:
And a security notice to resist jailbreaks:
A note on negative banlists. Long enumerated “never say X, Y, Z” lists are an anti-pattern. Every banned phrase is a token in the model’s active context — and under output uncertainty, recently-activated tokens can be over-sampled, so the verbose ban effectively becomes a menu of likely outputs. Prefer a short positive principle (“do not output phone numbers”) over an exhaustive negative enumeration. Never let a banned string appear elsewhere in the prompt as an example value. If you must enumerate, keep it to 3–5 items plus a principle clause (“…or any similar narration”).
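Both checks in this note can be approximated with a small lint step before deploying a prompt. This is a rough sketch: it assumes the banlist itself appears in the prompt exactly once, so any second occurrence of a banned phrase means it has leaked into another part of the prompt. `lint_banlist` and its default threshold are hypothetical.

```python
def lint_banlist(prompt: str, banned: list[str], max_items: int = 5) -> list[str]:
    """Return warnings for banlist patterns this guide describes as risky."""
    warnings = []
    if len(banned) > max_items:
        warnings.append(
            f"banlist has {len(banned)} items; consider a short principle instead"
        )
    lowered = prompt.lower()
    for phrase in banned:
        # A second occurrence means the phrase also appears outside the banlist,
        # for example as an example value.
        if lowered.count(phrase.lower()) > 1:
            warnings.append(f"banned phrase appears more than once: {phrase!r}")
    return warnings


prompt_text = "Never say 'let me check'. Example reply: 'Let me check your file.'"
print(lint_banlist(prompt_text, ["let me check"]))
```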
Inject runtime context
Context gives the LLM the information it needs at runtime to perform its task. Without it, the agent is ungrounded and prone to hallucination.
What to inject:
Use Liquid variables to inject runtime values:
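To preview how a context section looks once values are filled in, a small substitution helper is enough for local testing. The sketch below is not a Liquid engine; it handles only plain `{{ name }}` placeholders. The variable names (`customer_name`, `current_date`, `appointment_time`) are hypothetical.

```python
import re

# Hypothetical context section using Liquid-style placeholders.
CONTEXT_TEMPLATE = (
    "Caller name: {{ customer_name }}\n"
    "Today's date: {{ current_date }}\n"
    "Open appointment: {{ appointment_time }}"
)

_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_preview(template: str, values: dict[str, str]) -> str:
    """Substitute {{ name }} placeholders and fail loudly on a missing value."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no runtime value for {name!r}")
        return values[name]
    return _VARIABLE.sub(replace, template)


print(render_preview(CONTEXT_TEMPLATE, {
    "customer_name": "Dana",
    "current_date": "March fourteenth",
    "appointment_time": "two thirty PM",
}))
```

Raising on a missing value is a deliberate choice here: an unfilled placeholder that reaches the model tends to be read aloud or silently ignored, and both are hard to notice in production.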
The prompt is not the right place to validate caller identity or other security-sensitive values. The LLM can be jailbroken into ignoring rules — the prompt is probabilistic, not deterministic. For values the model must not be able to fake, use server-side mechanisms.
Break down complex tasks
For complex interactions, define a step-by-step playbook for each conversation scenario. Write out the sequence of actions and the branching logic for each path.
If your agent handles multiple use cases, include intent routing at the top of the workflow so the agent knows which playbook to enter based on the caller’s first response.
Provide few-shot examples
Without examples, the LLM interprets your instructions unpredictably. Include at least three: a happy path, an edge case, and an error recovery.
Show the tool call syntax for each tool the agent uses, and include branching logic (what to do when a tool returns 0, 1, or many results).
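The zero, one, or many branching is easier to write as prompt prose once it is spelled out as logic. A minimal sketch, with hypothetical step names that you would translate into instructions and examples in your own prompt:

```python
def next_step(matches: list[dict]) -> str:
    """Pick the agent's next move from how many records a lookup returned."""
    if not matches:
        # Nothing found: ask for another identifier instead of guessing.
        return "ask_for_alternate_identifier"
    if len(matches) == 1:
        # Exactly one match: confirm it with the caller, then continue.
        return "confirm_and_continue"
    # Several matches: ask one question that narrows the list.
    return "ask_disambiguating_question"


print(next_step([]))
print(next_step([{"id": 1}]))
print(next_step([{"id": 1}, {"id": 2}]))
```

Each branch is a good candidate for its own few-shot example, since the "many results" case is the one models most often handle by reading the whole list aloud.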
Integrate tools and APIs
The LLM’s ability to use tools correctly depends entirely on how well you describe them. Poor tool descriptions are one of the top causes of tool invocation errors. For an overview of how tools work in Vapi, see Tools.
- Atomicity. Each tool does one thing. Prefer get_slots, book_slot, confirm_booking over a single combined tool with a mode parameter.
- Clear names. Use descriptive, distinct names. lookup_account beats api_call.
- Detailed but bounded descriptions. “Checks the calendar” is bad. “Use this tool to check for available appointment times for a specific date” is good. Be specific about when to call and when not to call.
- Meaningful parameter names with format hints. Document expected formats in the parameter descriptions.
Bad:
Good:
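To make the contrast concrete, here is an illustrative pair of tool definitions written as Python dictionaries. The schema layout follows the common JSON-schema function format, which is an assumption about your setup. The tool names, the wording, and the `undocumented_parameters` helper are hypothetical.

```python
# Vague name, vague description, undocumented parameter.
BAD_TOOL = {
    "name": "api_call",
    "description": "Checks the calendar",
    "parameters": {
        "type": "object",
        "properties": {"d": {"type": "string"}},
    },
}

# Distinct name, when-to-call guidance, parameter with a format hint.
GOOD_TOOL = {
    "name": "get_available_slots",
    "description": (
        "Use this tool to check for available appointment times for a "
        "specific date. Call it after the caller has named a date. "
        "Do not call it to book or cancel an appointment."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "date": {
                "type": "string",
                "description": "Appointment date in YYYY-MM-DD format, e.g. 2025-03-14",
            }
        },
        "required": ["date"],
    },
}


def undocumented_parameters(tool: dict) -> list[str]:
    """List parameter names that have no description."""
    props = tool.get("parameters", {}).get("properties", {})
    return [name for name, spec in props.items() if not spec.get("description")]


print(undocumented_parameters(BAD_TOOL))   # ['d']
print(undocumented_parameters(GOOD_TOOL))  # []
```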
Always set an explicit description on transfer and end-call tools. If you leave them blank, the auto-generated description may bias the model against calling them. See Built-in call tools for details on transfer and end-call tools.
Keep tool responses short and structured. Anything you return is visible to the LLM on the next turn — don’t include fields the model doesn’t need, and never return sensitive values you don’t want in conversation history.
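One way to enforce this on the server side is an allowlist applied just before the tool result is returned. The following is a minimal sketch; the field names are hypothetical and `trim_tool_response` is not part of any SDK.

```python
import json

# Hypothetical fields the model actually needs in order to speak the result.
ALLOWED_FIELDS = ("slot_id", "start_time", "provider_name")


def trim_tool_response(raw: dict) -> str:
    """Keep only allowlisted fields and serialize compactly."""
    trimmed = {key: raw[key] for key in ALLOWED_FIELDS if key in raw}
    return json.dumps(trimmed, separators=(",", ":"))


backend_record = {
    "slot_id": "s1",
    "start_time": "2025-03-14T14:30",
    "patient_ssn": "000-00-0000",   # sensitive: must never reach the model
    "debug": {"latency_ms": 412},   # noise: the model does not need it
}
print(trim_tool_response(backend_record))
# {"slot_id":"s1","start_time":"2025-03-14T14:30"}
```

An allowlist is safer than a blocklist here: a field added to the backend later stays hidden from the model until someone decides it is needed.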
For slow tools, use tool messages instead of prompt instructions. Knowledge-base lookups and API requests can take a few seconds. Without an acknowledgment, the caller hears silence and assumes the agent froze. The reliable way to handle this is by configuring a request-start message on the tool itself — Vapi plays it automatically when the tool fires, without depending on the LLM to generate an acknowledgment first.
This is more reliable than prompting the LLM to acknowledge: the message is guaranteed to play, and you don’t pay for LLM generation latency on top of tool latency.
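As a rough sketch of what that configuration might look like, the helper below attaches a request-start message to a tool definition held as a Python dictionary. The `messages` list with a `type` of `request-start` reflects my understanding of the tool schema and should be treated as an assumption; verify the exact field names against the Tools reference. The tool name `search_knowledge_base` is hypothetical.

```python
import copy


def with_request_start(tool: dict, content: str) -> dict:
    """Return a copy of a tool config with one request-start message attached."""
    updated = copy.deepcopy(tool)
    # Drop any earlier request-start entry so the tool carries exactly one.
    messages = [m for m in updated.get("messages", [])
                if m.get("type") != "request-start"]
    messages.append({"type": "request-start", "content": content})
    updated["messages"] = messages
    return updated


# Hypothetical knowledge-base lookup tool; only the "messages" entry matters here.
kb_tool = with_request_start(
    {"type": "function", "function": {"name": "search_knowledge_base"}},
    "One moment while I look that up.",
)
print(kb_tool["messages"])
```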
Collect information smoothly
Collecting information over voice is harder than over text. These patterns minimize friction:
- One field at a time. Don’t ask for name, date of birth, and phone number in one turn. Collect, confirm, move on.
- Use caller ID when available. “I see you’re calling from (555) 123-4567. Is this the number on your account?” saves the caller from spelling it.
- Spell back names and emails. Voice transcription is imperfect on proper nouns.
- Batch confirmation at the end. After collecting all fields individually, confirm everything at once. If a correction is needed, update only that field — don’t re-confirm everything from the top.
Silent transfers
When the agent decides the caller needs to be transferred, it should not generate any spoken response first. Instead, it should silently call the transfer tool. This keeps the handoff seamless and avoids confusing the caller. For more on this pattern, see Silent handoffs.
If your transfer tool isn’t firing reliably, check the tool’s description field first — auto-generated descriptions on transfer tools can bias the model against calling them.
Include fallback and error handling
Always include fallback options and error-handling mechanisms in your prompts so the agent responds predictably when things go wrong.
Unclear input:
Tool failures:
Out-of-scope requests:
Making your agent sound human
The techniques above will get you a reliable, well-structured voice agent. The techniques in this section are what make callers say “wait — that was AI?”
Design disfluency into the prompt
LLMs default to clean, polished output. In text, that’s a feature. In voice, it’s the uncanny valley. Real people stutter, restart sentences, and drop filler words. If your agent doesn’t, callers will notice — even if they can’t articulate why.
Disfluency isn’t a bug to tolerate; it’s a design pattern to implement deliberately:
- Define a disfluency vocabulary: fillers (um, uh, like, so, well), thinking sounds (let me see, hmm, one sec), stutters (I-I think so, w-well), self-corrects (“It’s at 3, wait, no, 2:30”), and trail-offs (“so if we go that route then…”)
- Set a frequency target: 2–4 disfluencies per turn is a good baseline for conversational agents. Too few sound robotic; too many sound glitchy.
- Add a self-monitoring instruction: “If a turn comes out as one clean, polished sentence with no disfluency, you’ve drifted off-character. Add a filler and try again.” This gives the model a way to self-correct.
Example prompt section:
Disfluency only works when it’s calibrated to the agent’s persona. A casual sales rep can stutter freely. A clinical triage agent should use lighter disfluency — more “let me see” and “one moment” than “uh” and “like.” Match the disfluency vocabulary to the role.
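When reviewing test transcripts, a rough counter can flag turns that fall outside the frequency target. The sketch below is a heuristic only: it counts fillers and thinking sounds from the vocabulary above and ignores stutters and self-corrects. Words such as "so" and "like" are counted even when used normally, so treat the result as a signal, not a measurement. The function names and thresholds are assumptions.

```python
import re

# Fillers and thinking sounds taken from the vocabulary list above.
FILLERS = ("um", "uh", "like", "so", "well", "hmm", "let me see", "one sec")


def count_disfluencies(turn: str) -> int:
    """Count whole-word filler occurrences in one agent turn."""
    lowered = turn.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(filler)}\b", lowered))
        for filler in FILLERS
    )


def within_target(turn: str, low: int = 2, high: int = 4) -> bool:
    """True if the turn lands inside the 2 to 4 disfluency baseline."""
    return low <= count_disfluencies(turn) <= high


print(count_disfluencies("Um, let me see, I think Tuesday works."))  # 2
print(within_target("Tuesday at two thirty works."))                  # False
```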
Build rapport, not just answers
The difference between a voice agent that feels like a form and one that feels like a conversation is rapport — reacting to what the caller says like a real person would.
There are two kinds of rapport moments:
Personal-share rapport. When the caller mentions something personal (“sorry, long Monday”), react before moving on. Two moves to choose from (pick one, not both):
- Quick follow-up question — specific and curious, not generic. “Oof, yeah — what’s eating up the day?” Then, after their response, briefly acknowledge and return to the task.
- Small personal anecdote — one sentence, mundane, slightly self-deprecating. “Oof, mine too — three meetings before lunch and somehow still behind. Okay so — what are you exploring?”
Industry/context rapport. When the caller tells you about their company or situation, riff on it for a beat before moving to the next question. One specific observation about their industry, then back to the flow.
Keep rapport to 1–2 turns max. If the caller doesn’t engage with it (one-word answer, deflects), drop it and move on. You’re reading energy, not running a script.
Distinguish banter from off-topic
Not every unexpected response is an error. If a caller cracks a joke, asks if you’re real, or drops a cheeky comment — that’s banter, and your agent should engage with it. Treating banter as an off-topic violation makes your agent sound like a humorless intake bot.
Define two separate handling paths in your prompt:
Light banter (engage, then continue):
Hard off-topic (redirect with escalation):
Match the caller’s energy
Not every caller communicates the same way. A crisp, time-pressed caller wants efficiency. A chatty, curious caller wants warmth. Your prompt should tell the agent to adapt:
This is especially important for disfluency — a chatty caller won’t mind extra fillers, but a time-pressed caller will find them annoying.
Budget your conversation length
Voice calls have a natural tolerance window. Too short feels abrupt; too long feels like a survey. Define a turn budget in your prompt:
The exact number depends on your use case — a simple appointment booking might be 5–7 turns, while a qualification intake might be 8–12. The point is to set an explicit target so the agent doesn’t let conversations drift.
Control emotional expression frequency
Emotional expressions like laughter are powerful because they’re rare. Without frequency rules, the LLM tends to overuse them — every turn opens with “haha” and the agent sounds manic.
This same principle applies to other emotional markers — exclamation marks, elongated words (“niiice”), and reaction sounds (“oh man”). Sprinkle, don’t pour.
Use incremental tool calls
For tools that capture data (like a lead capture or CRM update), don’t wait until you have every field to call the tool. Call it incrementally — one field at a time, as soon as you hear it. This ensures data isn’t lost if the call drops mid-conversation.
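The pattern can be sketched with an in-memory stand-in for the capture tool. `LeadCapture` and `handle_turn` are hypothetical; in a real deployment the `upsert` call would be your CRM or lead-capture tool, and the "heard" fields would come from the model's tool arguments.

```python
class LeadCapture:
    """In-memory stand-in for a CRM upsert tool (hypothetical)."""

    def __init__(self) -> None:
        self.saved: dict[str, str] = {}
        self.calls = 0

    def upsert(self, field: str, value: str) -> None:
        self.saved[field] = value
        self.calls += 1


def handle_turn(crm: LeadCapture, heard: dict[str, str]) -> None:
    """Persist each new or changed field right away instead of batching."""
    for field, value in heard.items():
        if crm.saved.get(field) != value:
            crm.upsert(field, value)


crm = LeadCapture()
handle_turn(crm, {"name": "Dana"})                     # turn 1: name heard
handle_turn(crm, {"name": "Dana", "company": "Acme"})  # turn 2: company heard
print(crm.saved, crm.calls)
```

If the call drops after the first turn, the name is already stored. Skipping unchanged fields keeps repeated turns from producing duplicate writes.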
When to skip read-backs
The information collection patterns above recommend batch confirmation at the end. That works well for transactional flows where accuracy is critical — booking an appointment, processing a return, updating account details.
But for intake and qualification flows, read-backs make the call feel like a form. If your agent is collecting soft data (interest level, use case, timeline), trust what you heard and move on:
Use read-backs when: the data has to be exact (appointment times, spelling of names for records, email addresses).
Skip read-backs when: you’re collecting intent, preference, or soft qualification data. A simple “got it” or “sweet” is enough.
Manage call endings deliberately
How a call ends matters as much as how it begins. Define specific rules for when to end and when not to:
Additional tips
- Iterate as much as possible. Prompt quality comes from experimentation: refine prompts through trial and error, and check each change against your test set, to get more precise, relevant responses.
- Structure your prompt with markdown headers so each section is clearly delineated. (This is about prompt structure, not agent output — your agent’s spoken responses should never contain markdown formatting.)
- Match tone to context. A sales agent calling new leads will sound different from a clinical triage agent. Define tone explicitly rather than relying on defaults.
Common issues
Voice agents fail in predictable ways. Watch for these anti-patterns:
Porting a text chatbot prompt. Vague single-paragraph prompts without structure produce long, unfocused responses. Use the six-section structure.
No guardrails. Agents without guardrails will eventually provide medical/legal/financial advice, fabricate prices, engage with off-topic conversations, or reveal internal system information.
No few-shot examples. Without examples, the model interprets your instructions in unpredictable ways. Even 2–3 examples make a significant difference.
Multiple questions per turn. “What’s your name, date of birth, and the reason for your call?” Sequence questions one at a time, confirming as you go.
Long monologues. Listing five plan features back-to-back is a chat pattern. In voice, offer two and ask if they want to hear more.
Vague tool descriptions. If the LLM consistently picks the wrong tool or passes bad parameters, the problem is almost always in the tool description — not the prompt. See Tools for best practices.
No identity lock. Without one, callers can manipulate the agent into adopting different personas or revealing its prompt.
Verbose negative banlists. Long “never say X” lists can prime the banned phrases as high-activation tokens. Prefer a short positive principle over an exhaustive negative enumeration.
Tool resource IDs in prose. Referring to a tool by its resource ID rather than its capability can cause the model to emit the ID as spoken content. Always refer to tools by what they do.
Treating the prompt as a security boundary. The prompt is probabilistic and can be jailbroken. For values the model must not be able to fake, use server-side mechanisms.
Numbers sound robotic. Spell out numbers in the spoken form (five five five, not 555). See the spoken-form rules under Response guidelines.
Example: Complete prompt template
Use this as a starting point. Replace the bracketed sections with your own content.
Additional resources
Check out these additional resources to learn more about prompt engineering: