Vapi
Developer-focused voice AI platform for building production-grade voice agents via API
Vapi is a developer-first API platform for building production-grade voice agents. You bring your own LLM, STT provider, and TTS provider, and Vapi handles the real-time orchestration layer that makes them talk to each other with low latency. Pricing is Pay-As-You-Go at around $0.05 per minute, and new accounts start with $200 in free credits. It is used by YC startups and mid-market companies that want control over every layer of their voice stack.
Vapi launched in 2022 with a clear bet: developers building voice agents don't want an opinionated black box. They want infrastructure that handles the hard real-time coordination work while letting them choose every other layer themselves. A few years in, that bet looks correct. The platform sits at the center of a growing ecosystem of YC-backed companies and enterprise teams that treat voice as core product infrastructure rather than an afterthought.
The right way to think about Vapi is as a coordination layer, not a complete product. It connects a speech-to-text provider, a language model, and a text-to-speech provider into a pipeline that can hold a real-time phone conversation, handle interruptions, manage turn-taking, call external APIs mid-conversation, and stream audio back to the caller with sub-500ms latency on well-configured setups. What Vapi doesn't do is make decisions about which providers you should use. That modularity is the point.
Quick verdict
If you're a developer who wants control over the full voice stack and is willing to do the component selection work, Vapi is probably where you should start. The $200 in free credits gives you real runway. The TypeScript SDK is well-documented. The Pay-As-You-Go pricing at around $0.05 per minute is among the lowest entry points in the category. The trade-off is that "modularity" in practice means you're managing three provider relationships (STT, LLM, and TTS) on top of Vapi itself, and debugging issues that might originate in any one of them.
What Vapi actually does
The real-time voice pipeline problem is harder than it sounds. When a user speaks, you need to detect the end of their utterance, transcribe what they said, pass it to a language model, get a response, convert that response to audio, and start streaming the audio back before the user starts to wonder if the call dropped. That whole sequence needs to complete in under a second for the conversation to feel natural. Building that coordination layer from scratch, with proper streaming, turn-taking logic, and interruption handling, is weeks of engineering work even for an experienced team.
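To make the timing constraint concrete, here is a minimal sketch of a per-turn latency budget. The stage names follow the sequence described above, but every number is a hypothetical estimate chosen for illustration, not a measurement of Vapi or any provider, and `Stage`, `total_latency_ms`, and `within_budget` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int  # illustrative estimate, not a measurement


def total_latency_ms(stages: list[Stage]) -> int:
    """Sum the per-stage latencies for one conversational turn."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages: list[Stage], budget_ms: int = 1000) -> bool:
    """Return True if the whole turn fits inside the response budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical numbers for a single turn; measure your own providers.
turn = [
    Stage("end-of-utterance detection", 200),
    Stage("speech-to-text", 150),
    Stage("LLM first token", 300),
    Stage("text-to-speech first audio", 150),
    Stage("network and telephony", 100),
]

print(total_latency_ms(turn))  # 900
print(within_budget(turn))     # True
```

Even with optimistic estimates, the stages leave little slack under a one-second budget, which is why streaming each stage into the next matters more than speeding up any single one.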
Vapi solves that specific problem. You configure which STT provider to use (Deepgram is popular for latency, AssemblyAI for accuracy on noisy audio), which LLM to call (OpenAI GPT-4o for quality, Groq's Llama inference for speed), and which TTS provider to use for voice output (ElevenLabs for quality, Play.ht for variety). Vapi handles the routing, the streaming, the phone infrastructure, and the webhook delivery.
The result is that a developer can go from zero to a working inbound phone agent in a few hours rather than a few weeks.
The provider modularity argument
Vapi's architecture means your total cost per conversation minute is the sum of four separate line items: Vapi's own $0.05/min, plus your STT provider's per-minute rate, plus your LLM's token cost, plus your TTS provider's character cost. This sounds complicated, but it has a real advantage: you can optimize each layer independently.
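The per-minute components described above can be added up with a few lines of arithmetic. This is a rough sketch: only the $0.05 platform rate comes from this profile, while the STT, LLM, and TTS figures are hypothetical per-minute equivalents (token and character pricing has to be converted to a per-minute estimate for your own traffic before it can be summed this way).

```python
def all_in_cost_per_minute(platform: float, stt: float, llm: float, tts: float) -> float:
    """Add the platform fee and the three provider costs, all in USD per minute."""
    return round(platform + stt + llm + tts, 4)


VAPI_RATE = 0.05  # platform rate quoted in this profile

# Hypothetical component rates, already converted to USD per minute.
quality_stack = all_in_cost_per_minute(VAPI_RATE, stt=0.01, llm=0.04, tts=0.10)
budget_stack = all_in_cost_per_minute(VAPI_RATE, stt=0.01, llm=0.02, tts=0.07)

print(quality_stack)  # 0.2
print(budget_stack)   # 0.15
```

Keeping the four terms separate makes it obvious which layer dominates the bill, and therefore which provider swap would move the total most.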
If you're building an application where voice naturalness matters, you use ElevenLabs for TTS and accept the higher character cost. If you're building an internal tool where speed matters more than voice quality, you use a faster, cheaper TTS provider. If you have existing OpenAI credits, you point Vapi at GPT-4o. If you need faster inference for better conversation feel, you switch the LLM to Groq without changing anything else.
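One way to picture that independence is to treat the stack as a small immutable record and change one field at a time. The sketch below is an illustration only: `VoiceStack` and the provider labels are hypothetical placeholders, not Vapi configuration keys or model identifiers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceStack:
    stt: str
    llm: str
    tts: str


# Hypothetical labels; consult each provider's docs for real identifiers.
quality = VoiceStack(stt="deepgram", llm="openai/gpt-4o", tts="elevenlabs")

# Swap only the LLM layer; the STT and TTS choices carry over unchanged.
fast = replace(quality, llm="groq/llama")

print(fast)
```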
That granularity of control is rare in the voice agent space. Most competitors bake in their choices and charge a single per-minute rate that bundles everything. Vapi's approach serves developers who have already formed opinions about which AI providers they trust.
Developer experience
The TypeScript server SDK is the primary integration path. The documentation is reasonably thorough, with working examples for the common patterns: inbound call handling, outbound call triggering, function calling during a conversation, and webhook processing. The community Discord is active enough that most common questions have been asked and answered.
The web dashboard handles phone number provisioning, call logs, and basic monitoring. In early 2026 it's functional but not sophisticated. You can see what happened on a call, pull the transcript, and check the recording. You can't yet do complex analytics or A/B testing of prompts directly from the dashboard without external tooling. Teams building at scale tend to funnel call data into their own analytics infrastructure via webhooks rather than relying on what the dashboard surfaces.
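As an illustration of the kind of external tooling teams build, the sketch below summarizes call events by prompt variant. The event shape (`type`, `variant`, `duration_s`) is an assumption made up for this example and does not reflect Vapi's actual webhook schema; map the fields from whatever payload your integration actually receives.

```python
from collections import defaultdict
from statistics import mean


def summarize_calls(events: list[dict]) -> dict[str, dict[str, float]]:
    """Group finished calls by prompt variant and report count and mean duration."""
    durations: dict[str, list[float]] = defaultdict(list)
    for event in events:
        if event.get("type") != "call.ended":
            continue  # ignore events that do not mark a finished call
        durations[event["variant"]].append(event["duration_s"])
    return {
        variant: {"calls": len(values), "avg_duration_s": mean(values)}
        for variant, values in durations.items()
    }


# Hypothetical events, as they might look after your webhook handler stores them.
events = [
    {"type": "call.ended", "variant": "A", "duration_s": 120},
    {"type": "call.ended", "variant": "A", "duration_s": 180},
    {"type": "call.ended", "variant": "B", "duration_s": 90},
    {"type": "transcript", "variant": "B"},
]

summary = summarize_calls(events)
print(summary)
```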
Function calling during conversations is one of the more powerful features and worth understanding. You can define tools that Vapi will call when the language model decides it needs external data: check a user's account status, look up appointment availability, submit a form. The tool call happens mid-conversation, the result comes back to the LLM context, and the conversation continues. This is how voice agents go from "talking FAQ bot" to "agent that actually does things."
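The server side of that loop can be sketched as a small dispatcher: look up the requested tool by name, run it with the supplied arguments, and return a JSON string that goes back into the LLM context. The payload shape (`name`, `arguments`), the `tool` decorator, and `check_availability` are all hypothetical and are not Vapi's actual request format; adapt the field names to the payload your endpoint really receives.

```python
import json
from typing import Any, Callable

ToolFn = Callable[..., dict[str, Any]]
TOOLS: dict[str, ToolFn] = {}


def tool(name: str) -> Callable[[ToolFn], ToolFn]:
    """Register a function under the name the LLM will use to call it."""
    def register(fn: ToolFn) -> ToolFn:
        TOOLS[name] = fn
        return fn
    return register


@tool("check_availability")
def check_availability(date: str) -> dict[str, Any]:
    # Stand-in for a real calendar or CRM lookup.
    return {"date": date, "slots": ["09:00", "14:30"]}


def handle_tool_call(payload: dict[str, Any]) -> str:
    """Dispatch one tool call and return a JSON string for the LLM context."""
    name = payload.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = fn(**payload.get("arguments", {}))
    except TypeError as exc:
        # Typically raised when arguments are missing or unexpected.
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


print(handle_tool_call({"name": "check_availability", "arguments": {"date": "2026-03-01"}}))
```

Returning a structured error instead of raising lets the model apologize or retry in conversation rather than leaving the caller in silence.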
Pricing in practice
At $0.05 per minute from Vapi, plus typical component costs, a realistic all-in cost for a well-configured agent using ElevenLabs TTS and a capable LLM comes to $0.15 to $0.25 per minute. That's competitive with Retell AI's $0.07/min base rate once you account for the fact that Retell's rate includes their proprietary speech layers.
The $200 in free credits is meaningful. At Vapi's $0.05/min plus reasonable component costs, you can run hundreds of test conversations before spending your own money. That's a real evaluation budget, not a token gesture.
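A quick back-of-the-envelope check of that claim, under two loud assumptions: that the whole $200 is spent at the all-in rate of $0.15 to $0.25 per minute quoted above (whether credits actually cover third-party provider charges is not confirmed here), and that a test call averages three minutes, which is a hypothetical figure. Amounts are kept in integer cents to avoid floating-point rounding.

```python
def runway_minutes(credits_cents: int, rate_cents_per_min: int) -> int:
    """Whole minutes of calling that a credit balance covers at a given rate."""
    return credits_cents // rate_cents_per_min


def runway_calls(credits_cents: int, rate_cents_per_min: int, avg_call_minutes: int) -> int:
    """Whole calls of a given average length that fit into those minutes."""
    return runway_minutes(credits_cents, rate_cents_per_min) // avg_call_minutes


CREDITS_CENTS = 200 * 100  # the $200 starting credit

for rate in (15, 25):  # all-in cents per minute, low and high end
    minutes = runway_minutes(CREDITS_CENTS, rate)
    calls = runway_calls(CREDITS_CENTS, rate, avg_call_minutes=3)
    print(rate, minutes, calls)
# 15 1333 444
# 25 800 266
```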
For teams running significant call volume, the enterprise pricing conversation is worth having. Vapi has custom arrangements for teams with predictable volume, and in most cases the per-minute rate drops below the Pay-As-You-Go price at scale.
Where Vapi fits and where it doesn't
Vapi is a strong fit for development teams building voice as a product feature. If you have engineers who can work with APIs, you want control over your speech and language model choices, and you're building something where the voice experience matters to your users, Vapi gets you there faster than building the orchestration layer yourself.
It's a weaker fit for non-technical teams or businesses that want to configure a voice agent through a UI rather than code. There's no meaningful no-code interface. You need to be comfortable with API configuration, webhooks, and provider management to use it effectively. For teams in that position, Synthflow and similar no-code platforms are more appropriate starting points.
It's also a weaker fit if you need enterprise-grade support SLAs from day one. The base tier's support response times are adequate for development but can be frustrating when debugging production issues under time pressure.
Vapi vs the field
Vapi vs Retell AI
Retell AI is the most direct comparable. Both are developer-focused, API-first platforms with similar use cases. Retell is more opinionated, bundles its own speech processing, and claims sub-800ms latency with emotion-adaptive dialogue that adjusts based on caller state. Vapi gives you more component flexibility. If you want a single vendor and a tuned out-of-the-box experience, Retell is worth evaluating alongside Vapi. If you have specific provider preferences or existing contracts, Vapi's modularity wins.
Vapi vs Bland AI
Bland AI focuses heavily on outbound calling infrastructure. It has phone number management and dialing automation baked in, which makes it a faster path to outbound call campaigns. Vapi can do outbound calling but it requires more configuration. If your primary use case is high-volume outbound dialing, Bland's specialized focus gives it an edge on that specific workflow.
Vapi vs ElevenLabs Conversational AI
ElevenLabs has its own Conversational AI platform that's an end-to-end solution with ElevenLabs' voice quality baked in. If voice naturalness is your primary concern and you're happy with a more integrated stack, ElevenLabs Conversational AI is worth comparing. Vapi lets you use ElevenLabs as the TTS layer while keeping flexibility elsewhere, so the two aren't mutually exclusive.
Getting started
The fastest path is: create an account at vapi.ai, claim the free credits, and work through the quickstart documentation for inbound call handling. That gets a functional agent on a real phone number in a few hours. The TypeScript SDK is the cleanest integration path for teams working in Node.js. The REST API works for any other language.
Before you pick your component providers, spend an hour with Deepgram's free tier to understand STT options, and run your intended script through ElevenLabs or Play.ht to hear the TTS output before you commit. Voice quality has a bigger impact on how users perceive your agent than almost any other variable, and making that choice early saves painful migration later.
The function calling system is where the real power is. Getting a basic agent running is a few hours of work. Getting an agent that calls your CRM, checks inventory, and routes edge cases to human agents is the interesting engineering problem, and Vapi's webhook architecture is designed for exactly that kind of integration.
For teams comparing the full voice agent landscape, the profiles on Retell AI, Bland AI, and Deepgram cover the closest adjacent tools in detail.
Key features
- Real-time streaming voice with sub-500ms response latency on well-configured setups
- Bring your own LLM: works with OpenAI, Anthropic, Groq, Together, and local models
- Bring your own STT and TTS providers including Deepgram, ElevenLabs, and Play.ht
- Phone number provisioning and outbound/inbound call management via API
- Function calling and tool use for external integrations mid-conversation
- Server-side webhooks for call events, transcripts, and custom business logic
- Multi-language support via swappable STT and TTS providers
- Call recording and real-time transcription with speaker diarization
Pros and cons
Pros
- Full modularity: swap any LLM, STT, or TTS provider independently without changing the rest of your stack
- Pay-As-You-Go pricing with no monthly minimums makes it accessible for early-stage projects
- $200 in free credits gives real runway to build and test before spending money
- Server-side webhooks give you clean hooks into business logic without awkward polling
- Active developer community and TypeScript SDK make integration faster than building from scratch
- Handles both inbound and outbound phone calls through the same API
Cons
- The modularity is also the complexity: you need to understand each provider's pricing separately
- No built-in voice quality guarantee since voice quality depends on which TTS provider you choose
- Dashboard and monitoring tooling is less mature than some enterprise-focused competitors
- Support response times on the base tier can be slow for debugging production issues
Who is Vapi for?
- Customer support bots handling inbound calls with full CRM integration via function calling
- Outbound appointment reminder and confirmation calls with natural conversation flow
- Lead qualification calls that hand off to human agents when intent score crosses a threshold
- Voice-driven internal tools where employees interact with business systems by phone
Alternatives to Vapi
If Vapi isn't quite the right fit, the closest alternatives are Retell AI, Bland AI, ElevenLabs, and Deepgram. See our full Vapi alternatives page for side-by-side comparisons.