Python · Apache-2.0 · voice · real-time · multimodal

LiveKit Agents

Real-time voice and video AI agents built on WebRTC infrastructure

LiveKit Agents is an open-source Python and TypeScript framework for building real-time voice and video AI agents on top of WebRTC infrastructure. It handles the full STT/LLM/TTS pipeline, semantic turn detection, interruption management, and production deployment through LiveKit Cloud. Where most agent frameworks treat audio as an afterthought, LiveKit was built around it from the start.

Most agent frameworks were designed for text. The audio capability came later, bolted on through WebSocket streaming or third-party integrations that treat voice as one more input modality. LiveKit Agents took the opposite approach. The framework started with the hard problems of real-time audio (latency, packet loss, turn detection, interruption handling) and built the LLM orchestration on top of infrastructure that was already running production WebRTC traffic.

That lineage shows in what the framework handles well and where it draws its constraints. If you're building a voice-first product, LiveKit is the most complete starting point available in open source. If you're building a text-based automation pipeline that occasionally needs to speak an output, you're carrying WebRTC overhead for a problem you could solve more simply elsewhere.

Who builds with LiveKit Agents

The clearest use case is customer-facing voice agents that have real latency requirements. A conversational assistant that pauses for 2.5 seconds before responding feels broken to users. Getting reliable sub-second round-trips from speech input to speech output requires a transport layer designed for real-time media, not a general-purpose HTTP endpoint. LiveKit's WebRTC foundation is why teams building production voice products reach for it over alternatives assembled from separate services.

Teams handling telephone traffic are another strong fit. LiveKit ships a SIP stack alongside the agent framework, which means inbound call routing and outbound dialing go through the same infrastructure as browser-based WebRTC sessions. For a company building a voice agent that handles both web calls and PSTN calls, having that in one framework eliminates a vendor.

The multi-modal angle matters for applications where voice alone isn't enough. A video interview agent that can see the candidate's face, process what they say, and respond conversationally is a plausible product. LiveKit handles the video track alongside the audio without requiring a separate video pipeline. The same session carries video, voice, and the LLM's response; coordinating those across separate services would require considerable custom work.

Core architecture

LiveKit Agents runs as a worker process that connects to a LiveKit server (either self-hosted or LiveKit Cloud) and waits for dispatch. When a user session arrives, the server routes it to an available worker according to the job scheduling rules you configure. The worker opens a room, joins it as an agent participant, and the pipeline starts.

The core pipeline is a chain from user audio input through STT, into the LLM, out through TTS, and back to the user as audio output. Each stage is a plugin, and the plugins are interchangeable. You can run Deepgram for STT, OpenAI's gpt-4o for the LLM, and Cartesia for TTS. Or you can replace the entire STT/LLM/TTS chain with the OpenAI Realtime API, where a single model takes audio in and returns audio out. That removes the network hops between stages, which reduces latency further.
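The interchangeable-stage idea can be sketched in plain Python. This is a toy model, not LiveKit's plugin interface: the `STT`, `LLM`, and `TTS` protocols, the `VoicePipeline` class, and the stand-in stages are all hypothetical names chosen for illustration.

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, transcript: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Toy stand-ins (hypothetical); a real plugin would call a provider's API here.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseLLM:
    def reply(self, transcript: str) -> str:
        return transcript.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class VoicePipeline:
    """Chains STT -> LLM -> TTS; each stage can be swapped independently."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS) -> None:
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def handle_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt.transcribe(audio_in)
        response = self.llm.reply(transcript)
        return self.tts.synthesize(response)


pipeline = VoicePipeline(EchoSTT(), UppercaseLLM(), BytesTTS())
```

Because `VoicePipeline` depends only on the three method signatures, replacing one provider means passing a different object to the constructor; `handle_turn` does not change.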

The job dispatcher is more than a load balancer. It has an API for routing specific users to specific agent workers, for passing metadata into the session at dispatch time, and for triggering agent handoffs between different worker processes. Multi-agent architectures where one agent hands a call off to a specialist are a documented pattern, not an afterthought.
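To make the dispatch behavior concrete, here is a minimal in-memory sketch of routing, metadata passing, and handoff. It does not use LiveKit's dispatch API; `Worker`, `Job`, `Dispatcher`, and the least-loaded selection rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    room: str
    agent_name: str
    metadata: Dict[str, str]


@dataclass
class Worker:
    worker_id: str
    agent_name: str  # which kind of agent this worker runs
    capacity: int    # maximum concurrent jobs
    jobs: List[Job] = field(default_factory=list)

    @property
    def load(self) -> float:
        return len(self.jobs) / self.capacity


class Dispatcher:
    """Toy dispatcher: picks the least-loaded worker for the requested agent."""

    def __init__(self, workers: List[Worker]) -> None:
        self.workers = list(workers)

    def dispatch(
        self, room: str, agent_name: str, metadata: Optional[Dict[str, str]] = None
    ) -> Tuple[Worker, Job]:
        candidates = [
            w for w in self.workers
            if w.agent_name == agent_name and len(w.jobs) < w.capacity
        ]
        if not candidates:
            raise RuntimeError(f"no available worker for agent {agent_name!r}")
        worker = min(candidates, key=lambda w: w.load)
        job = Job(room, agent_name, dict(metadata or {}))
        worker.jobs.append(job)
        return worker, job

    def handoff(self, job: Job, from_worker: Worker, to_agent: str) -> Tuple[Worker, Job]:
        """Move a session to a specialist agent, carrying its metadata along."""
        from_worker.jobs.remove(job)
        return self.dispatch(job.room, to_agent, job.metadata)
```

The handoff is modeled as a release followed by a fresh dispatch, so the specialist receives the same room and metadata the first agent had.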

Real-time voice pipeline

The voice pipeline in LiveKit Agents is not a loop that polls for audio. It's a streaming chain that processes audio incrementally as it arrives. STT output starts flowing as the user speaks, the LLM begins reasoning before the sentence is complete in some configurations, and TTS starts generating audio before the full response is ready. Each stage is streamed to the next.
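A streaming chain of this shape can be sketched with async generators. The stages below are toy stand-ins (hypothetical `fake_stt`, `fake_llm`, `fake_tts`), and the one-word-in, one-token-out LLM is a deliberate simplification; the point is only that each item flows through all three stages before the next chunk of audio is read.

```python
import asyncio
from typing import AsyncIterator, List, Tuple

Log = List[Tuple[str, str]]


async def fake_stt(audio_chunks: List[bytes], log: Log) -> AsyncIterator[str]:
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # yield control, as a network read would
        word = chunk.decode("utf-8")
        log.append(("stt", word))
        yield word


async def fake_llm(words: AsyncIterator[str], log: Log) -> AsyncIterator[str]:
    async for word in words:
        token = word.upper()  # stand-in for model output
        log.append(("llm", token))
        yield token


async def fake_tts(tokens: AsyncIterator[str], log: Log) -> AsyncIterator[bytes]:
    async for token in tokens:
        log.append(("tts", token))
        yield token.encode("utf-8")


async def run_turn(audio_chunks: List[bytes]) -> Tuple[List[bytes], Log]:
    log: Log = []
    chain = fake_tts(fake_llm(fake_stt(audio_chunks, log), log), log)
    audio_out = [frame async for frame in chain]
    return audio_out, log


audio_out, log = asyncio.run(run_turn([b"hi", b"there"]))
```

In `log`, the first word passes through the STT, LLM, and TTS stand-ins before the second audio chunk is transcribed, which is the incremental behavior the paragraph describes.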

This streaming design is what makes the advertised sub-second latency possible. A blocking pipeline that waits for complete speech, then calls the LLM with the full transcript, then waits for a complete LLM response, then sends it to TTS would easily add 2-3 seconds to every turn. Streaming removes most of that wait, because each stage only has to produce its first chunk before the next stage can start. The exact number depends on your provider latency and network conditions, but the architecture eliminates unnecessary waiting at each stage boundary.
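A back-of-the-envelope comparison shows where the difference comes from. The per-stage timings below are made-up illustrative numbers, not measurements of any provider, and the model assumes each stage can start as soon as the previous one emits its first chunk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageTiming:
    first_output_ms: int  # time until the stage emits its first usable chunk
    full_output_ms: int   # time until the stage has emitted everything


def blocking_latency_ms(stages: List[StageTiming]) -> int:
    """Each stage waits for the previous one to finish completely."""
    return sum(s.full_output_ms for s in stages)


def streaming_latency_ms(stages: List[StageTiming]) -> int:
    """Each stage starts on the previous stage's first chunk."""
    return sum(s.first_output_ms for s in stages)


# Hypothetical timings for STT, LLM, and TTS, in that order.
ASSUMED_STAGES = [
    StageTiming(first_output_ms=150, full_output_ms=400),
    StageTiming(first_output_ms=300, full_output_ms=1500),
    StageTiming(first_output_ms=120, full_output_ms=900),
]

print(blocking_latency_ms(ASSUMED_STAGES))   # 2800
print(streaming_latency_ms(ASSUMED_STAGES))  # 570
```

With these assumed numbers the blocking pipeline takes 2.8 seconds before the user hears anything, while the streaming pipeline produces first audio in under 0.6 seconds.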

STT/LLM/TTS orchestration

Plugin selection determines the cost and quality profile of your pipeline. Deepgram gives you low-latency streaming transcription with good accuracy on phone audio. The OpenAI Realtime API lets you skip the STT/TTS steps entirely by having a single model handle audio in and audio out, which is the lowest-latency configuration but ties you to OpenAI's pricing and availability. Cartesia is a popular TTS choice for its low time-to-first-audio metric on streaming responses.

The framework doesn't force a choice. You configure the pipeline in Python with the providers you want, and swapping one out is a configuration change, not an architectural refactor. The plugin interface is documented well enough that community contributions exist for providers beyond the official list.

A meaningful addition in recent releases is the OpenAI Realtime API integration. When latency is the primary constraint, running the whole STT/LLM/TTS chain through a single speech-to-speech model removes the network hops between stages. LiveKit Agents treats this as another pipeline configuration option rather than a special case, which keeps your application code consistent regardless of which approach you choose.

WebRTC infrastructure baked in

WebRTC handles more than audio transmission. It manages NAT traversal through STUN and TURN servers, adaptive bitrate for variable network conditions, packet loss concealment, and jitter buffering. For a live voice call, these aren't optional features. A user on a mobile network with variable signal needs the transport to adapt in real time, not drop packets silently.
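Jitter buffering is the easiest of these to show in a few lines. The sketch below is a simplified illustration, not how LiveKit or any WebRTC stack implements it: the `JitterBuffer` class and its fixed `depth` are assumptions, and real buffers adapt their depth and conceal lost packets instead of skipping them.

```python
import heapq
from typing import List, Optional, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


class JitterBuffer:
    """Holds a few packets so out-of-order arrivals can be played in order."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth
        self._heap: List[Packet] = []
        self._next_seq: Optional[int] = None  # lowest sequence still playable

    def push(self, seq: int, payload: bytes) -> List[Packet]:
        """Insert a packet; return any packets now ready for playout."""
        if self._next_seq is not None and seq < self._next_seq:
            return []  # arrived too late, playout has moved past it
        heapq.heappush(self._heap, (seq, payload))
        ready = []
        while len(self._heap) > self.depth:
            ready.append(self._pop())
        return ready

    def flush(self) -> List[Packet]:
        """Drain everything that is still buffered, in sequence order."""
        ready = []
        while self._heap:
            ready.append(self._pop())
        return ready

    def _pop(self) -> Packet:
        seq, payload = heapq.heappop(self._heap)
        self._next_seq = seq + 1
        return seq, payload


def play_in_order(arrivals: List[Packet], depth: int = 2) -> List[int]:
    buf = JitterBuffer(depth)
    played: List[Packet] = []
    for seq, payload in arrivals:
        played.extend(buf.push(seq, payload))
    played.extend(buf.flush())
    return [seq for seq, _ in played]
```

The tradeoff is visible in `depth`: a deeper buffer tolerates more reordering but adds delay, which is why real implementations tune it continuously against network conditions.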

Building this on raw WebSockets requires implementing each of those features yourself or accepting degraded quality. LiveKit's server handles all of it. From the agent worker's perspective, the audio arrives as a clean stream regardless of the network path it took to get there.

The server itself is open source and written in Go. You can self-host it on a single machine for development, or run it in Kubernetes for production with multiple regions. LiveKit publishes Helm charts and deployment guides. The operational surface area is real but well-documented.

A LiveKit room can carry multiple tracks: audio from the user's microphone, video from their camera, screen shares, and data channels for arbitrary messages. An agent participant in the room can subscribe to any of those tracks.

For voice agents, this means an agent can see the user's video feed and incorporate that into its reasoning. The vision capability works through the LLM's image understanding: frames from the video track can be sampled and passed to a model like Gemini or GPT-4o as visual context. A LiveKit blog post from April 2026 documents data tracks as a mechanism for sending structured data alongside audio, which opens up patterns like real-time structured data collection during a voice call.
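Sampling matters because a camera produces far more frames than a vision model needs. The generator below is a minimal sketch of time-based sampling; `sample_frames` and the `(timestamp, frame)` tuple shape are assumptions for illustration, not LiveKit's video track API.

```python
from typing import Iterable, Iterator, Optional, Tuple

Frame = Tuple[float, bytes]  # (timestamp in seconds, encoded image)


def sample_frames(frames: Iterable[Frame], interval_s: float) -> Iterator[Frame]:
    """Yield at most one frame per interval_s seconds of video."""
    next_due: Optional[float] = None
    for timestamp, image in frames:
        if next_due is None or timestamp >= next_due:
            yield timestamp, image
            next_due = timestamp + interval_s


# Three seconds of hypothetical 10 fps video with placeholder image bytes.
video = [(i / 10, b"frame-%d" % i) for i in range(30)]
sampled = list(sample_frames(video, interval_s=1.0))
```

Here 30 incoming frames are reduced to 3, one per second, which is the subset you would attach to the LLM request as visual context.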

Avatar integration extends this in the other direction. Instead of a voice agent with no visual representation, you can use Tavus or Bithuman to generate a video avatar that speaks the agent's responses. The agent pipeline runs identically; the avatar system renders the TTS audio as a photorealistic talking head. For use cases like video interviews or telehealth, this matters for user trust.

LiveKit Cloud for production deployment

Self-hosting LiveKit works and is well-documented, but running a production voice application means managing TURN servers, ensuring geographic distribution for latency, monitoring room health, and storing transcripts. LiveKit Cloud handles that infrastructure.

Beyond managed servers, LiveKit Cloud adds observability features specifically useful for voice agents: automatic transcript storage for every session, traces showing the latency breakdown across pipeline stages, and an Agent Console released in April 2026 for real-time debugging of agent behavior. These are the kinds of tools that matter when a customer reports that an agent talked over them during a specific call three days ago and you need to know why.

The pricing model is per-minute for media processing and agent compute, which means costs scale with usage rather than requiring upfront infrastructure commitment. For early-stage products, this is more economical than running dedicated servers. For high-volume deployments, the economics depend on your traffic pattern.

Where LiveKit Agents falls short

WebRTC complexity is the main cost of entry. Setting up a LiveKit server, understanding rooms and tracks, configuring STUN/TURN, and learning the dispatch API all take time before you get to agent logic. For a team that just wants a voice interface on a chatbot, that's meaningful overhead compared to a simpler WebSocket-based option.

The plugin ecosystem is narrower than that of general-purpose frameworks. LangGraph has a rich graph orchestration model and integrates with the broader LangChain tooling ecosystem. OpenAI Swarm is a simple multi-agent coordination pattern. LiveKit Agents covers the voice and video pipeline thoroughly but doesn't pretend to be a general orchestration framework. If your agent needs to pull from a vector database, run a document retrieval pipeline, or coordinate ten specialized subagents across a complex workflow, you'll integrate LiveKit as the voice transport layer while using another framework for the rest.

TypeScript support is present, but Python is where most of the examples, plugins, and community knowledge live. Python accounts for 98.8% of the repository's code by language. TypeScript teams are not blocked, but they'll find fewer ready-made examples than Python developers will.

Semantic turn detection, while significantly better than voice activity detection alone, is still not perfect. The framework's transformer-based model reduces false interruptions, but edge cases remain: overlapping speech in noisy environments, non-native speaker patterns, or very short pauses in a long sentence. Production deployments need testing against real audio samples from their user base, not just clean demo recordings.
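The difference between the two approaches can be shown with a toy decision rule. This is not LiveKit's turn detector: the function names, thresholds, and the `completion_prob` input (standing in for a model's estimate that the utterance is finished) are all assumptions for illustration.

```python
def vad_only_end_of_turn(silence_ms: int, threshold_ms: int = 500) -> bool:
    """Silence-based endpointing: any pause past the threshold ends the turn."""
    return silence_ms >= threshold_ms


def semantic_end_of_turn(
    silence_ms: int,
    completion_prob: float,
    *,
    min_silence_ms: int = 200,
    max_silence_ms: int = 3000,
    prob_threshold: float = 0.5,
) -> bool:
    """Combine silence with a model's estimate that the utterance is complete."""
    if silence_ms < min_silence_ms:
        return False  # the user is still speaking
    if silence_ms >= max_silence_ms:
        return True   # fallback: a very long pause always ends the turn
    return completion_prob >= prob_threshold


# "I'd like to book a flight to..." followed by a 600 ms pause.
# A hypothetical model scores this as unlikely to be a finished sentence.
mid_sentence_pause = (600, 0.1)
```

With the mid-sentence pause above, the silence-only rule ends the turn and the agent talks over the user, while the combined rule keeps waiting. The edge cases listed in the paragraph are situations where the probability estimate itself is unreliable, which is why testing on real audio matters.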

LiveKit Agents vs the alternatives

Against Pipecat, the main alternative in the open-source voice agent space, LiveKit's advantage is the production infrastructure. Pipecat is a capable pipeline framework, but it doesn't ship WebRTC transport or a managed cloud tier out of the box. LiveKit has both, which matters when you're moving from prototype to production. The tradeoff is that LiveKit's abstraction layer is more opinionated.

Against building on raw provider APIs (assembling Deepgram, an LLM, and ElevenLabs yourself with WebSockets), LiveKit's value is everything above the transport layer: turn detection, interruption handling, job dispatch, multi-agent handoffs, and the observability layer. Those are features you'd otherwise build and maintain yourself.

Against LangGraph for multi-agent orchestration, the comparison is mostly orthogonal. LangGraph is a graph-based orchestration framework for complex stateful workflows. LiveKit Agents is a real-time voice and video transport and pipeline framework. They solve different problems, and using both together (LangGraph for the decision logic, LiveKit for the voice transport) is a reasonable architecture.

For teams evaluating Anthropic Computer Use or similar computer-control agents, the voice angle rarely applies directly. Those are desktop automation patterns rather than real-time conversation. The architectural overlap is minimal.

For AI coding assistants, LiveKit would appear if you were building a voice-first coding assistant, something closer to a conversational pair programmer than a text-autocomplete tool. The use case is niche but real.

Getting started

The quickstart gets you to a working voice assistant in about 10 minutes with Python. Install the SDK, configure your STT and TTS providers with API keys, point the agent at a LiveKit server, and run the worker. The starter templates cover the most common pipeline configurations so you're not assembling plugins from scratch on day one.

LiveKit also ships an Agent Builder for browser-based prototyping, which means non-engineers can experiment with voice agent behavior without writing code. This is useful for demoing configurations to stakeholders or testing TTS voice selection before committing to a provider.

The test framework is worth calling out separately. LiveKit Agents ships native testing support including a judge evaluation model that can assess agent responses against criteria you define. Testing real-time voice agents has historically been hard. You can't easily unit-test something that depends on streaming audio and LLM non-determinism. The judge model provides a structured way to evaluate whether the agent handled a scenario correctly, which moves voice agent testing closer to the standard agent evaluation patterns used in text-based frameworks.
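The shape of judge-based evaluation can be sketched without the framework. Everything here is hypothetical: `evaluate_response`, `Verdict`, and `stub_judge` are illustrative names, and the stub uses substring matching where a real judge would ask an LLM whether the response satisfies the criterion.

```python
from dataclasses import dataclass
from typing import Callable, List

# A judge takes (criterion, agent_response) and returns pass or fail.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class Verdict:
    criterion: str
    passed: bool


def evaluate_response(response: str, criteria: List[str], judge: Judge) -> List[Verdict]:
    """Check one agent response against every criterion using the given judge."""
    return [Verdict(criterion, judge(criterion, response)) for criterion in criteria]


def all_passed(verdicts: List[Verdict]) -> bool:
    return all(v.passed for v in verdicts)


def stub_judge(criterion: str, response: str) -> bool:
    """Deterministic stand-in for an LLM judge: passes if the phrase appears."""
    return criterion.lower() in response.lower()


verdicts = evaluate_response(
    "Your refund was issued today.",
    ["refund", "apolog"],
    stub_judge,
)
```

Keeping the judge behind a plain callable means the same test can run against a deterministic stub in CI and a model-backed judge in a slower evaluation suite.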

Verdict

LiveKit Agents is the strongest open-source option for teams building real-time voice products. The WebRTC transport is production-grade, the STT/LLM/TTS pipeline handles the hard real-time coordination problems, and LiveKit Cloud removes most of the operational burden of running voice infrastructure. The 10,400 GitHub stars and active release cadence reflect genuine adoption, not just launch interest.

The honest qualification is that the framework's strengths are specific to voice and video. If your agent is mostly text-based with occasional audio output, the WebRTC complexity is overhead you don't need. But if real-time conversation is core to your product and the experience breaks when latency exceeds a second, LiveKit Agents is where you should start.

Key features

Streaming STT/LLM/TTS pipeline with sub-second end-to-end latency
Semantic turn detection using transformer models to cut false interruptions
Native WebRTC transport with no WebSocket workarounds needed
Multi-modal support: voice, video, vision, and text in a single session
Telephony integration via SIP for inbound and outbound call handling
MCP (Model Context Protocol) support for tool server integration
Built-in job scheduler and dispatch API for routing users to agent workers
LiveKit Cloud with observability, transcripts, and traces out of the box
Native test framework with judge evaluation for agent validation
Avatar integration with Tavus, Bithuman, and LemonSlice

Frequently Asked Questions

What is LiveKit Agents?

LiveKit Agents is an open-source Python and TypeScript framework for building real-time voice and video AI agents. It orchestrates the full pipeline from speech-to-text through an LLM to text-to-speech over a WebRTC transport layer, handling turn detection, interruptions, and multi-modal sessions. It can run self-hosted or on LiveKit Cloud.

Is LiveKit Agents free?

The core framework is Apache-2.0 licensed and free to use. You can self-host the LiveKit server and the agents worker on your own infrastructure at no cost. LiveKit Cloud, the managed deployment option, is a paid product with per-minute pricing for media processing and agent compute.

How does LiveKit compare to building voice agents with WebSockets?

WebRTC and WebSockets both carry audio, but WebRTC was designed for real-time media transport with adaptive bitrate, jitter compensation, and NAT traversal built in. For voice agents, this translates to more consistent latency under variable network conditions and proper handling of packet loss, things you have to implement manually on a raw WebSocket connection. LiveKit's own engineering team has published benchmarks showing WebRTC outperforming WebSockets for real-time voice specifically.

What STT, LLM, and TTS providers does LiveKit Agents support?

As of May 2026, the plugin library covers Deepgram for STT, with Silero for voice activity detection; OpenAI, Google Gemini, and others for the LLM; and Cartesia and OpenAI for TTS. The list is growing with each release. The framework also supports speech-to-speech models such as the OpenAI Realtime API and Google Gemini Live, which collapse the STT/LLM/TTS pipeline into a single model.

Can LiveKit Agents handle phone calls?

Yes. LiveKit ships a SIP stack that connects to standard telephone networks for both inbound and outbound calls. An agent worker that runs over WebRTC can be exposed to phone callers through the same SIP bridge without additional vendors. Caller verification patterns are documented in the LiveKit blog.

Is LiveKit Agents production-ready in 2026?

Yes. The repository has 10,400 stars, 3,271 commits on main, and 351 releases in total; v1.5.8 shipped in May 2026. LiveKit's infrastructure is known to power high-traffic voice applications at scale. LiveKit Cloud adds managed observability, automatic transcript storage, and traces for teams that want a production-grade setup without managing their own server cluster.