In “How to Set Up Voice AI Agents Using LiveKit + Twilio (Step by Step Guide)” you’ll learn how to connect LiveKit and Twilio to build an inbound AI voice agent that you can call from your phone. The guide walks you through real code with Cursor and shows practical setup so you finish with an agent that answers calls and holds natural conversations.
You’ll move through concise sections covering account setup, Cursor and Notion guidance, initial project setup and environment (ENV) configuration, inbound agent testing, Twilio and LiveKit configuration, agent code, and final testing, with timestamps for each step. Follow the examples and timestamps to reproduce the build and test the agent directly from your phone.
Overview and goals
Explain the objective: create an inbound voice AI agent reachable by phone using LiveKit + Twilio
You want to build an inbound voice AI agent that people can call from a regular phone number and have a real-time, conversational interaction. The objective is to bridge the PSTN (public switched telephone network) to a real-time audio routing layer (LiveKit) while injecting an AI agent (Cursor or another runtime) that can listen, maintain context, and reply with synthesized speech. The whole system needs to accept calls, stream audio into an AI pipeline, and return generated audio back into the call.
Define success criteria: answer calls, maintain conversational context, connect audio through WebRTC/SIP
Success means your system answers an incoming phone call, maintains conversation context across turns, and reliably routes audio in both directions. Practically, that includes: the call is answered by your service, audio is sent from Twilio into LiveKit (or directly to your AI runtime), the AI receives and transcribes the caller’s speech, your model produces a contextual reply, the reply is synthesized to audio and played back into the call, and context is persisted or retrievable so follow-up utterances are coherent.
High-level summary of components: Twilio for PSTN, LiveKit for real-time audio routing, Cursor or VAPI for AI
You’ll use Twilio to receive PSTN calls and act as the front door with phone numbers and webhooks. LiveKit will handle real-time audio routing and session management so your agent and any monitoring clients can join a room and exchange audio via WebRTC or SIP. Cursor (or another AI runtime like VAPI) will be responsible for speech-to-text, model inference for conversational responses, and text-to-speech. A lightweight server mediates webhooks, token generation, and integration between Twilio, LiveKit, and the AI runtime.
Expected outcomes from the guide: working local demo, deployed service, testing steps
By following this guide you should be able to run a local demo where a phone call hits your local server (exposed via ngrok), joins a LiveKit room, and the AI participates in the call. You’ll also have steps for deploying the service to a cloud provider, instructions to test end-to-end behavior, and a checklist for monitoring and scaling. The guide will leave you with a reproducible repo structure, environment variable strategy, and testing tips.
Prerequisites and tools
Accounts required: Twilio account with phone number, LiveKit account/cluster, Cursor or chosen AI runtime
Before you start, create accounts for the main services. You’ll need a Twilio account and at least one phone number capable of voice. You’ll need a LiveKit project or cluster with API credentials and a server URL. Finally, sign up for Cursor or your chosen AI runtime and obtain API keys for speech-to-text and text-to-speech. Having these accounts ready prevents interruptions while wiring everything together.
Developer tools: Node.js or Python runtime, Git, npm/yarn or pip, ngrok or equivalent tunneling tool
Set up a development environment: Node.js (or Python) depending on your stack, Git for version control, and a package manager like npm/yarn or pip. Install ngrok or an equivalent tunneling tool so Twilio can reach your local machine during development. You’ll also need a basic editor and terminal workflow.
Optional tools and docs: Notion guide for notes, Postman for webhook testing, logs viewer
Optional but useful: a Notion page or README to track config values and test cases, Postman for testing webhook payloads, and a logs viewer (or the provider’s dashboard) to inspect request traces and errors. These help with debugging complex call flows.
Permissions and limits to check: Twilio trial restrictions, LiveKit plan limits, API rate caps
Verify any account restrictions: Twilio trial accounts often limit outbound calls, require verified numbers, and play a trial notice at the start of each call. LiveKit plans may cap participant count, concurrent rooms, or bandwidth. Your AI runtime can also have rate limits and cost implications. Check these in advance to avoid hitting hard limits during testing.
Account setup and initial configuration
Create and verify Twilio account, buy or port a phone number, review Twilio console basics
Create and verify your Twilio account and complete identity verification steps. Buy a phone number that supports voice in the region you expect callers. Familiarize yourself with the Twilio console so you can see incoming call logs, configure webhooks, and inspect error codes.
Create LiveKit project/cluster, note API keys and server URL, set room policies and permissions
Create a LiveKit cluster or project and note down the API key, secret, and the server URL you’ll use for token generation and client connections. Decide region or cluster based on your expected caller locations so you minimize latency. Think about room policies such as maximum participants and whether rooms are audio-only.
Sign up for Cursor (or alternative) and provision API keys for AI agent runtime
Sign up for Cursor or your AI runtime and provision API keys. Make sure you can access endpoints for speech-to-text, text-generation, and text-to-speech as needed. Test a minimal request from the command line to ensure your keys work.
Organize a Notion guide or README to track configuration values and test cases
Create a central README or Notion page to record all configuration values, webhook URLs, test phone numbers, and expected behavior for each test case. This will speed up troubleshooting and make onboarding team members easier.
Architecture and call flow design
Diagram verbal description: PSTN call -> Twilio number -> webhook -> signal LiveKit session -> agent AI handles audio -> Twilio bridges audio
Picture the flow: a caller dials your Twilio phone number and Twilio sends an HTTP webhook to your server. Your server responds by instructing Twilio to send media into a WebRTC or SIP endpoint that connects to LiveKit. Your agent (or a worker) joins the corresponding LiveKit room, receives the inbound audio, and passes audio frames to the AI runtime for transcription and response generation. The AI’s synthesized audio is routed back through LiveKit and bridged to the Twilio call so the caller hears it.
Decide media path: Twilio Programmable Voice via TwiML to WebRTC gateway or SIP interface to LiveKit
You must choose how audio moves: you can use TwiML and a Twilio WebRTC gateway to link Twilio calls directly to a browser-like endpoint, or use Twilio’s SIP Interface to connect to a SIP endpoint that LiveKit can bridge. Twilio Media Streams can also push raw call audio in real time to a WebSocket endpoint you host, which suits transcription workloads. Each approach has tradeoffs in latency, complexity, and compatibility.
Describe signaling and media transport: Webhooks, WebRTC data channels, RTP, audio codecs
Signaling will be handled by Twilio webhooks and your server endpoints for LiveKit token generation. Media will flow over RTP within WebRTC or SIP sessions. You’ll need to ensure compatible audio codecs (commonly PCMU/PCMA for PSTN but Opus for WebRTC) and implement sample rate conversion where necessary. WebRTC data channels may be used for control messages or to transmit small metadata, but primary audio uses media channels.
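For example, if you ingest Twilio Media Streams audio (8 kHz μ-law) and your speech-to-text expects 16 kHz PCM16, the conversion can be sketched with Python’s standard-library audioop module (deprecated since Python 3.11 and removed in 3.13, so pin your interpreter or substitute an equivalent library):

import audioop
import base64

def mulaw_to_pcm16_16k(payload_b64, state=None):
    # Twilio Media Streams deliver base64-encoded mu-law at 8 kHz.
    mulaw = base64.b64decode(payload_b64)
    pcm_8k = audioop.ulaw2lin(mulaw, 2)  # mu-law -> 16-bit linear PCM
    # Resample 8 kHz -> 16 kHz; `state` carries filter state across chunks.
    pcm_16k, state = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, state)
    return pcm_16k, state

Opus audio from WebRTC tracks would instead pass through an Opus decoder before any resampling.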
State management and conversation context: short-term memory, external DB, or Notion/knowledge base integration
Preserving context is essential. Use short-term memory in-process for quick turn-by-turn context and an external database for longer-term state—Redis for ephemeral context, PostgreSQL for transcripts and history. You can optionally integrate Notion or another knowledge base to store conversation summaries, user profiles, or reference documents the agent should consult during inference.
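As a sketch of the ephemeral layer, assuming a local Redis instance and the redis-py client (key names here are illustrative):

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def append_turn(context_id, role, text):
    # Keep each call's turns in a list and expire it after an hour
    # so abandoned calls don't accumulate state.
    key = f"call:{context_id}:history"
    r.rpush(key, json.dumps({"role": role, "text": text}))
    r.expire(key, 3600)

def get_history(context_id):
    key = f"call:{context_id}:history"
    return [json.loads(item) for item in r.lrange(key, 0, -1)]

Long-lived transcripts can be flushed from this list into PostgreSQL when the call ends.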
Initial project setup and repository structure
Clone starter repo or create new project layout with server, client, and ai-agent directories
Start a repository with a clear layout: a server folder for webhook endpoints and token generation, a client folder for a simple web client to monitor LiveKit rooms and audio, and an ai-agent folder for the worker that interacts with the AI runtime. This separation keeps responsibilities clear and lets you scale components independently.
Set up package.json or pyproject with dependencies: livekit-client, twilio, express/fastify or Flask/FastAPI, ngrok
Initialize your project’s dependency manifest and include core libraries: the LiveKit server SDK for token generation, the LiveKit client library for room connectivity, the Twilio SDK for request verification and TwiML helpers, an HTTP framework like Express or Fastify (Node) or Flask/FastAPI (Python), and ngrok for local tunneling. Add audio processing libraries if needed for resampling and format conversion.
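If you take the Python route, a minimal requirements.txt might look like this (these are the commonly published package names; pin versions to match your environment):

fastapi          # webhook and token endpoints
uvicorn          # ASGI server
twilio           # TwiML helpers and request validation
livekit          # realtime (rtc) SDK for the agent worker
livekit-api      # server-side access token generation
python-dotenv    # load .env files in development
redis            # short-term conversation state
websockets       # Twilio Media Streams handler, if you use that path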
Create basic server endpoints for health, Twilio webhooks, and LiveKit token generation
Implement a health endpoint for uptime checks, a Twilio webhook endpoint that responds to incoming calls and can initiate a Dial or Media Stream, and a token generation endpoint to issue LiveKit tokens to the agent and any monitoring clients. Keep the server code minimal initially so you can iterate quickly.
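A minimal FastAPI sketch of those three endpoints follows; the route paths are illustrative, the webhook returns placeholder TwiML for now, and the token route needs authentication before you expose it publicly:

import os

from fastapi import FastAPI, Response
from livekit import api
from twilio.twiml.voice_response import VoiceResponse

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/twilio/voice")
def inbound_call():
    # Placeholder answer; this later becomes a <Connect><Stream> or <Dial><Sip>.
    twiml = VoiceResponse()
    twiml.say("Connecting you to the agent.")
    return Response(content=str(twiml), media_type="application/xml")

@app.get("/livekit/token")
def livekit_token(identity: str, room: str):
    # Issue a LiveKit access token; lock this route down before production.
    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room))
    )
    return {"token": token.to_jwt()}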
Prepare simple client to join LiveKit room for testing and monitoring audio streams
Build a lightweight client (web or headless) that can join LiveKit rooms with an access token. Use this client to confirm that audio tracks are published, that you can mute/unmute, and to monitor raw audio streams during debugging. This client is invaluable for verifying whether issues are on the Twilio side or inside your AI pipeline.
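A headless monitor can be sketched with the LiveKit Python SDK; the server URL is a placeholder and the token would come from your token endpoint:

import asyncio
from livekit import rtc

async def monitor(url, token):
    room = rtc.Room()

    @room.on("track_subscribed")
    def on_track(track, publication, participant):
        # Fires for each remote track; wrap audio tracks in rtc.AudioStream
        # if you want to inspect raw frames.
        print(f"subscribed to {track.kind} from {participant.identity}")

    await room.connect(url, token)
    print(f"joined room {room.name}")
    await asyncio.sleep(60)  # observe the room for a minute
    await room.disconnect()

asyncio.run(monitor("wss://your-project.livekit.cloud", "TOKEN_FROM_YOUR_SERVER"))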
Environment variables and secure secrets management
List required env vars: TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER, LIVEKIT_API_KEY, LIVEKIT_API_SECRET, CURSOR_KEY or VAPI_KEY
Define environment variables clearly: TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER, LIVEKIT_API_KEY, LIVEKIT_API_SECRET, and your AI runtime key (CURSOR_KEY or VAPI_KEY). Also include PORT, NGROK_AUTH_TOKEN, DATABASE_URL, and any other service-specific secrets you need.
Create an .env file example and .env.local for local testing; never commit secrets to git
Provide an example .env.example file with placeholder values and create a .env.local for your actual local secrets. Make sure .gitignore includes .env and other secrets so you never commit keys to your repo.
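A .env.example with placeholders might look like this (any variable beyond the list above, such as LIVEKIT_URL, is illustrative):

TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=replace-me
TWILIO_PHONE_NUMBER=+15551234567
LIVEKIT_API_KEY=replace-me
LIVEKIT_API_SECRET=replace-me
LIVEKIT_URL=wss://your-project.livekit.cloud
CURSOR_KEY=replace-me
PORT=3000
NGROK_AUTH_TOKEN=replace-me
DATABASE_URL=postgres://user:pass@localhost:5432/agent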
Use secret storage for production: environment variables in cloud, HashiCorp Vault, or cloud secret manager
For production, switch from local .env files to secure secret managers provided by your cloud provider, or a dedicated secret manager like HashiCorp Vault. Configure role-based access control so only the services that need keys can retrieve them.
Rotate keys and manage access control for team members
Implement key rotation policies and audit access. When team members join or leave, update access control in your secret manager. Rotate keys periodically and after any suspected compromise.
LiveKit configuration and room setup
Provision LiveKit API keys and select region/cluster for latency considerations
When provisioning LiveKit keys, pick the cluster region closest to your expected callers and agent runtime to minimize latency. Note both the public server URL for clients and any internal server parameters for token signing.
Configure room defaults: max participants, audio-only room, track publishing permissions
Set room defaults to match your use case: audio-only rooms reduce bandwidth and simplify processing. Limit max participants if the room is dedicated to a single caller and a single agent, and configure publishing permissions so only authorized agents and monitoring clients can publish audio.
Generate access tokens server-side for participants and agents with appropriate grants
Always generate LiveKit access tokens server-side with appropriate grants: grant only the capabilities a participant needs, such as join, publish, or subscribe. Short-lived tokens reduce risk if a token is intercepted.
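Building on the token endpoint sketched earlier, per-role grants and a short TTL might look like this with the livekit-api Python package:

from datetime import timedelta
from livekit import api

def agent_token(api_key, api_secret, room):
    # The agent both publishes (TTS audio) and subscribes (caller audio).
    return (
        api.AccessToken(api_key, api_secret)
        .with_identity("ai-agent")
        .with_ttl(timedelta(minutes=15))
        .with_grants(api.VideoGrants(room_join=True, room=room,
                                     can_publish=True, can_subscribe=True))
        .to_jwt()
    )

def monitor_token(api_key, api_secret, room):
    # Monitoring clients can only listen.
    return (
        api.AccessToken(api_key, api_secret)
        .with_identity("monitor")
        .with_ttl(timedelta(minutes=15))
        .with_grants(api.VideoGrants(room_join=True, room=room,
                                     can_publish=False, can_subscribe=True))
        .to_jwt()
    )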
Test LiveKit connect flow using a lightweight client to confirm audio join and mute/unmute work
Validate the LiveKit integration with your lightweight client. Confirm you can join a room, publish and subscribe to audio tracks, and perform mute/unmute. This testing ensures the basic real-time plumbing is correct before adding AI processing.
Twilio configuration and webhook wiring
Buy Twilio phone number and configure Voice webhook to point to your server endpoint
In the Twilio console, buy a phone number that supports voice and configure its Voice webhook to point to your server’s Twilio endpoint. During development, point it to your ngrok URL. Make sure your server can respond quickly to Twilio requests or handle asynchronous flows.
Decide webhook response strategy: TwiML to Dial to a WebRTC/SIP gateway or REST-based media stream
Decide whether you’ll respond with TwiML that instructs Twilio to Dial to a WebRTC or SIP gateway, or whether you’ll use Twilio Media Streams to stream audio to a WebSocket endpoint for transcription. The TwiML Dial approach bridges the call into a media-capable endpoint, whereas Media Streams is better when you need raw audio frames for low-latency transcription.
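Both responses, sketched with the Twilio Python helper; the WebSocket URL and SIP URI are placeholders for your own endpoints:

from twilio.twiml.voice_response import Connect, Dial, VoiceResponse

def media_stream_twiml():
    # Stream raw call audio to your WebSocket handler for transcription.
    response = VoiceResponse()
    connect = Connect()
    connect.stream(url="wss://your-server.example.com/media")
    response.append(connect)
    return str(response)

def sip_dial_twiml():
    # Bridge the call into a SIP endpoint that LiveKit can accept.
    response = VoiceResponse()
    dial = Dial()
    dial.sip("sip:agent-room@your-sip-domain.example.com")
    response.append(dial)
    return str(response)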
If using Twilio Media Streams or SIP Interface, set up proper JSON webhook handlers and Twilio console settings
If you use Media Streams, implement WebSocket handlers or webhook endpoints that accept the stream events and audio payloads. For SIP Interface, configure SIP domains and authentication so Twilio can connect to LiveKit or your SIP endpoint. Ensure event and status callbacks are handled so you can react to call lifecycle events.
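A stripped-down Media Streams handler, assuming the websockets package (version 11+, where handlers take a single argument). Twilio sends JSON messages whose event field is connected, start, media, or stop, with base64-encoded μ-law audio in media.payload:

import asyncio
import base64
import json
import websockets

async def handle_stream(ws):
    async for message in ws:
        event = json.loads(message)
        if event["event"] == "start":
            print("stream started:", event["start"]["streamSid"])
        elif event["event"] == "media":
            audio = base64.b64decode(event["media"]["payload"])  # 8 kHz mu-law
            # Hand `audio` to your buffering/transcription pipeline here.
        elif event["event"] == "stop":
            print("stream stopped")
            break

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8080):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())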
Use ngrok to expose local endpoints for Twilio testing; update Twilio webhook URL during development
Run ngrok (or an equivalent) to expose your local server and update Twilio’s webhook URL during development. Keep ngrok running while testing and update the URL if it changes. Use ngrok logs to debug incoming requests and responses.
Building the inbound AI agent: code walkthrough
Outline agent responsibilities: accept audio, transcribe, run model inference, generate audio response, send audio back
Your AI agent must accept streamed audio, transcribe it to text, feed sequential context into a conversational model, decide on a reply, synthesize the reply to audio, and inject the audio back into the LiveKit room or Twilio call. It also should log transcripts and optionally manage conversation state and fallback behaviors.
Integrate Cursor or chosen AI runtime: auth, session management, text-to-speech and speech-to-text endpoints
Integrate the AI runtime by authenticating with your API key and creating persistent sessions as appropriate. Use their speech-to-text endpoint to transcribe chunks and their text-generation endpoint for inference. Use text-to-speech for audio output and cache voices or settings to reduce setup overhead between turns.
Implement audio handling: capture RTP/WebRTC audio frames, manage buffering, convert sample rates and codecs
You’ll need to capture audio frames from LiveKit (or Twilio Media Streams) and buffer them into sensible chunks for transcription. Convert sample rates and codecs as necessary: typical conversions include decoding Twilio’s 8 kHz μ-law (PCMU) to 16-bit PCM, resampling to 16 kHz mono for speech-to-text, and decoding Opus from WebRTC tracks. Ensure you handle jitter, packet reordering, and silence frames, and implement VAD (voice activity detection) if you want to avoid transcribing silence.
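One way to skip silence before transcription, assuming the webrtcvad package and 16-bit PCM at 16 kHz (webrtcvad only accepts 10, 20, or 30 ms frames):

import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono samples

vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)

def speech_frames(pcm):
    # Yield only the 30 ms frames that contain speech.
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame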
Show sample pseudocode for main loops: receive audio -> transcribe -> generate reply -> synthesize -> send audio
Here’s a concise pseudocode main loop to illustrate the flow:
while call_active:
    audio_chunk = receive_audio_from_livekit()
    if is_silence(audio_chunk):
        continue
    transcript = ai_runtime.stt(audio_chunk, context_id)
    update_conversation_history(context_id, "user", transcript)
    prompt = build_prompt(conversation_history[context_id])
    model_reply = ai_runtime.generate_text(prompt)
    update_conversation_history(context_id, "agent", model_reply)
    tts_audio = ai_runtime.text_to_speech(model_reply, voice="friendly")
    send_audio_to_livekit(tts_audio, target_participant=twilio_bridge)
This loop assumes you manage context_id and conversation history, and that you have helper functions for STT and TTS.
Conclusion
Recap the end-to-end process: accounts, config, code, testing, deployment, and monitoring
You’ve walked through creating an inbound voice AI agent: create accounts (Twilio, LiveKit, AI runtime), wire up configuration and secrets, implement a server to handle Twilio webhooks and LiveKit token generation, build or join a LiveKit room to route audio, process audio with an AI runtime to transcribe and respond, and test locally with ngrok before deploying to production. Each step needs validation and monitoring.
Highlight key success factors: secure env, audio handling, robust testing, and cost control
Key success factors are secure secret management, robust audio handling (codecs and resampling), effective context management, and rigorous testing across edge cases like call transfers and network jitter. Also monitor costs for trunking, hours of streaming, and AI runtime usage and optimize model calls to control spend.
Suggested next actions: run the Twilio test, iterate on prompts, and prepare for production deployment
Next, run a live Twilio test by calling your number, iterate on prompt design to improve agent responses, add telemetry and logging, prepare deployment artifacts (Docker images, cloud infra), and test failover scenarios. Consider load testing and adding rate limits or autoscaling.
Resources and references to consult: Twilio docs, LiveKit docs, Cursor/VAPI docs, and the Notion guide
Keep the Twilio and LiveKit documentation and your AI runtime docs at hand for API specifics and best practices. Maintain your Notion guide or README with configuration details, runbooks, and test scripts so you and your team can reproduce the setup or onboard others quickly.
Good luck — you’re now equipped to build an inbound voice AI agent that answers calls, maintains context, and routes audio end-to-end using LiveKit and Twilio.
If you want to implement Chat and Voice Agents into your business to reduce missed calls, book more appointments, save time, and make more revenue, book a discovery call here: https://brand.eliteaienterprises.com/widget/bookings/elite-ai-30-min-demo-call
