Outbound AI Caller with Twilio

  1. quickstart-js → Provides a simple HTML front-end and a server to create calls.
  • Create `./.env` and populate your API key: `ULTRAVOX_API_KEY=<YOUR_KEY_HERE>`.

  • Install dependencies with `pnpm install`.

  • Run the front-end and back-end simultaneously with `pnpm dev`.

  2. simple-vanilla-html → Provides a vanilla HTML example. Note: this example requires you to manually create a call and then paste in the joinUrl.

  3. drdonut-nextjs → A Next.js app that provides an AI agent for taking orders at a fictional drive-thru donut chain. Consult README.md in the project folder for more information.

    *(architecture diagram)*
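The simple-vanilla-html example above requires a joinUrl from a manually created call. A minimal server-side sketch of building that request, assuming the Ultravox REST API's `POST /api/calls` endpoint with an `X-API-Key` header (the `systemPrompt` value is an illustrative placeholder):

```javascript
// Sketch: build the request used to create an Ultravox call, whose JSON
// response contains the joinUrl to paste into the vanilla HTML page.
// Endpoint and header names follow the Ultravox REST API; verify against
// the current API docs before relying on them.
function buildCreateCallRequest(apiKey) {
  return {
    url: "https://api.ultravox.ai/api/calls",
    options: {
      method: "POST",
      headers: {
        "X-API-Key": apiKey,
        "Content-Type": "application/json",
      },
      // Placeholder prompt; replace with your agent's actual instructions.
      body: JSON.stringify({ systemPrompt: "You are a helpful agent." }),
    },
  };
}

// Usage (Node 18+ for global fetch):
//   const { url, options } = buildCreateCallRequest(process.env.ULTRAVOX_API_KEY);
//   const call = await fetch(url, options).then((r) => r.json());
//   console.log(call.joinUrl);  // paste this into the vanilla HTML page
```

Keeping the API key server-side, as here, avoids exposing it in the browser.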

Ultravox is a new kind of multimodal LLM that can understand text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. Building on research like AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, Ultravox is able to extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. We've trained versions on Llama 3, Mistral, and Gemma. This direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future, this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.

The current version of Ultravox (v0.4), when invoked with audio content, has a time-to-first-token (TTFT) of approximately 150ms and a tokens-per-second rate of ~60 using a Llama 3.1 8B backbone. While these numbers are already quite fast, we believe there is considerable room for improvement.
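The figures above give a back-of-envelope latency model: total response time is roughly TTFT plus the token count divided by the generation rate. A small sketch (the function name and defaults are ours, not part of Ultravox):

```javascript
// Estimate end-to-end response latency from the v0.4 numbers above:
// ~150 ms time-to-first-token, then ~60 tokens per second.
function estimateLatencyMs(numTokens, ttftMs = 150, tokensPerSec = 60) {
  return ttftMs + (numTokens / tokensPerSec) * 1000;
}

// A 60-token reply takes roughly 150 + 1000 = 1150 ms end to end.
console.log(estimateLatencyMs(60)); // 1150
```

This is why TTFT dominates perceived responsiveness for short conversational turns.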

Ultravox currently takes in audio and emits streaming text. As we evolve the model, we'll train it to be able to emit a stream of speech tokens that can then be converted directly into raw audio by an appropriate unit vocoder.

Demo

See Ultravox in action on our demo page.

You can run the Gradio demo locally with `just gradio`. To run the demo in "voice mode", which allows natural audio conversations with Ultravox, run `just gradio --voice_mode=True`.

Discord

Join us on our Discord server here.

Jobs

If you're interested in working on Ultravox full-time, we're hiring! Check out our jobs page here.

Inference Server

You can try out Ultravox using your own audio content (as a WAV file) by spinning up an Ultravox instance through our partner, BaseTen: https://www.baseten.co/library/ultravox/. They offer free credits to get started.

If you're interested in running Ultravox in a real-time capacity, we offer a set of managed APIs as well. You can learn more about getting access to those here.

Model

You can download the latest weights from the Ultravox Hugging Face page.