
How Building AI Voice Agents can Help Your Product

Co-Founder/CEO of Deepgram: Scott Stephenson

Credit and Thanks: Based on insights from The TWIML AI Podcast with Sam Charrington.

Key Learnings

  • The evolution of audio AI is shifting towards perception and interaction models, moving beyond traditional speech-to-text frameworks.

  • Continuous learning and adaptation in audio models are essential for improving accuracy and relevance in specific use cases.

  • Founders should prioritize the integration of perception, understanding, and interaction layers to create more effective AI agents.

  • The complexity of diarization remains a challenge, highlighting the need for specialized models tailored to specific environments.

  • Voice AI applications are rapidly expanding across industries, presenting significant opportunities for startups to innovate.

Today’s Podcast Host: Sam Charrington

Title

Building AI Voice Agents with Scott Stephenson

Guests

Scott Stephenson

Guest Credentials

Scott Stephenson is the co-founder and CEO of Deepgram, a leading speech-to-text technology company he started in 2015 after transitioning from a career in particle physics. His background includes a PhD in particle physics from the University of Michigan and postdoctoral research at UC Davis, where he worked on dark matter detection experiments. Under Stephenson's leadership, Deepgram has grown significantly, offering real-time speech-to-text APIs in over 30 languages and expanding into text-to-speech capabilities.

Podcast Duration

1:01:15

This Newsletter Read Time

Approx. 5 mins

Deep Dive

Scott Stephenson emphasized the importance of perception, understanding, and interaction models over traditional speech-to-text and text-to-speech frameworks. He articulated a vision where audio models are not merely about transcribing speech but about creating intelligent agents capable of meaningful interactions. This shift matters for startup founders looking to innovate in AI, as it calls for a more holistic approach to audio technology.

Stephenson pointed out that while significant advancements have been made in audio models, gaps still exist, particularly in handling low-quality audio environments, such as phone calls with background noise. He recounted his experiences from seven years ago, when speech-to-text accuracy for such scenarios was around 50%, illustrating the challenges founders must navigate. Models like Whisper, trained with large-scale weak supervision, have improved accuracy, but the complexity of diarization (identifying who is speaking in a conversation) remains a significant hurdle. Founders should recognize that addressing these gaps requires not only technological innovation but also a deep understanding of the specific use cases they aim to serve.
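At its core, diarization means attaching a speaker label to each stretch of audio. As a toy illustration (not Deepgram's actual pipeline), word-level speaker labels, the kind a diarizing transcription API typically returns, can be collapsed into readable speaker turns:

```python
def group_into_turns(labeled_words):
    """Collapse word-level (word, speaker) labels into contiguous speaker turns."""
    turns = []  # list of [speaker, [words]]
    for word, speaker in labeled_words:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word)   # same speaker keeps talking
        else:
            turns.append([speaker, [word]])  # speaker change starts a new turn
    return [(speaker, " ".join(words)) for speaker, words in turns]

# Example: a short two-party exchange with one speaker change
labeled = [("thanks", 0), ("for", 0), ("calling", 0), ("hi", 1), ("yes", 1)]
print(group_into_turns(labeled))  # → [(0, 'thanks for calling'), (1, 'hi yes')]
```

The hard part, of course, is producing those per-word speaker labels from raw audio in the first place, which is where the specialized models Stephenson describes come in.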

Adaptation and specialization of audio models are essential for achieving high performance in diverse applications. Stephenson emphasized that while general models are beneficial, the future lies in fine-tuning these models to cater to specific industries, such as healthcare or customer service. For instance, a model designed for medical transcription must understand specialized terminology and context, which can only be achieved through targeted training. This approach not only enhances accuracy but also builds trust with users, as they see the model's ability to understand their unique needs.

Continuous learning is another critical aspect that Stephenson highlighted. He explained that traditional supervised learning methods are evolving, allowing models to learn from real-time interactions and adapt over time. This capability is particularly relevant for startups aiming to create AI solutions that remain relevant in a rapidly changing environment. By investing in technologies that support continuous learning, founders can ensure their products evolve alongside user needs, ultimately leading to better user experiences.

The conversation also touched on the complexity of building comprehensive AI companies. Stephenson noted that Deepgram's approach integrates model building with infrastructure, allowing for a seamless experience when developing voice AI applications. This integration is vital for startups, as it enables them to focus on their core competencies while leveraging existing technologies to enhance their offerings. For example, the Deepgram voice AI API provides developers with the tools to create sophisticated voice applications without needing to build everything from scratch.

Stephenson’s insights into the future of voice AI in human-computer interaction are particularly relevant for founders. He envisions a landscape where voice agents become a primary interface for various applications, from automating customer service interactions to enhancing accessibility in healthcare. This shift will require startups to think beyond traditional models and consider how their products can facilitate more natural and intuitive interactions. By focusing on the agent framework, which encompasses perception, understanding, and interaction, founders can create solutions that not only meet user needs but also anticipate them.

In terms of use cases, Stephenson highlighted the growing demand for voice AI in sectors like healthcare and food service. For instance, AI agents can assist in scheduling appointments or providing information about services, significantly improving efficiency and user satisfaction. Founders should explore these verticals, as they present substantial opportunities for innovation and growth.

Ultimately, the conversation underscored the importance of interpretability and end-to-end training in developing effective audio models. As the technology matures, the ability to understand and control how models learn and adapt will be crucial for building trust with users. Founders should prioritize transparency in their AI systems, ensuring that users can comprehend how decisions are made and how the technology evolves over time.

By embracing these principles and focusing on the integration of perception, understanding, and interaction, startup founders can position themselves at the forefront of the audio AI revolution, creating solutions that not only meet current demands but also anticipate future needs.

Actionable Insights

  • Invest in technologies that support continuous learning to ensure your audio models remain competitive and relevant.

  • Focus on fine-tuning audio models for specific industries to enhance accuracy and user experience.

  • Explore the integration of voice AI in customer service to automate interactions and improve efficiency.

  • Encourage your team to adopt a proactive approach in developing AI solutions that address real user needs.

  • Leverage existing APIs, like the Deepgram voice AI API, to accelerate the development of your voice applications.

Mind Map

Key Quote

"The goalposts always change with time as they should with AI... you want to create models that are well versed in the world and can tackle many different problems."

As the demand for voice AI solutions continues to grow, we can expect a surge in the development of specialized audio models that cater to specific industries. The integration of continuous learning capabilities will likely become a standard feature in AI systems, enabling them to adapt in real-time to user interactions. Additionally, the rise of voice agents as a primary interface for human-computer interaction will reshape how businesses engage with customers, leading to more personalized and efficient experiences. Founders should prepare for a future where voice technology is not just an add-on but a core component of their product offerings.

Check out the podcast here:

Latest in AI

1. Simon Kohl, a former Google DeepMind scientist, has launched Latent Labs with $50 million in funding to revolutionize the field of biology through AI. The startup aims to make biology programmable by building AI foundation models that can create and optimize proteins, potentially reducing reliance on traditional wet lab experiments.

2. A team led by Caltech's Sergei Gukov has developed a new machine-learning algorithm capable of solving math problems requiring extremely long sequences of steps, some involving a million or more moves. The algorithm has been used to solve families of problems related to the Andrews–Curtis conjecture, a decades-old math problem. While not solving the main conjecture itself, the team disproved families of potential counterexamples that had remained open for about 25 years and made significant progress on another family of counterexamples open for 44 years.

3. While Anthropic has not officially announced a release date for Claude 4, industry speculation suggests it may debut in early to mid-2025, with some rumors hinting at an imminent release. Claude 4 is expected to build upon the advancements of the Claude 3.5 series, potentially offering an expanded context window, advanced multimodal capabilities, and improved coding and reasoning skills. 

Startup World

1. EnCharge AI, a startup spun out of Princeton University, has secured over $100 million in Series B funding led by Tiger Global to develop analog in-memory-computing AI chips. The company's technology aims to significantly reduce energy consumption for AI workloads, claiming to be 20 times more efficient than current solutions on the market. EnCharge plans to launch its first AI accelerator chips in 2025, targeting client computing devices like laptops and wearables, with the goal of enabling AI inference on local devices rather than relying on energy-intensive cloud data centers.

2. Apptronik, a humanoid robot manufacturer with notable partnerships including Nvidia, DeepMind, and NASA, has successfully raised $350 million in a Series A funding round. This substantial investment will likely accelerate the development and production of Apptronik's advanced humanoid robots, potentially bringing them closer to commercial applications. The significant funding and high-profile partnerships suggest growing interest and confidence in the potential of humanoid robotics technology for various industries and applications.

Analogy

Scott Stephenson’s vision for audio AI is like evolving from basic walkie-talkies to fluent interpreters. Early speech-to-text models, like static translations, captured only part of the message, often struggling with noisy environments. Stephenson’s approach transforms audio models into dynamic, conversational agents that don’t just hear, but truly understand and respond. Founders embracing this shift can build specialized, adaptive solutions, akin to teaching interpreters to master specific dialects for industries like healthcare or customer service. By focusing on perception, understanding, and interaction, startups can pioneer a future where voice technology feels less like a tool and more like a seamless partner.

What did you think of today's email?

Your feedback helps me create better emails for you!

Loved it

It was ok

Terrible

Thanks for reading, have a lovely day!

Jiten-One Cerebral

All summaries are based on publicly available content from podcasts. One Cerebral provides complementary insights and encourages readers to support the original creators by engaging directly with their work; by listening, liking, commenting or subscribing.
