Does Synthetic Data Play a Role in AI Training?
Co-Founder/CTO of Poolside: Eiso Kant
Credit and Thanks:
Based on insights from 20VC with Harry Stebbings.
Key Learnings
Poolside aims to bridge the gap between human and machine intelligence by focusing on AI for software development, emphasizing the importance of capturing the entire coding process, not just final outputs.
The success of AI models hinges on the quality and scale of data, with a significant need for comprehensive datasets that include intermediate reasoning and execution feedback.
Synthetic data plays a crucial role in AI training, but it must be validated through real-world execution to ensure its effectiveness and relevance.
The competitive landscape of AI development requires substantial funding and compute resources, with estimates suggesting that entering the race for foundational models could cost upwards of $100 billion.
The future of AI will likely see a consolidation of smaller players as larger companies leverage their resources to acquire talent and technology, shaping the industry's landscape.
Today’s Podcast Host: Harry Stebbings
Title
CTO @Poolside: Raising $500M To Compete in the Race for AGI
Guests
Eiso Kant
Guest Credentials
Eiso Kant is the co-founder and CTO of Poolside, an AI startup focused on software engineering that recently raised $500 million in a Series B round, valuing the company at $3 billion. His extensive entrepreneurial career includes founding Athenian, a data-enabled engineering platform, and serving as CEO and co-founder of source{d}, a company dedicated to applying AI to code. Kant's earlier ventures include Twollars, a platform for raising money for charities through Twitter, and various web development and online publishing projects.
Podcast Duration
1:19:00
This Newsletter Read Time
Approx. 5 mins
Deep Dive
Poolside’s mission is to bridge the gap between human and machine intelligence, particularly in the realm of software development. The company aims to create advanced AI systems capable of understanding and generating code, a task that requires not just the final product but also the iterative thinking and reasoning that lead to successful coding outcomes. Kant emphasizes that existing datasets primarily capture the end result, neglecting the crucial intermediate steps that are essential for training AI models effectively. This gap in data is a significant hurdle, as it limits the AI's ability to learn from the entire coding process, which is often complex and non-linear.
Kant draws parallels between the deterministic nature of coding and the reinforcement learning techniques used in AI, citing the success of DeepMind's AlphaGo as a pivotal moment in AI history. He explains that AlphaGo's initial training on historical game data was only the beginning; the real breakthrough came when the model learned through self-play, exploring various strategies and learning from its successes and failures. This approach highlights the importance of capturing not just the final output but also the reasoning and decision-making processes that lead to that output. Kant believes that similar methodologies can be applied to software development, where capturing the entire coding journey—from task assignment to debugging—can significantly enhance AI capabilities.
The conversation also delves into the current bottlenecks in AI progress, particularly the interplay between compute, data, and models. Kant asserts that while advancements in algorithms are crucial, the real differentiator lies in the quality and quantity of data available for training. He notes that the AI industry is experiencing a significant compute shortage, which has created a competitive landscape where access to powerful hardware is essential. Kant's perspective aligns with Larry Ellison's assertion that a staggering $100 billion is necessary to enter the race for foundational models, underscoring the immense capital required to build the infrastructure needed for advanced AI.
Kant discusses the value of synthetic data in overcoming the limitations of traditional datasets. He explains that while synthetic data can be generated to fill gaps, it must be validated through execution feedback to ensure its usefulness. This process involves using models to generate potential solutions and then evaluating those solutions based on their performance in real-world scenarios. Kant emphasizes that this feedback loop is critical for improving AI models, as it allows them to learn from both their outputs and the reasoning behind those outputs.
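The generate-then-validate loop Kant describes can be sketched in miniature. This is an illustrative toy, not Poolside's actual pipeline: candidate solutions (hardcoded here; in practice sampled from a model) are executed against tests, and only the ones that pass are kept as training data.

```python
# Toy sketch of execution-feedback validation for synthetic code data.
# Candidates defining a `solve(x)` function are executed against tests;
# only passing candidates survive as training examples.

def passes_tests(source: str, tests: list[tuple[int, int]]) -> bool:
    """Execute a candidate `solve(x)` definition and check it against tests."""
    namespace: dict = {}
    try:
        exec(source, namespace)          # define the candidate function
        solve = namespace["solve"]
        return all(solve(x) == expected for x, expected in tests)
    except Exception:                    # crashes count as failures
        return False

# Toy task: square a number. One correct and one buggy candidate.
candidates = [
    "def solve(x):\n    return x * x",   # correct
    "def solve(x):\n    return x + x",   # buggy: passes (2, 4) but not (3, 9)
]
tests = [(2, 4), (3, 9)]

validated = [c for c in candidates if passes_tests(c, tests)]
print(len(validated))  # only the correct candidate survives
```

Note how the buggy candidate passes the first test by coincidence but is filtered out by the second, which is why the feedback loop needs tests that actually exercise the task rather than a single spot check.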
As for scaling laws in AI, Kant expresses optimism about the potential for continued advancements. He believes that the industry has only begun to scratch the surface of what is possible with larger models and more extensive datasets. However, he acknowledges that the cost of training these models is a significant concern. Over the next 12 to 24 months, Kant predicts that the costs associated with model training will continue to decrease due to increased competition among major players like Nvidia, Google, and Amazon. This price war is expected to drive down costs, making advanced AI more accessible.
Kant emphasizes that value in the AI landscape derives from three critical components: models, chips, and applications. Advanced AI models serve as the foundation for various applications, with their effectiveness heavily reliant on the scale and quality of training data. Chips, particularly GPUs, are essential for powering these models, and companies that can produce their own chips gain a competitive edge. Ultimately, the true economic value is realized when these models are applied to real-world problems, particularly in areas like software development, where significant productivity gains can be achieved.
Kant also addresses the future of model distillation, suggesting that while large models are essential for achieving high levels of performance, there will always be a need for smaller, more efficient models that can be deployed in real-world applications. He believes the industry will continue to distill large, capable models into smaller versions that retain much of their intelligence while being far more cost-effective to run.
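The core mechanic behind distillation can be shown numerically. This is a generic sketch of the standard soft-target approach, not any specific lab's recipe: the small "student" is trained to match the temperature-softened output distribution of the large "teacher", measured here with a KL divergence.

```python
# Minimal numerical sketch of knowledge distillation: a student is
# penalized for diverging from the teacher's softened distribution.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student."""
    p = softmax(teacher_logits, temperature)   # soft teacher targets
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])
aligned = np.array([3.8, 1.1, 0.4])   # student close to the teacher
random_ = np.array([0.1, 2.0, 1.5])   # student far from the teacher

print(distillation_loss(teacher, aligned) < distillation_loss(teacher, random_))
```

Raising the temperature spreads probability mass onto the teacher's non-top choices, which is what lets the student learn the teacher's "dark knowledge" about relative similarities rather than just its argmax.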
The conversation then turns to the relationship between cash and compute access, with Kant asserting that while cash is critical for acquiring compute resources (hence raising $500M), it does not guarantee access. He explains that the current supply-demand imbalance in the GPU market means that even well-funded startups may struggle to secure the necessary hardware. This reality has led to a strategic focus on building relationships with hardware providers and optimizing compute usage.
The conversation concludes with a reflection on work ethic and work-life balance, with Kant asserting that the drive to succeed in the race for AGI requires a level of commitment that may challenge traditional notions of work-life balance. He acknowledges that while the pursuit of ambitious goals can lead to intense pressure and long hours, it is essential for those involved to be passionate about their work. Kant emphasizes that this dedication is not merely about sacrificing personal time but about fostering a culture where team members are motivated by a shared vision and the excitement of contributing to groundbreaking advancements in AI. Ultimately, he believes that this commitment can lead to both personal fulfillment and significant technological progress.
Actionable Insights
Prioritize the development of comprehensive datasets that capture the entire coding process, including intermediate reasoning and execution feedback, to enhance AI training.
Invest in reinforcement learning techniques to improve model efficiency and adaptability in real-world applications.
Establish strategic partnerships with hardware providers to secure access to necessary compute resources for training advanced AI models.
Explore the use of synthetic data generation to fill gaps in training datasets, ensuring that generated data is validated through execution feedback.
Foster a company culture that emphasizes passion and commitment to the race for AGI, attracting talent that shares a vision for ambitious technological advancements.
Why it’s Important
Understanding the significance of comprehensive data and innovative learning techniques is crucial for companies aiming to lead in the AI space. As the race towards AGI intensifies, these elements will determine which organizations can effectively bridge the gap between human and machine intelligence.
What it Means for Thought Leaders
For thought leaders, the discussion underscores the necessity of fostering an environment that encourages innovation in AI development. It highlights the importance of collaboration across sectors and the need for a strategic approach to data utilization and model training. As the landscape evolves, thought leaders must advocate for policies and practices that support the growth of AI capabilities while addressing ethical considerations.
Key Quote
"We do not get the luxury of stumbling on the capabilities race or to go to market ready to go; this is a race."
Future Trends & Predictions
As the demand for AI capabilities continues to grow, we can expect a surge in investment towards developing more sophisticated reinforcement learning techniques and comprehensive datasets. The trend towards synthetic data generation will likely become more prevalent, enabling companies to train models more effectively. Additionally, as competition intensifies, we may see a consolidation of smaller AI firms into larger entities, driven by the need for scale and resources to compete in the AGI race.
Check out the podcast here:
Latest in AI
1. Stanford's Virtual Lab employs AI agents as collaborators in scientific research, enabling interdisciplinary approaches to tackle complex challenges. Researchers have showcased its potential by designing nanobodies that can bind to the virus that causes COVID-19, proposing nearly 100 of these structures in a fraction of the time it would take a human research group.
2. Alibaba's Qwen team has released Qwen2.5-VL, a new multimodal AI model that can process text, images, and videos, rivaling models from OpenAI and Google. Qwen2.5-VL is available for testing on Alibaba’s Qwen Chat app and for download from AI development platform Hugging Face.
3. Contrary to some reports, Microsoft has not released benchmark-topping synthetic data models on Hugging Face under an MIT license for commercial use. Microsoft has been working on synthetic data generation and has released models such as Phi-4 and Orca-2, but these are primarily intended for research and are not explicitly licensed for commercial use. Its synthetic-data efforts, including the Phi-3 and Orca projects, focus on building powerful language models without compromising privacy, with no indication of a commercial MIT-licensed release.
Useful AI Tools
1. 3D Detection of Traffic Lights: Novel 3D annotation method for traffic lights and signs in autonomous driving.
2. LLMs' Guardrails: Reasoning-based LLM safeguard for enhanced safety and explainability.
3. Evaluating Long-Context Learning: Comparative study of LLM performance with extended document contexts.
Startup World
1. ElevenLabs, a Polish-founded AI audio company based in New York, raised $180 million in Series C funding. The startup, which develops advanced text-to-speech technology for publishers and creators, is now valued at $3.3 billion. Investors in this round included a16z, ICONIQ Growth, NEA, and several others.
2. ThreatMark, a Czech Republic-based startup offering a machine learning platform to protect financial institutions from digital fraud, secured €22.3 million in venture and convertible note funding. The company plans to use the funds to expand into the UK, accelerate R&D, and grow its team.
3. TravelPerk, a Barcelona-based platform for managing business travel globally, raised $200 million in an oversubscribed Series E funding round. This substantial investment underscores the growing demand for efficient business travel management solutions in the post-pandemic era.
Analogy
AI's evolution mirrors the journey of a chess prodigy—first learning from grandmasters' past games, then refining skills through endless self-play. Just as AlphaGo’s true genius emerged from iterating on its own moves, the future of AI hinges on not just generating outputs but understanding the reasoning behind them. However, this progress is bottlenecked by a shortage of "chessboards"—the compute power needed to train ever-larger models. While giants like Nvidia control the game pieces, smaller players must optimize strategy, proving that intelligence isn’t just about brute force but about making the smartest moves with the resources at hand.
Thanks for reading, have a lovely day!
Jiten-One Cerebral
All summaries are based on publicly available content from podcasts. One Cerebral provides complementary insights and encourages readers to support the original creators by engaging directly with their work; by listening, liking, commenting or subscribing.