Is AI at a data plateau?
Professor & OpenAI Board Member: Zico Kolter
Credit and Thanks:
Based on insights from 20VC by Harry Stebbings.
Today’s Podcast Host: Harry Stebbings
Title
OpenAI's Newest Board Member on The Biggest Questions and Concerns in AI Safety
Guest
Zico Kolter
Guest Credentials
Zico Kolter is a Professor of Computer Science and head of the Machine Learning Department at Carnegie Mellon University, recently appointed to OpenAI's Board of Directors and its Safety and Security Committee. His research focuses on AI safety, alignment, and robustness of machine learning systems, with notable contributions including developing methods for creating deep learning models with guaranteed robustness. Kolter's career spans academia and industry, having served as Chief Data Scientist at C3.ai, Chief Expert at Bosch, and Chief Technical Advisor at Gray Swan, an AI safety startup.
Podcast Duration
1:03:45
This Newsletter Read Time
Approx. 5 mins
Brief Summary
Kolter and host Harry Stebbings explore the implications of large language models (LLMs), the challenges of data utilization, and the potential for achieving artificial general intelligence (AGI) within the next few decades. Kolter emphasizes the importance of understanding AI's capabilities and the need for responsible development and regulation.
Deep Dive
Zico Kolter began by elucidating the foundational mechanics behind modern AI technologies, particularly large language models (LLMs). At their core, these models operate by predicting the next word in a sequence based on vast amounts of data sourced from the internet. Kolter emphasized that while this may seem simplistic, merely predicting words, it is a profound scientific achievement that yields coherent and intelligent outputs. This capability challenges the notion that LLMs lack true intelligence: the emergent properties of these models demonstrate a level of understanding that is both surprising and significant.
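To make that mechanism concrete, here is a minimal sketch of greedy next-word prediction; the choice of GPT-2 and the Hugging Face transformers library is purely illustrative, not something discussed in the episode.

```python
# Minimal next-word prediction sketch: an LLM repeatedly picks the most
# likely next token given everything generated so far.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The key idea behind large language models is"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                       # generate ten tokens greedily
        logits = model(ids).logits            # a score for every vocabulary entry
        next_id = logits[0, -1].argmax()      # most probable next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))
```

Everything the model "says" emerges from repeating this one step; that is precisely the simplicity Kolter finds so striking given the intelligence of the resulting text.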
A critical aspect of Kolter's discussion revolved around the availability of data and the misconceptions surrounding a potential data shortage. He argued that while much of the high-quality, easily accessible text data has been utilized, there remains a vast reservoir of untapped resources, particularly in multimodal formats such as video and audio. For instance, he pointed out that the current models are trained on approximately 30 terabytes of text data, which is a minuscule fraction of the total data available. The real challenge lies not in the scarcity of data but in the computational capacity required to process and leverage this diverse information effectively. Kolter suggested that the future of AI will hinge on our ability to harness these additional data types, which could significantly enhance model performance.
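A quick back-of-the-envelope calculation shows the scale implied by that figure; the bytes-per-token ratio below is a common rule of thumb, not a number from the episode.

```python
# Rough scale of a 30 TB text corpus, using the common assumption that
# one token averages roughly four bytes of English text.
corpus_bytes = 30e12            # 30 terabytes of text
bytes_per_token = 4             # assumed average; varies by tokenizer
tokens = corpus_bytes / bytes_per_token
print(f"~{tokens:.1e} tokens")  # ~7.5e+12, i.e. trillions of tokens
```

A single hour of video can occupy gigabytes on its own, so multimodal archives dwarf even a trillions-of-tokens text corpus by orders of magnitude, which is the gap Kolter expects future systems to exploit.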
Despite concerns about data limits, Kolter asserted that AI performance does not appear to be plateauing. He noted that larger models consistently outperform smaller ones, even when trained on the same datasets. This observation suggests that there is still room for improvement and that the field has not yet reached its full potential. He highlighted that current algorithms are not fully extracting the maximum information from the data available, indicating that there are still many deductions and inferences to be made. As models grow in size and sophistication, they will likely develop the capability to better utilize existing data, further driving advancements in AI.
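One way to formalize the observation that larger models outperform smaller ones on the same data is a Chinchilla-style parametric scaling law of the form L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. The sketch below uses the fitted constants reported by Hoffmann et al. (2022); the podcast itself does not cite these figures.

```python
# Chinchilla-style loss model: L(N, D) = E + A / N**alpha + B / D**beta.
def loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # Hoffmann et al. fit
    return E + A / n_params**alpha + B / n_tokens**beta

D = 1e12  # hold the dataset fixed at one trillion tokens
for n in (1e9, 1e10, 1e11):  # 1B, 10B, 100B parameters
    print(f"{n:.0e} params -> predicted loss {loss(n, D):.3f}")
# Predicted loss keeps falling as N grows even though D is unchanged,
# mirroring Kolter's point that scale alone still buys performance.
```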
The rapid commoditization of AI technologies presents both opportunities and challenges. Kolter acknowledged the proliferation of open-source models and the increasing accessibility of AI tools, which have democratized the technology. However, he cautioned that this commoditization could lead to a dilution of quality and a race to the bottom in terms of performance. Companies that succeed in this landscape will be those that can differentiate themselves through innovation and the ability to leverage their unique data and expertise effectively.
A significant theme in the discussion was the dual pursuit of artificial general intelligence (AGI) and profitable AI products. Kolter posited that these goals are not mutually exclusive; advancements in AI that lead to more capable systems can also yield commercially viable applications. He expressed optimism that AGI could be achieved within his lifetime, estimating a timeline of 4 to 50 years. This perspective reflects a growing belief in the potential for AI to evolve beyond narrow applications and into more generalized forms of intelligence.
However, the conversation also delved into the darker implications of AI technology, particularly concerning misinformation and the erosion of trust in objective reality. Kolter articulated a profound concern that the proliferation of AI-generated content could lead to a societal landscape where individuals no longer believe anything they see or read. This phenomenon, he argued, is not solely a product of AI but rather an accelerant of existing trends in human cognition and belief systems. The challenge lies in restoring trust in information sources while navigating the complexities introduced by AI.
Safety concerns surrounding AI were also a focal point of Kolter's insights. He outlined a hierarchy of risks associated with AI systems, emphasizing the need for robust safety measures as these technologies become more integrated into critical infrastructure. One of his primary concerns is the models' ability to follow instructions reliably, as they can be manipulated to produce harmful outputs. This unpredictability poses significant risks, particularly as AI systems are deployed in sensitive areas such as cybersecurity and public safety.
The discussion culminated in a nuanced examination of the implications of releasing open-source AI models. While Kolter acknowledged the benefits of open-source development for research and innovation, he also warned of the potential dangers associated with unrestricted access to powerful AI capabilities. He argued that there may come a time when the risks of releasing certain models outweigh the benefits, particularly if those models possess capabilities that could be exploited for malicious purposes. This delicate balance between fostering innovation and ensuring safety will be a critical consideration for the future of AI development.
Key Takeaways
The availability of data for AI training is not as limited as commonly perceived; significant untapped resources exist.
Larger AI models continue to demonstrate improved performance, indicating that the field has not yet plateaued.
Ethical concerns surrounding misinformation and AI's impact on societal trust necessitate careful regulation and oversight.
Actionable Insights
Organizations should invest in exploring and utilizing diverse data types, including video and audio, to enhance AI model training.
Researchers and developers should focus on scaling up model sizes while optimizing algorithms to maximize performance gains.
Stakeholders in AI development must engage in discussions about ethical implications and contribute to the creation of regulatory frameworks that address misinformation.
Why it’s Important
Understanding the dynamics of data availability and model performance is crucial for advancing AI technology responsibly. As AI systems become more integrated into daily life, the potential for misinformation and ethical dilemmas increases, making it imperative for developers and policymakers to collaborate on effective solutions. This conversation highlights the need for a balanced approach that fosters innovation while safeguarding societal values.
What it Means for Thought Leaders
For thought leaders in technology and policy, the insights from this podcast underscore the importance of staying informed about the rapid advancements in AI and their implications. They must advocate for responsible AI practices and engage in interdisciplinary dialogues that bridge the gap between technological innovation and ethical considerations. This proactive stance will be essential in shaping a future where AI serves the greater good.
Key Quote
"The real negative outcome is that people are not going to believe anything that they see. It didn't even need AI to get there, but AI is absolutely an accelerant for this process”.
Future Trends & Predictions
As AI technology continues to evolve, we can expect a growing emphasis on multimodal data integration, leading to more sophisticated and capable AI systems. The conversation suggests that the timeline for achieving AGI may be shorter than previously anticipated, with predictions ranging from 4 to 50 years. Additionally, the increasing commoditization of AI models will likely spur innovation, but it will also necessitate more stringent regulatory measures to mitigate risks associated with misinformation and ethical misuse.
Latest in AI
1. Amazon Prime Video is launching a new feature called "AI Topics," designed to enhance content discovery by providing more precise recommendations based on user interests. This innovative tool moves beyond traditional algorithms by categorizing content into thematic sections such as "mind-bending sci-fi" and "fantasy quests," allowing users to explore diverse genres more effectively. Currently in limited beta testing on select devices, AI Topics aims to prevent users from reaching "dead ends" in their viewing journey by enabling them to refine recommendations through related topics.
2. Airbnb is implementing machine learning technology this holiday season to prevent unauthorized parties at its rentals, particularly over New Year's Eve. The system analyzes various signals, such as the length of stay and the distance from the guest's location, to identify and block high-risk bookings for entire home listings. This initiative aims to promote responsible travel and ensure a positive experience for hosts and communities, building on previous successes that have significantly reduced party incidents since the measures were first introduced.
3. Meta has introduced the Byte Latent Transformer (BLT), a novel approach to language model training that replaces traditional tokens with learned patches. This method, detailed in their recent publication, demonstrates improved scaling efficiency and performance compared to token-based models. BLT operates directly on raw bytes, eliminating the need for tokenization and vocabulary management, which simplifies the training process and reduces computational overhead. Meta has also released the training code for BLT, allowing researchers and developers to explore and build upon this innovative technique for more efficient language model development.
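To see why dropping the tokenizer simplifies things, compare byte-level input preparation with token-based preparation. The sketch below illustrates the general idea behind BLT, not Meta's released code; the fixed patch size stands in for the paper's entropy-based patch boundaries.

```python
# Byte-level input preparation: every UTF-8 string maps directly to
# integer IDs in a fixed vocabulary of 256, with no tokenizer to train
# and no vocabulary file to manage.
text = "Byte Latent Transformer"
byte_ids = list(text.encode("utf-8"))   # e.g. [66, 121, 116, 101, ...]
print(len(byte_ids), byte_ids[:6])

# BLT then groups bytes into variable-length "patches"; in the paper the
# boundaries fall where a small model finds the next byte hard to predict.
PATCH = 4  # hypothetical fixed patch length, for illustration only
patches = [byte_ids[i:i + PATCH] for i in range(0, len(byte_ids), PATCH)]
print(patches[:3])
```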
Useful AI Tools
1. Microsoft has released a package that can convert any docx, xlsx, or pptx file to markdown for efficient use as context for a language model (a minimal usage sketch follows this list).
2. ChatGPT released Projects, which group files, chats, and custom instructions in one place for better organization and streamlined interactions.
3. Pika 2.0 - a new video generation model with ‘ingredients’ that incorporate users’ own images into outputs, with improved motion and animation.
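The converter in item 1 appears to be Microsoft's open-source MarkItDown package; assuming so, a minimal usage sketch looks like this (the input file name is hypothetical).

```python
# Convert an Office document to markdown so it can be passed to an LLM
# as context. Requires: pip install markitdown
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("quarterly_report.docx")  # hypothetical input file
print(result.text_content)                    # markdown rendering of the document
```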
Startup World
1. Lockheed Martin has established a new subsidiary called Astris AI, aimed at accelerating the adoption of artificial intelligence solutions within the U.S. defense sector and high-assurance commercial industries. This initiative will provide access to Lockheed Martin's advanced machine learning operations (MLOps) and generative AI platforms, designed to meet stringent security and compliance requirements. By offering modular, adaptable solutions and comprehensive consultative engineering services, Astris AI seeks to empower organizations to efficiently develop and deploy secure AI technologies in regulated environments.
2. The European Union's IRIS² satellite constellation, set to become operational by 2030, will provide secure connectivity and high-speed internet access across Europe and underserved regions. This ambitious project, which involves approximately 290 satellites in both low Earth orbit (LEO) and medium Earth orbit (MEO), has a total budget of €10.6 billion, with significant funding from both public and private sectors. Designed as a competitor to SpaceX's Starlink, IRIS² will enhance communication capabilities for government entities, businesses, and citizens while addressing connectivity gaps in remote areas.
3. A new immersive VR experience titled "Tonight With The Impressionists, Paris 1874," created by the French startup Excurio in collaboration with the Musée d’Orsay, allows visitors to explore 19th-century Paris and its iconic Impressionist art scene. The experience takes participants through the streets of Paris, into the first Impressionist exhibition, and to other significant locations in art history, all while accommodating over 100 users simultaneously without the need for bulky equipment. Despite some logistical challenges, such as navigating real-world obstacles while immersed in a virtual environment, the experience offers a unique opportunity to engage with art and history in an innovative way.
Analogy
Kolter likens modern AI to a powerful telescope—seemingly simple in its function of “predicting the next word,” yet profound in its ability to reveal vast and unexpected depths of intelligence. While skeptics see a basic mechanism, the emergent properties of large language models expose patterns and understanding, much like how a telescope unveils hidden galaxies. As AI scales with untapped data—audio, video, and beyond—its potential grows exponentially. Yet, as with telescopes aimed at the unknown, there’s a balance: democratized access fosters discovery, but misuse could distort the view, eroding trust in what’s real. The challenge is navigating this duality.
Thanks for reading, have a lovely day!
Jiten-One Cerebral
All summaries are based on publicly available content from podcasts. One Cerebral provides complementary insights and encourages readers to support the original creators by engaging directly with their work; by listening, liking, commenting or subscribing.