Jazmia Henry defines her "full stack" role as a four-stage process: obsessive data curation, custom tokenizer and embedding development, model training (pre-training and RL), and finally optimizing the trained model for efficient inference, a step that is often overlooked.
Unlearn.ai's value comes not only from its advanced generative models but also from its painstaking data harmonization work. The company builds internal machine learning tools to unify complex, disparate data sources such as clinical trials and real-world data, and that unified data is the essential foundation for creating powerful models.
Cohere's co-founder explains that creating large language models is enormously resource-intensive and complex, requiring vast compute, data, and specialized talent working in unison. This high barrier to entry is why the foundational model space is concentrated among a few players, similar to the aerospace industry.
Humane trained a foundational model from scratch on proprietary Arabic data. The primary goals were not to compete with global leaders but to understand cultural nuances, address language biases, and, most importantly, train the internal team to build the entire AI stack from the ground up.
Advanced model training is not just about scraping the web. It is a multi-stage process that starts with massive web data, is refined with human-written examples (supervised fine-tuning, SFT) and human ratings, and is then scaled using reinforcement learning on data generated by the model itself. This synthetic data loop is now a critical component.
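The synthetic data loop can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a real model, `verifier` stands in for a reward or correctness check, and the loop uses simple rejection sampling (keep only verified samples) rather than a full reinforcement learning update.

```python
import random


def toy_model(a, b, rng):
    # Hypothetical stand-in for a model: usually right, sometimes off by one.
    return a + b + rng.choice([0, 0, 0, 1, -1])


def verifier(a, b, answer):
    # Hypothetical automatic check that plays the role of a reward signal.
    return answer == a + b


def synthetic_round(problems, samples_per_problem, rng):
    """Sample answers from the model and keep the first verified one per problem."""
    kept = []
    for a, b in problems:
        for _ in range(samples_per_problem):
            answer = toy_model(a, b, rng)
            if verifier(a, b, answer):
                kept.append({"a": a, "b": b, "completion": answer})
                break
    return kept


rng = random.Random(0)
new_training_data = synthetic_round([(1, 2), (3, 4), (10, 5)], 8, rng)
```

In a real pipeline the kept samples would feed the next training round, so the model is trained partly on its own filtered output.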
Training models like GPT-4 involves two stages. First, "pre-training" consumes the internet to create a powerful but unfocused base model ("raw brain mass"). Second, "post-training" uses expert human feedback (SFT and RLHF) to align this raw intelligence into a useful, harmless assistant like ChatGPT.
Optimizing transformer inference, specifically the separation of pre-fill (KV cache building) and decode (token generation), is becoming a foundational skill. Chris Fregly predicts this complex topic, known as disaggregated pre-fill decode, will be a core component of AI engineering interviews at top labs within two years.
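The pre-fill and decode split can be sketched with a toy single-head attention in plain Python. This is a minimal illustration, not a real serving stack: `embed` and `next_token` are hypothetical stand-ins for a model's projections and sampling, and the same vector is used as both key and value. Real systems process the prompt in parallel during pre-fill and may run the two stages on separate hardware, handing the KV cache across.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Keys and values for every token processed so far."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


def embed(token_id, dim=4):
    # Toy deterministic embedding (hypothetical stand-in for real projections).
    return [((token_id * (i + 1)) % 7) / 7.0 for i in range(dim)]


def attend(query, cache):
    # Softmax attention of one query over all cached keys and values.
    scores = [sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, cache.values)) / total
        for i in range(len(query))
    ]


def next_token(context, vocab_size=16):
    # Toy "sampling": map the context vector to a token id.
    return int(sum(context) * 1000) % vocab_size


def prefill(prompt_ids):
    """Stage 1: process the whole prompt and build the KV cache."""
    cache = KVCache()
    for token_id in prompt_ids:
        vec = embed(token_id)
        cache.keys.append(vec)
        cache.values.append(vec)
    first = next_token(attend(embed(prompt_ids[-1]), cache))
    return cache, first


def decode(cache, token_id, steps):
    """Stage 2: generate one token at a time, reusing and extending the cache."""
    generated = [token_id]
    for _ in range(steps):
        vec = embed(generated[-1])
        cache.keys.append(vec)
        cache.values.append(vec)
        generated.append(next_token(attend(vec, cache)))
    return generated


cache, first_token = prefill([3, 1, 4, 1, 5])
tokens = decode(cache, first_token, steps=4)
```

The only state passed from `prefill` to `decode` is the cache and the first token, which is what makes it possible to run the two stages on different workers.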
The key advantage of labs like OpenAI isn't just pre-training but their ability to continuously post-train models on product-specific data. This tight feedback loop between the model and the product is their real competitive moat, and Prime Intellect aims to democratize that capability for all companies.
Customizing a base model with proprietary data is only effective if a company possesses a massive corpus. At least 10 billion high-quality tokens are needed *after* aggressive deduplication and filtering. This high threshold means the strategy is only viable for the largest corporations, a much higher bar than most businesses realize.
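A rough way to check a corpus against that bar is to count tokens only after deduplication and filtering. The sketch below makes several assumptions: exact-match deduplication on normalized text, a hypothetical minimum-length filter (`min_tokens`), and whitespace splitting as a crude proxy for a real tokenizer, which would give different counts.

```python
import hashlib

TOKEN_THRESHOLD = 10_000_000_000  # the 10 billion token bar cited above


def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())


def usable_tokens(docs, min_tokens=5):
    """Count whitespace tokens in documents that survive filtering and dedup."""
    seen = set()
    total = 0
    for doc in docs:
        norm = normalize(doc)
        n_tokens = len(norm.split())
        if n_tokens < min_tokens:
            continue  # filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        total += n_tokens
    return total


def meets_threshold(token_count, threshold=TOKEN_THRESHOLD):
    return token_count >= threshold


docs = [
    "The quick brown fox jumps over the dog",
    "the  quick brown fox jumps over the dog",  # duplicate after normalization
    "too short",
    "A second distinct document with enough tokens",
]
count = usable_tokens(docs)
```

Production pipelines typically add near-duplicate detection and quality classifiers, both of which shrink the usable count further.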
For low-latency applications, start with a small model to rapidly iterate on data quality. Then, use a large, high-quality model for optimal tuning with the cleaned data. Finally, distill the capabilities of this large, specialized model back into a small, fast model for production deployment.
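The final distillation step is often implemented by training the small model to match the large model's softened output distribution. The source does not say which method is used, so the sketch below is an assumption: it shows one common formulation, a temperature-scaled KL divergence between teacher and student logits, in plain Python.

```python
import math


def softmax(logits, temperature=1.0):
    # Numerically stable softmax with temperature scaling.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the small model toward the large model's behavior.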
Criteo builds multiple, specialized foundation models (for products, user timelines, etc.) rather than a single monolithic one. The embeddings from these models are made available across the company, serving as a "warm start" to accelerate the development and improve the performance of new AI products.
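The "warm start" idea can be sketched as a shared embedding store that a new feature reads from instead of training its own representation. The store contents, product names, and vector values below are hypothetical and chosen only for illustration; nothing here reflects Criteo's actual systems.

```python
import math

# Hypothetical shared store: embeddings precomputed by a product foundation model.
EMBEDDING_STORE = {
    "running_shoes": [0.9, 0.1, 0.0],
    "trail_shoes": [0.8, 0.2, 0.1],
    "coffee_maker": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(product, store=EMBEDDING_STORE):
    """A new 'similar products' feature built on the existing embeddings."""
    query = store[product]
    candidates = [name for name in store if name != product]
    return max(candidates, key=lambda name: cosine(query, store[name]))


suggestion = most_similar("running_shoes")
```

Because the embeddings already encode product similarity, the new feature needs no training of its own to produce a reasonable first version.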