To overcome the limitation of having only ~100 years of real financial data, CFM is exploring the use of Generative AI to create vast synthetic market histories. This would allow them to train and test their quantitative models on a scale of a "million years," making them more robust.
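CFM's actual generative models aren't public, so the sketch below substitutes a much simpler generator, a stationary block bootstrap of historical returns, to show the workflow: mint many long synthetic histories, then measure how a toy strategy's Sharpe ratio varies across them. All data and the momentum signal are invented for illustration.

```python
# Minimal sketch (not CFM's actual method): generate long synthetic return
# histories via block bootstrap, then stress-test a toy strategy on them.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for ~100 years of daily returns (real data would go here).
historical_returns = rng.standard_t(df=4, size=25_000) * 0.01

def block_bootstrap(returns, n_days, block=20):
    """Resample contiguous blocks to preserve short-range autocorrelation."""
    out = np.empty(n_days)
    i = 0
    while i < n_days:
        start = rng.integers(0, len(returns) - block)
        take = min(block, n_days - i)
        out[i:i + take] = returns[start:start + take]
        i += take
    return out

sharpe_dist = []
for _ in range(250):  # synthetic "centuries"; scale up toward a million years
    synth = block_bootstrap(historical_returns, n_days=25_000)
    # Toy momentum signal: sign of a 50-day moving average of returns.
    signal = np.sign(np.convolve(synth, np.ones(50) / 50, mode="same"))
    pnl = np.roll(signal, 1)[1:] * synth[1:]   # trade on yesterday's signal
    sharpe_dist.append(pnl.mean() / pnl.std() * np.sqrt(252))

print(f"Sharpe across synthetic histories: median={np.median(sharpe_dist):.2f}, "
      f"5th pct={np.percentile(sharpe_dist, 5):.2f}")
```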
To break the data bottleneck in AI protein engineering, companies now generate massive synthetic datasets. By creating novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, massively accelerating model development.
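A hypothetical sketch of the data-generation step: propose random epitope sequences, run a stand-in binding assay, and threshold the readout into positive and negative labels. The sequence length, the 0.5 affinity cutoff, and the assay function are illustrative assumptions, not a published protocol.

```python
# Hypothetical sketch: turn one high-throughput binding experiment into
# thousands of labeled training pairs.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
random.seed(0)

def random_epitope(length=12):
    """Propose a novel synthetic epitope sequence."""
    return "".join(random.choices(AMINO_ACIDS, k=length))

def measure_binding(epitope):
    """Stand-in for the real assay readout (normalized affinity in [0, 1])."""
    return random.random()

# One synthetic library -> thousands of labeled examples in a single pass.
library = [random_epitope() for _ in range(5_000)]
dataset = [(seq, 1 if measure_binding(seq) >= 0.5 else 0) for seq in library]

positives = sum(label for _, label in dataset)
print(f"{len(dataset)} examples: {positives} positive, "
      f"{len(dataset) - positives} negative")
```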
Synthetic data serves as an efficient first step for training specialized AI, particularly when a larger model teaches a smaller one. However, it is insufficient on its own: the final, crucial stage still requires expensive "human signal" (feedback from subject matter experts) to close the remaining performance gap.
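A minimal sketch of that two-phase recipe, assuming a generic teacher/student setup rather than any specific vendor's pipeline: a stand-in teacher labels plentiful synthetic inputs to bootstrap a small student, then a scarce expert-labeled set supplies the final human signal via continued training.

```python
# Sketch of "synthetic first, human signal last" (assumed workflow).
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
true_w = rng.normal(size=20)                    # hidden ground truth

def teacher_label(X):
    """Stand-in for a large teacher model's predictions (slightly noisy)."""
    noise = rng.normal(scale=0.5, size=len(X))
    return ((X @ true_w + noise) > 0).astype(int)

# Stage 1: cheap, plentiful synthetic inputs labeled by the teacher.
X_synth = rng.normal(size=(5_000, 20))
student = SGDClassifier(loss="log_loss", random_state=0)
student.partial_fit(X_synth, teacher_label(X_synth), classes=[0, 1])

# Stage 2: small, expensive expert-labeled set supplies the human signal.
X_expert = rng.normal(size=(200, 20))
y_expert = (X_expert @ true_w > 0).astype(int)  # stand-in for SME feedback
for _ in range(20):                             # a few refinement passes
    student.partial_fit(X_expert, y_expert)

print(f"accuracy on expert labels: {student.score(X_expert, y_expert):.2f}")
```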
Advanced model training is not just about scraping the web. It's a multi-stage process: pretraining on massive web data, refinement on human-created examples and ratings (supervised fine-tuning, SFT), and then scaling with reinforcement learning on data the model generates itself. This synthetic data loop is now a critical component.
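The synthetic data loop can be sketched as rejection sampling, in the spirit of techniques like STaR or rejection fine-tuning (which may or may not match any given lab's pipeline): sample several candidate solutions per problem, keep only those a verifier accepts, and feed the survivors into the next training round. The model, sampler, and verifier below are toy stand-ins.

```python
# Schematic of the synthetic-data loop: sample, verify, harvest, retrain.
import random

random.seed(0)

def model_sample(question):
    """Stand-in for sampling a worked solution from the model."""
    a, b = question
    guess = a + b + random.choice([-1, 0, 0, 0, 1])   # sometimes wrong
    return f"{a}+{b}={guess}", guess

def verifier(question, guess):
    """Cheap programmatic check; this is what makes the loop scalable."""
    return guess == sum(question)

questions = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(500)]

sft_data = []
for q in questions:
    for _ in range(4):                      # several samples per question
        text, guess = model_sample(q)
        if verifier(q, guess):              # keep only verified solutions
            sft_data.append(text)
            break

print(f"{len(sft_data)} verified examples harvested for the next training round")
```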
To build confidence in AI's ability to forecast the future, researchers are training "historical LLMs" on data ending in a specific year, like 1930. They then test the model's ability to predict text from a later period, like 1940. This process of historical validation helps calibrate and improve models predicting our own future.
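A toy version of the temporal-holdout idea: fit a model only on pre-cutoff text, then score post-cutoff text it has never seen. Real studies use LLMs and large corpora; the unigram model and placeholder sentences here only illustrate the train-before/test-after split.

```python
# Temporal holdout in miniature: "train" on text up to 1930, score 1940 text.
import math
from collections import Counter

corpus = {  # placeholder snippets standing in for dated corpora
    1925: "the market rose and credit expanded across the nation",
    1929: "banks failed and panic spread through the market",
    1940: "the war economy expanded and production rose sharply",
}

cutoff = 1930
train_tokens = " ".join(t for y, t in corpus.items() if y <= cutoff).split()
test_tokens = " ".join(t for y, t in corpus.items() if y > cutoff).split()

counts = Counter(train_tokens)
vocab = len(counts) + 1                   # +1 for unseen words (add-one smoothing)
total = sum(counts.values())

def log_prob(token):
    return math.log((counts[token] + 1) / (total + vocab))

nll = -sum(log_prob(t) for t in test_tokens) / len(test_tokens)
print(f"per-token negative log-likelihood on post-{cutoff} text: {nll:.2f}")
```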
Instead of using sensitive company information, you can prompt an AI model to create realistic, fake data for your business. This lets you experiment with powerful data visualization and analysis workflows without exposing private or regulated information.
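For instance, here is one way to do it with the OpenAI Python SDK; the model name, prompt wording, and column schema are assumptions you would adapt, and any chat-capable model works the same way.

```python
# Illustrative sketch: ask a model for fictional CSV data, then analyze it.
# Requires OPENAI_API_KEY in the environment.
import io
import pandas as pd
from openai import OpenAI

prompt = (
    "Generate 50 rows of realistic but entirely fictional CSV data for a "
    "retail business with columns: order_id, order_date, region, product, "
    "units, unit_price. Output only the CSV, no commentary."
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",                   # assumed model choice
    messages=[{"role": "user", "content": prompt}],
)

df = pd.read_csv(io.StringIO(resp.choices[0].message.content))
print(df.head())                           # visualize/analyze as usual
```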
Hudson River Trading shifted from handcrafted features based on human intuition to training models on raw, internet-scale market data. This emergent approach, similar to how ChatGPT is trained, has largely displaced traditional quant methods built on simpler techniques like linear regression.
Static data scraped from the web is becoming less central to AI training. The new frontier is "dynamic data," where models learn through trial-and-error in synthetic environments (like solving math problems), effectively creating their own training material via reinforcement learning.
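The trial-and-error loop can be shown in miniature: an environment that mints fresh arithmetic problems on demand, and an epsilon-greedy agent that learns from reward alone. Production systems use LLM policies and RL algorithms such as PPO; this toy only demonstrates how training data is generated rather than scraped.

```python
# "Dynamic data" in miniature: the environment creates problems forever,
# and the agent learns which action earns reward.
import random

random.seed(0)
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def new_problem():
    a, b = random.randint(1, 9), random.randint(1, 9)
    return a, b, a + b                    # environment always asks for the sum

q_values = {name: 0.0 for name in OPS}    # agent's value estimate per action
counts = {name: 0 for name in OPS}

for step in range(2_000):
    a, b, target = new_problem()          # data is generated, not scraped
    if random.random() < 0.1:             # explore
        action = random.choice(list(OPS))
    else:                                 # exploit current best estimate
        action = max(q_values, key=q_values.get)
    reward = 1.0 if OPS[action](a, b) == target else 0.0
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]

print({k: round(v, 2) for k, v in q_values.items()})  # "add" should dominate
```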
Expect 2026 to be the breakout year for synthetic data. Companies in highly regulated sectors like healthcare and finance are realizing it offers a compliant, low-risk way to test and train AI models without compromising sensitive customer information, enabling innovation in marketing, research, and customer experience (CX).
To combat unreliable backtests, CFM is building "meta-models" that quantitatively predict whether a new model's results are overfitted. This systematic approach aims to replace human judgment with a data-driven process for deciding if a trading model is robust enough for production.
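CFM hasn't published its meta-model, so the sketch below invents a plausible setup: featurize each backtest (in-sample Sharpe, parameter count, number of configurations tried) and train a classifier to flag strategies whose performance is likely to decay out of sample. Every feature and label here is synthetic.

```python
# Hypothetical meta-model: predict from backtest metadata whether in-sample
# Sharpe will hold up out of sample (features and data are invented).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2_000

# Per-backtest features: in-sample Sharpe, parameters, configurations tried.
is_sharpe = rng.normal(1.0, 0.5, n)
n_params = rng.integers(1, 50, n)
n_trials = rng.integers(1, 500, n)
X = np.column_stack([is_sharpe, n_params, n_trials])

# Synthetic ground truth: heavy search and many parameters erode OOS Sharpe.
oos_sharpe = (is_sharpe - 0.02 * n_params - 0.1 * np.log(n_trials)
              + rng.normal(0, 0.3, n))
y = (is_sharpe - oos_sharpe > 0.5).astype(int)   # 1 = likely overfit

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
meta = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(f"overfitting-flag accuracy on held-out backtests: "
      f"{meta.score(X_te, y_te):.2f}")
```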
In global macro, theses often rest on small samples (e.g., only a handful of historical recessions). AI expands the effective sample size by identifying fundamentally similar crises across different countries and eras, or by modeling the underlying economic logic deeply enough that a large sample becomes less necessary for conviction.
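One way to make "fundamentally similar crises" concrete is nearest-neighbor retrieval over macro feature vectors, as in this sketch; the features and numbers are invented, and a real system might use learned embeddings instead.

```python
# Illustrative sketch: represent historical episodes as macro feature
# vectors and retrieve the most similar ones to grow a small sample.
import numpy as np

# [gdp_drop_pct, unemployment_rise, credit_growth, inflation] per episode
# (all values invented for illustration)
episodes = {
    "US 2008":       np.array([-4.3, 5.0, -2.1, 1.5]),
    "Sweden 1991":   np.array([-5.1, 6.2, -3.0, 4.5]),
    "Japan 1997":    np.array([-2.0, 1.1, -1.5, 0.5]),
    "Thailand 1997": np.array([-7.6, 2.5, -4.0, 5.6]),
    "US 1929":       np.array([-26.0, 20.0, -8.0, -5.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "US 2008"
ranked = sorted(
    ((cosine(episodes[query], v), k) for k, v in episodes.items() if k != query),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: similarity {score:.2f}")  # analogues enlarge the sample
```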