The 'Duccio' Framework Signals a Shift to Multimodal AI for Smarter Recommendations

Related Insights

Businesses Widely Adopt Multimodal AI for Input, But Lag in Generating Multimodal Output

While companies readily use models that process image, audio, and text inputs, generating multimodal outputs (such as video or complex graphics) remains rare in business. The primary output is still text or structured data, with synthesized speech as the main exception.

2025 was the year of agents, what's coming in 2026?

Practical AI·6 months ago

Future Hyper-Personalization Will Be a Hybrid of Cloud and On-Device AI

The future of personalization may involve a two-step process. A centralized AI (like Criteo's) will provide strong recommendations. Then, a smaller, privacy-centric model running locally on the user's device (e.g., in their glasses) will perform the final, hyper-personalized adjustments, keeping the most sensitive data private.

Milliseconds to Match: Criteo's AdTech AI & the Future of Commerce w/ Diarmuid Gill & Liva Ralaivola

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·2 months ago

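The two-step process described above can be sketched roughly as follows. This is a minimal illustration, not Criteo's method: the `Candidate` type, the `rerank_on_device` function, the per-category affinity table, and the linear blend with `weight` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    item_id: str
    category: str
    cloud_score: float  # relevance score returned by the centralized model


def rerank_on_device(candidates, private_affinity, weight=0.5):
    """Blend cloud scores with a locally stored per-category affinity.

    private_affinity is assumed to stay on the device; only the
    resulting order is used by the caller.
    """
    def local_score(c):
        affinity = private_affinity.get(c.category, 0.0)
        return (1 - weight) * c.cloud_score + weight * affinity

    return sorted(candidates, key=local_score, reverse=True)


# Step 1 (cloud): strong but generic recommendations (hypothetical data).
cloud_candidates = [
    Candidate("shoes-1", "running", 0.90),
    Candidate("tent-7", "camping", 0.80),
    Candidate("bike-3", "cycling", 0.70),
]

# Step 2 (device): sensitive preferences held locally (hypothetical data).
private_affinity = {"camping": 1.0, "running": 0.2}

ranked = rerank_on_device(cloud_candidates, private_affinity)
print([c.item_id for c in ranked])
```

In this sketch the server never sees `private_affinity`; it only supplies the candidate list, and the final ordering is computed locally.
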
Natively Multimodal Embeddings Eliminate a Key Bottleneck for Enterprise Knowledge Retrieval

Google's Embedding 2 model is a significant infrastructure upgrade because it is 'natively multimodal': it lets AI directly understand and retrieve images, diagrams, and text without first converting non-text data into lossy captions. As a result, internal knowledge bases and co-pilots become more effective and accurate for enterprises.

Why Google Workspace CLI is a Big Deal

The AI Daily Brief: Artificial Intelligence News and Analysis·4 months ago

Fuse Image and Text Vector Embeddings to Create Powerful Semantic Search

To move beyond keyword search in their media archive, Tim McLear's system generates two vector embeddings for each asset: one from the image thumbnail and another from its AI-generated text description. Fusing these enables a powerful semantic search that understands visual similarity and conceptual relationships, not just exact text matches.

“Nobody wanted to do this work”: How Emmy Award–winning filmmakers use AI to automate the tedious parts of documentaries

How I AI·8 months ago
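
The summary above does not say how the two embeddings are fused. One common option, shown here purely as an assumption, is to normalize each vector, weight it, and concatenate the two before comparing by cosine similarity. The function names, the `image_weight` parameter, and the toy vectors are hypothetical stand-ins for real model outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse(image_vec, text_vec, image_weight=0.5):
    """Concatenate weighted unit-length image and text vectors, then renormalize."""
    img = [image_weight * x for x in l2_normalize(image_vec)]
    txt = [(1 - image_weight) * x for x in l2_normalize(text_vec)]
    return l2_normalize(img + txt)


def cosine(a, b):
    # Both inputs are unit vectors, so the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d vectors standing in for thumbnail and description embeddings.
archive = {
    "clip-harbor": fuse([1.0, 0.0], [0.9, 0.1]),
    "clip-forest": fuse([0.0, 1.0], [0.1, 0.9]),
}

query = fuse([0.9, 0.1], [1.0, 0.0])
best = max(archive, key=lambda name: cosine(query, archive[name]))
print(best)
```

Raising `image_weight` makes visual similarity dominate the match; lowering it favors the text description.
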

The Next AI Wave Isn't Language Models, It's Multi-Sensory World Models

The current focus on LLMs is a temporary phase. The true leap towards AGI will come from multi-sensory models that can process and integrate visual, auditory, and other data streams simultaneously, much like a human does. This moves AI from text generation to real-world understanding.

Trump-Xi Summit, Benioff: "Not My First SaaSpocalypse," OpenAI vs Apple, Multi-Sensory AI, El Niño

All-In with Chamath, Jason, Sacks & Friedberg·2 months ago

The Next AI Frontier is 'Anything In, Anything Out' Multimodal Mega-Models

The future of creative AI is moving beyond simple text-to-X prompts. Labs are working to merge text, image, and video models into a single "mega-model" that can accept any combination of inputs (e.g., a video plus text) to generate a complex, edited output, unlocking new paradigms for design.

Where Does Consumer AI Stand at the End of 2025?

The a16z Show·6 months ago

The Future AI Moat Is in Complex Non-Text Models, Not Commoditized LLMs

While today's focus is on text-based LLMs, the true, defensible AI battleground will be in complex modalities like video. Generating video requires multiple interacting models and unique architectures, creating far greater potential for differentiation and a wider competitive moat than text-based interfaces, which will become commoditized.

OpenAI's Code Red, Sacks vs New York Times, New Poverty Line?

All-In with Chamath, Jason, Sacks & Friedberg·7 months ago

Traditional RAG Fails by Ignoring Visual Data; Multimodal Models Are the Fix

Standard Retrieval-Augmented Generation (RAG) systems often fail because they treat complex documents as pure text, missing crucial context within charts, tables, and layouts. The solution is to use vision language models for embedding and re-ranking, making visual and structural elements directly retrievable and improving accuracy.

The NVIDIA Nemotron Stack For Production Agents

Machine Learning Tech Brief By HackerNoon·5 months ago
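
The embed-then-re-rank pattern described above can be outlined as a two-stage function. This is a sketch under stated assumptions, not the NVIDIA stack's API: the page embeddings are toy values standing in for a vision language model's output, and `toy_reranker` is a hypothetical placeholder for a real re-ranking model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query_vec, pages, rerank_score, k=2):
    """Stage 1: shortlist the top-k pages by embedding similarity.
    Stage 2: reorder the shortlist with a costlier scoring function."""
    shortlist = sorted(
        pages, key=lambda p: dot(query_vec, p["embedding"]), reverse=True
    )[:k]
    return sorted(shortlist, key=rerank_score, reverse=True)


# Each entry represents a whole page image, so charts and layout are
# assumed to be captured in the embedding rather than dropped as plain text.
pages = [
    {"id": "p1-text", "embedding": [0.9, 0.1], "has_chart": False},
    {"id": "p2-chart", "embedding": [0.8, 0.3], "has_chart": True},
    {"id": "p3-cover", "embedding": [0.1, 0.9], "has_chart": False},
]


def toy_reranker(page):
    # Placeholder: a real re-ranker would score the query against the page image.
    return 1.0 if page["has_chart"] else 0.0


results = retrieve_then_rerank([1.0, 0.0], pages, toy_reranker)
print([p["id"] for p in results])
```

The first stage keeps the expensive scorer off most of the corpus; the second lets visual evidence change the final order.
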

Recommender Systems Prove AIs Can Be Superhuman at Predicting Human Tastes

The common belief that AI can't truly understand human wants is debunked by existing technology. Adam D'Angelo points out that recommender systems on platforms like Instagram and Quora are already far better than any individual human at predicting what a user will find engaging.

Amjad Masad & Adam D’Angelo: How Far Are We From AGI?

The a16z Show·8 months ago

Google Uses Specialized Models Like Veo as R&D Proving Grounds for Its Foundational Gemini Model

Google's strategy involves building specialized models (e.g., Veo for video) to push the frontier in a single modality. The learnings and breakthroughs from these focused efforts are then integrated back into the core, multimodal Gemini model, accelerating its overall capabilities.

How Google’s Nano Banana Achieved Breakthrough Character Consistency

Training Data·8 months ago
