
New AI research focuses on "interaction models" that handle real-time, full-duplex audio. This allows an AI to respond even while the user is still speaking—a significant step beyond current turn-based models and closer to the fluid, overlapping nature of natural human conversation.
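A full-duplex exchange can be pictured as two streams running concurrently rather than alternating turns. Below is a minimal Python sketch of that structure, assuming hypothetical `mic`, `speaker`, and `model` objects rather than any particular vendor's API.

```python
import asyncio

# Minimal sketch of a full-duplex loop: listening and speaking run at the same
# time instead of alternating turns. The mic/speaker/model objects here are
# hypothetical placeholders, not a specific product's API.

async def listen(mic, model):
    async for chunk in mic:              # incoming audio never stops flowing
        model.ingest(chunk)              # the model updates its state mid-utterance

async def speak(speaker, model):
    while True:
        frame = model.next_audio_frame() # may be silence, new speech, or a revision
        if frame is not None:
            await speaker.play(frame)
        await asyncio.sleep(0.02)        # ~20 ms pacing

async def full_duplex(mic, speaker, model):
    # Neither direction waits for the other to finish its "turn".
    await asyncio.gather(listen(mic, model), speak(speaker, model))
```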

Related Insights

The primary reason voice assistants feel robotic is their failure to process audio while speaking. They get confused by simple interjections like "yeah" or attempts to interrupt. OpenAI's new "BIDI" model aims to solve this by listening and updating its response in real time for a more natural conversation.
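Concretely, while the assistant is speaking, any incoming speech has to be classified as either a backchannel acknowledgement to talk through or a genuine interruption that should halt and revise the response. A rough sketch of that decision, with hypothetical `playback` and `model` helpers:

```python
# Hypothetical sketch: classifying user speech that arrives while the assistant
# is still talking. The playback/model objects are illustrative placeholders.

BACKCHANNELS = {"yeah", "uh-huh", "mm-hmm", "right", "ok", "sure"}

def handle_audio_while_speaking(transcript: str, playback, model) -> str:
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    if words and all(w in BACKCHANNELS for w in words):
        # Pure acknowledgement: keep talking, but record the signal.
        model.note_backchannel(transcript)
        return "continue"
    # Anything more substantive is treated as a barge-in: stop and re-plan.
    playback.stop()
    model.revise_response(new_user_input=transcript)
    return "yield_turn"
```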

Voice-to-voice AI models promise more natural, low-latency conversations by processing audio directly. However, they are currently impractical for many high-stakes enterprise applications due to a hallucination rate that can be eight times higher than text-based systems.

Current chat interfaces are compared to the command-line: they require users to learn a specific, procedural way of communicating ('prompt engineering'). New interaction models, which allow for natural, multimodal communication, could be AI's 'GUI moment,' democratizing access by letting users focus on the task, not the tool.

Until brain-computer interfaces are viable, the highest-bandwidth way to interact with AI is through speaking commands (voice out) and receiving information visually (visual in), whether on a screen or via glasses. This is because humans speak significantly faster than they can type, and absorb information faster by reading than by listening.
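As a rough illustration of the throughput gap (the figures below are commonly cited averages, not numbers from the source):

```python
# Illustrative arithmetic only; rates are rough, commonly cited averages
# and are not taken from the source.
speaking_wpm = 150   # typical conversational speaking rate
typing_wpm = 40      # typical typing rate
print(f"Voice input is roughly {speaking_wpm / typing_wpm:.1f}x faster than typing")
# -> Voice input is roughly 3.8x faster than typing
```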

The interface for AI agents is becoming nearly frictionless. By setting up a voice-to-voice loop via an app like Telegram, users can issue complex commands by simply holding down a button and speaking. This model removes the cognitive load of typing and makes interaction more natural and immediate.
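A loop like this can be wired up with an off-the-shelf bot framework. The sketch below assumes the python-telegram-bot library (v20+); the `transcribe`, `run_agent`, and `synthesize` functions are hypothetical placeholders for whatever speech-to-text, agent, and text-to-speech backends you plug in.

```python
# Sketch of a voice-in / voice-out agent loop on Telegram, assuming the
# python-telegram-bot (v20+) library. transcribe/run_agent/synthesize are
# hypothetical placeholders you would supply.
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters

async def on_voice(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Download the held-button voice note the user just sent.
    voice_file = await update.message.voice.get_file()
    await voice_file.download_to_drive("incoming.ogg")

    text = transcribe("incoming.ogg")   # speech -> text (placeholder)
    reply = run_agent(text)             # agent executes the command (placeholder)
    audio_path = synthesize(reply)      # text -> speech (placeholder)

    # Answer with a voice note, closing the voice-to-voice loop.
    with open(audio_path, "rb") as audio:
        await update.message.reply_voice(voice=audio)

app = ApplicationBuilder().token("YOUR_BOT_TOKEN").build()
app.add_handler(MessageHandler(filters.VOICE, on_voice))
app.run_polling()
```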

While most focus on human-to-computer interactions, Krisp.ai's founder argues that significant unsolved challenges and opportunities exist in using AI to improve human-to-human communication. This includes real-time enhancements like making a speaker's audio sound studio-quality with a single click, which directly boosts conversation productivity.

The next wave of AI assistants focuses on "interaction" or "bi-directional" models that can process information and respond in real time, allowing users to interrupt them naturally. Startups like Thinking Machines Lab are competing directly with giants like OpenAI to create a more fluid, human-like conversational experience, moving beyond today's turn-based models.

Advanced models are moving beyond simple prompt-response cycles. New interfaces, like the one in OpenAI's shopping model, allow users to interrupt the model's reasoning process (its "chain of thought") to provide real-time corrections, representing a powerful new way for humans to collaborate with AI agents.
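Mechanically, this can be thought of as a reasoning loop that checks for user input between steps instead of running to completion. A generic sketch of that pattern (not OpenAI's actual implementation), using a queue of corrections:

```python
import queue

# Generic sketch of an interruptible reasoning loop: between steps the agent
# drains a queue of user corrections and folds them into its plan. The `agent`
# object and its methods are hypothetical, shown only to illustrate the pattern.

corrections: "queue.Queue[str]" = queue.Queue()

def run_with_steering(agent, task):
    state = agent.start(task)
    while not agent.done(state):
        # Fold in anything the user said or typed since the last step.
        while not corrections.empty():
            state = agent.apply_correction(state, corrections.get())
        state = agent.step(state)   # one unit of reasoning or work
    return agent.result(state)
```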

Sam Altman highlights a key feature in new coding models: the ability for a user to interrupt and steer the AI while it's in the middle of a multi-hour task. This shifts the workflow from one-shot prompting to dynamic management, making the AI feel more like a true coworker you can course-correct in real time.

A new AI architecture from Thinking Machines Lab processes user interaction in continuous 200ms 'micro-turns' rather than waiting for a user to finish speaking. This allows for simultaneous listening and responding, moving AI from a static, email-like exchange to a dynamic, real-time partnership.
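The micro-turn idea can be sketched as a loop that wakes every 200 ms, ingests whatever audio has arrived, and decides whether to keep speaking, revise, or stay silent. The component names below are illustrative, not Thinking Machines Lab's actual interfaces.

```python
import time

MICRO_TURN_SECONDS = 0.2  # 200 ms slices, per the described architecture

# Illustrative micro-turn loop; mic/speaker/model are hypothetical components.
def run_micro_turns(mic, speaker, model):
    while True:
        turn_start = time.monotonic()

        audio_in = mic.read_available()    # whatever arrived in the last slice
        decision = model.update(audio_in)  # listen and re-plan on every slice

        if decision.speak:                 # may start, continue, or revise speech
            speaker.play(decision.audio_frame)
        elif decision.stop_speaking:
            speaker.stop()

        # Sleep out the remainder of the 200 ms slice.
        elapsed = time.monotonic() - turn_start
        time.sleep(max(0.0, MICRO_TURN_SECONDS - elapsed))
```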