A key advancement in Fable is its ability to exercise judgment. When it receives feedback from a human or another AI, it can analyze the suggestion and disagree, explaining why its original approach suits the context better, much as a senior collaborator would.

Related Insights

Meaningful AI criticism no longer comes from armchair philosophy; it requires deep mathematical and engineering proofs. AIs like GPT-3 can generate criticism that is as good as, if not better than, that of human critics who lack a technical understanding of how the models are built.

A key indicator of advancing AI is the ability to not just answer a question, but to evaluate its premise. GPT-5.5 demonstrates this by identifying and gently rejecting a nonsensical prompt ('Should I drive to the car wash?') while maintaining a helpful, conversational tone, a historically difficult task for LLMs.

When an AI model generates code, the focus of a pull request review changes. It's no longer just about whether the code works. The engineer must now explain and defend the architectural choices the model made, demonstrating they understand the implications and haven't just accepted a default, suboptimal solution.

By programming one AI agent with a skeptical persona to question strategy and check details, the overall quality and rigor of the entire multi-agent system increases, mirroring the effect of a critical thinker in a human team.

To overcome the challenge of reviewing AI-generated code, have different LLMs like Claude and Codex review the code. Then, use a "peer review" prompt that forces the primary LLM to defend its choices or fix the issues raised by its "peers." This adversarial process catches more bugs and improves overall code quality.
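A minimal Python sketch of the "peer review" step described above. It assumes each model is wrapped as a plain function that takes a prompt string and returns a reply; the names `Model`, `collect_reviews`, and `build_peer_review_prompt` and the prompt wording are hypothetical, not taken from any particular tool.

```python
from typing import Callable, Dict

# Hypothetical type: any function that takes a prompt and returns the model's reply.
Model = Callable[[str], str]


def collect_reviews(code: str, reviewers: Dict[str, Model]) -> Dict[str, str]:
    """Ask each reviewer model for an independent critique of the same code."""
    prompt = f"Review this code for bugs and design problems:\n\n{code}"
    return {name: review(prompt) for name, review in reviewers.items()}


def build_peer_review_prompt(code: str, reviews: Dict[str, str]) -> str:
    """Combine the critiques into one prompt for the model that wrote the code.

    The wording is an illustrative assumption: it asks the author model to
    either fix each issue or defend its original choice.
    """
    sections = [f"Reviewer {name} said:\n{text}" for name, text in reviews.items()]
    return (
        "Your peers reviewed the code below.\n"
        "For each point, either fix the issue or explain why your original "
        "choice is better.\n\n"
        f"Code:\n{code}\n\n" + "\n\n".join(sections)
    )
```

The resulting prompt would then be sent back to the primary model, whose reply either contains fixes or a defense of each decision.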

Prompting a different LLM to review code generated by the first one provides a powerful, non-defensive critique. This "second opinion" can rapidly identify architectural issues, bugs, and alternative approaches without the human ego involved in traditional code reviews.

While correcting AI outputs in batches is a powerful start, the next frontier is creating interactive AI pipelines. These advanced systems can recognize when they lack confidence, intelligently pause, and request human input in real time. This transforms the human's role from an after-the-fact reviewer to an active, on-demand collaborator.
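One way such a pause could look in code, as a sketch only. It assumes the model step returns a confidence score between 0 and 1 (how that score is estimated is left open), and `Prediction`, `run_pipeline`, `predict`, `ask_human`, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Prediction:
    item: str
    answer: str
    confidence: float  # assumed to lie in [0, 1]


def run_pipeline(
    items: Iterable[str],
    predict: Callable[[str], Prediction],
    ask_human: Callable[[Prediction], str],
    threshold: float = 0.8,
) -> List[Prediction]:
    """Process items one by one, pausing for human input on low confidence."""
    results = []
    for item in items:
        pred = predict(item)
        if pred.confidence < threshold:
            # Hand the item to a person instead of keeping a shaky answer.
            # Treating the human answer as fully trusted (1.0) is an assumption.
            pred = Prediction(item, ask_human(pred), 1.0)
        results.append(pred)
    return results
```

In an interactive setting `ask_human` might block on a chat message or a review queue; here it is just a callback so the control flow stays visible.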

A powerful evaluation technique is to ask an AI agent to analyze its own poor output. The agent can review its context and process, explain why it made a mistake, and even suggest how to update its own instructions to prevent future errors.

AI models often default to being agreeable (sycophancy), which limits their value as a thought partner. To get valuable, critical feedback, users must explicitly instruct the AI in their prompt to take on a specific persona, such as a skeptic or a harsh editor, to challenge their ideas.

Shopify's CTO argues against running many AI agents in parallel. A more effective, higher-quality method is a "critique loop," where one agent (ideally using a different model) reviews and suggests improvements to another's work. Though slower, this process significantly boosts code quality.
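The critique loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow Shopify uses: `generate` and `critique` are hypothetical callables standing in for two different models, and the `max_rounds` cap is an added safeguard so the loop always terminates.

```python
from typing import Callable, Optional


def critique_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Alternate between a generator and a critic until the critic is satisfied.

    generate(task, feedback) returns a draft; feedback is None on the first call.
    critique(task, draft) returns feedback text, or None when it has no objections.
    """
    draft = generate(task, None)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:
            break  # the critic accepted the draft
        draft = generate(task, feedback)
    return draft
```

Each round is sequential, which is why this approach is slower than running agents in parallel: the critic cannot start until the draft exists, and the next draft waits on the critique.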