To avoid the rapid obsolescence of hard-coded systems as LLMs improve, Blitzy made its architecture dynamic: agents are generated just-in-time, with prompts written and tools selected by other agents based on the latest model capabilities and the specific requirements of the task.
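A minimal sketch of the just-in-time idea, assuming a caller-supplied `planner` callable that returns parsed JSON; `AgentSpec` and `plan_agent` are hypothetical names, not Blitzy's actual interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentSpec:
    system_prompt: str
    tools: list[str]
    model: str

def plan_agent(task: str, tool_names: list[str], planner: Callable[[str], dict]) -> AgentSpec:
    """Have a planner model design the worker agent for this specific task."""
    plan = planner(
        "Design an agent for the task below.\n"
        f"Task: {task}\n"
        f"Available tools: {tool_names}\n"
        "Return JSON with 'system_prompt', 'tools', and 'model'."
    )
    # The worker's prompt, tool subset, and model are all chosen at request
    # time, so nothing is hard-coded against today's model capabilities.
    return AgentSpec(plan["system_prompt"], plan["tools"], plan["model"])
```

Because the prompt and tool list are produced at request time, a better planner model immediately upgrades every agent without any hard-coded prompts to rewrite.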
Relying solely on semantic similarity (RAG) is too imprecise for complex domains like code. Blitzy combines a deep, relational knowledge graph with semantic retrieval, treating the semantic match as a map to the source of truth rather than the truth itself.
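An illustrative sketch of that division of labor, assuming a `vector_index.nearest` lookup and a networkx graph whose nodes carry a `source` attribute (both hypothetical stand-ins):

```python
import networkx as nx

def retrieve_context(query_embedding, vector_index, graph: nx.DiGraph, k: int = 5):
    # 1. Semantic match: a cheap, fuzzy pointer to where the truth lives.
    candidate_ids = vector_index.nearest(query_embedding, k=k)  # assumed API

    # 2. Source of truth: read the actual definitions and their relations
    #    out of the knowledge graph instead of trusting the embedded text.
    context = []
    for node_id in candidate_ids:
        node = graph.nodes[node_id]
        neighbors = list(graph.predecessors(node_id)) + list(graph.successors(node_id))
        context.append({"source": node["source"], "related": neighbors})
    return context
```

The embedding result only nominates candidate nodes; the code that actually reaches the model comes from the graph.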
The approach isn't about fitting a massive codebase into one context window. Instead, it uses a deep relational knowledge graph to inject only the most relevant, line-level context for a specific task at the exact moment it's needed.
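Injection under a token budget might look like this sketch, where `spans` are line-level snippets already scored by the knowledge graph and `count_tokens` is whatever tokenizer the target model uses (both assumptions):

```python
def pack_context(spans, budget_tokens, count_tokens):
    """spans: [{'file': str, 'lines': (start, end), 'text': str, 'score': float}]"""
    packed, used = [], 0
    # Highest-relevance line spans first; stop adding once the budget is spent.
    for span in sorted(spans, key=lambda s: s["score"], reverse=True):
        cost = count_tokens(span["text"])
        if used + cost > budget_tokens:
            continue
        header = f"# {span['file']}:{span['lines'][0]}-{span['lines'][1]}"
        packed.append(f"{header}\n{span['text']}")
        used += cost
    return "\n\n".join(packed)
```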
Relying on a single model family for generation and review is suboptimal. Blitzy found that using models from different developers (e.g., OpenAI, Anthropic) to check each other's work produces tremendously better results, as each family has distinct strengths and reasoning patterns.
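A hedged sketch of a cross-family loop using the OpenAI and Anthropic Python SDKs; the model names are placeholders and the prompts are illustrative only:

```python
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def generate_patch(task: str) -> str:
    # One model family writes the change...
    resp = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": f"Write a patch for: {task}"}],
    )
    return resp.choices[0].message.content

def review_patch(task: str, patch: str) -> str:
    # ...and a different family, with different blind spots, reviews it.
    msg = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nReview this patch for defects:\n{patch}",
        }],
    )
    return msg.content[0].text
```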
When Blitzy's system fails to complete the final portion of a project, it's rarely a simple coding error. It's typically due to systemic issues a human would also struggle with, such as contradictory requirements in the spec or a situation where fixing one end-to-end test breaks another.
Fine-tuning creates model-specific optimizations that quickly become obsolete. Blitzy instead builds sophisticated, system-level "memory" that captures enterprise-specific context and preferences. Because this memory lives outside the model, it carries over as base models improve, whereas fine-tuning requires constant rework.
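A minimal sketch of what model-agnostic memory can look like, assuming a simple JSON file (`enterprise_memory.json`) as the store; the real system is presumably far richer:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("enterprise_memory.json")  # assumed storage location

def remember(key: str, value: str) -> None:
    """Record an enterprise convention, e.g. remember('logging', 'use structlog')."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory[key] = value
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def build_system_prompt(base_prompt: str) -> str:
    """Inject the memory into any model's prompt; nothing here is model-specific."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    conventions = "\n".join(f"- {k}: {v}" for k, v in memory.items())
    return f"{base_prompt}\n\nOrganization conventions:\n{conventions}"
```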
Even with large advertised context windows, LLMs show performance degradation and strange behaviors when overloaded. In a failure mode described as "context anxiety," they may give up prematurely on complex tasks, claim imaginary time constraints, or oversimplify the problem, highlighting the gap between advertised and effective context sizes.
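One practical response is to budget against an assumed effective window rather than the advertised one; the numbers below are illustrative, not measured:

```python
ADVERTISED_WINDOW = 200_000   # tokens the vendor advertises (example value)
EFFECTIVE_FRACTION = 0.25     # assumed safe fraction, not a measured figure

def check_prompt_size(prompt: str, count_tokens) -> str:
    """Refuse to send prompts that exceed the assumed effective window."""
    budget = int(ADVERTISED_WINDOW * EFFECTIVE_FRACTION)
    used = count_tokens(prompt)
    if used > budget:
        raise ValueError(
            f"Prompt uses {used} tokens but the effective budget is {budget}; "
            "trim the injected context instead of risking degraded output."
        )
    return prompt
```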
Simple, function-level evals are a "local optimization." Blitzy evaluates changes to its system by tasking it with completing large, real-world projects (e.g., modifying Apache Spark) and measuring the percentage of the project completed. This still requires human "taste" to judge the gap between functional correctness and true user intent.
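The shape of such a project-level eval, sketched with hypothetical names; each check is an end-to-end signal like a build step or integration test:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProjectEval:
    name: str                          # e.g. "add feature X to an Apache Spark fork"
    checks: list[Callable[[], bool]]   # build steps, end-to-end tests, lint gates

def completion_rate(evaluation: ProjectEval) -> float:
    """Fraction of acceptance checks the system's output satisfies."""
    passed = sum(1 for check in evaluation.checks if check())
    return passed / len(evaluation.checks)
```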
Static analysis isn't enough to understand a complex application. Blitzy's onboarding involves spinning up and running a parallel instance of the client's app. This process uncovers hidden runtime dependencies and behaviors, creating a far more accurate knowledge graph than code analysis alone could provide.
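A toy illustration of the principle (not Blitzy's tooling): tracing a running Python code path records caller-callee edges that static analysis would miss, such as dynamic dispatch or config-driven imports, which can then be merged into the knowledge graph.

```python
import sys

runtime_edges = set()

def _profiler(frame, event, arg):
    # Record caller -> callee edges as the app actually executes.
    if event == "call" and frame.f_back is not None:
        caller, callee = frame.f_back.f_code, frame.f_code
        runtime_edges.add(
            (f"{caller.co_filename}:{caller.co_name}",
             f"{callee.co_filename}:{callee.co_name}")
        )

def record_runtime_calls(entry_point):
    """Exercise one code path of the running app and return the observed edges."""
    sys.setprofile(_profiler)
    try:
        entry_point()
    finally:
        sys.setprofile(None)
    return runtime_edges
```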
The traditional lever of `temperature` for controlling model creativity has been superseded in modern reasoning models, where it's often fixed. The new critical parameter is the "thinking budget"—the amount of reasoning tokens a model can use before responding. A larger budget allows for more internal review and higher-quality outputs.
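A minimal sketch using the Anthropic Python SDK's extended-thinking parameter; the model name is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model name
    max_tokens=16_000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},  # the real quality lever
    messages=[{"role": "user", "content": "Refactor this module and explain the plan."}],
)
# Note: with extended thinking enabled, temperature isn't adjustable,
# which is exactly why the thinking budget becomes the control knob.
```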
The initial value from Blitzy isn't code generation, but fixing foundational issues. Its onboarding process creates a knowledge graph that improves documentation and test coverage. This provides immediate value by boosting the performance of all existing developer AI tools, like GitHub Copilot, even before writing new code.
Short-term, AI amplifies senior engineers who can validate its output. Long-term, as AI tools improve and coding becomes a commodity, the advantage will shift. Junior developers who are native to AI tooling and don't have to "unlearn" old habits will become highly valuable, especially given their lower cost.
