Instead of building a generic graph compiler, Etched focused on hand-optimized kernels. This approach, similar to that of high-frequency trading firms, targets maximum performance. Etched also sees it as future-proof: the tools are designed for AI models to use directly, anticipating a time when AI writes its own kernels.

Related Insights

The most in-demand skill at labs like Google DeepMind is low-level engineering for accelerating LLM runtime. This involves creating efficient, custom software artifacts (kernels) for new neural net architectures and serving techniques at scale.

Top-tier kernels like FlashAttention are co-designed with specific hardware (e.g., H100). This tight coupling makes waiting for future GPUs an impractical strategy. The competitive edge comes from maximizing the performance of available hardware now, even if it means rewriting kernels for each new generation.

Adding more FLOPS to current AI chips is useless due to thermal throttling. Etched realized the solution is lowering voltage, because dynamic power consumption scales with the square of supply voltage. Inspired by Bitcoin miners, they created a new power delivery system enabling chips to run at under half the voltage of GPUs.
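
As a rough illustration of the quadratic claim, the standard first-order model for dynamic switching power in CMOS is P ≈ C·V²·f (capacitance, supply voltage, clock frequency). The sketch below applies that model to voltage and frequency ratios. The function name and the example ratios are illustrative assumptions, not figures from Etched, and the model ignores leakage power and the fact that lowering voltage usually limits achievable clock frequency.

```python
def relative_dynamic_power(voltage_ratio: float, frequency_ratio: float = 1.0) -> float:
    """Dynamic power relative to a baseline, using the first-order model P ~ C * V^2 * f.

    voltage_ratio and frequency_ratio are new/baseline values; capacitance is
    assumed unchanged (an assumption made for this sketch).
    """
    return voltage_ratio ** 2 * frequency_ratio


# Halving the supply voltage at the same clock frequency:
half_voltage = relative_dynamic_power(0.5)
print(f"Relative dynamic power at half voltage: {half_voltage:.2f}")

# Halving the voltage and also dropping the clock by 20% (hypothetical numbers):
slower_clock = relative_dynamic_power(0.5, 0.8)
print(f"Relative dynamic power at half voltage, 0.8x clock: {slower_clock:.2f}")
```

Under this model, running at half the voltage cuts dynamic power to about a quarter at a fixed clock, which is the headroom that can then be spent on more compute within the same thermal budget.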

The pace of AI model improvement is faster than the ability to ship specific tools. By creating lower-level, generalizable tools, developers build a system that automatically becomes more powerful and adaptable as the underlying AI gets smarter, without requiring re-engineering.

OpenAI is designing its custom chip for flexibility, not just raw performance on current models. The team learned that major 100x efficiency gains come from evolving algorithms (e.g., dense to sparse transformers), so the hardware must be adaptable to these future architectural changes.

Fal maintains a performance edge by building a specialized just-in-time (JIT) compiler for diffusion models. This verticalized approach, inspired by PyTorch 2.0 but more focused, generates more efficient kernels than generalized tools, creating a defensible technical moat.

Instead of using high-level compilers like Triton, elite programmers design algorithms based on specific hardware properties (e.g., AMD's MI300X). This bottom-up approach ensures the code fully exploits the hardware's strengths, a level of control often lost through abstractions like Triton.

To build a truly AI-native engineering team, Artemis makes technical architecture decisions based on a primary question: will this choice increase or decrease the likelihood of AI tools generating correct answers? This optimizes the entire system for AI-assisted development and debugging.

Etched builds its own chips, boards, cold plates, interconnects, and even its own racks. This full-stack ownership allows for extreme parallelization and iteration speed, a key advantage over startups that rely on a fragmented supply chain and multiple vendors.

Instead of focusing on on-chip memory bandwidth, Etched optimized for cluster-scale memory. They built a custom interconnect that cuts chip-to-chip latency by over 5x compared to GPUs. This allows the memory of the entire cluster to function as a single, low-latency pool, dramatically improving performance.