Dropbox Switched from Go to Rust to Prevent Cascading Failures Caused by OOM Errors

Related Insights

Simple Systems Are Harder to Design Than Complex Ones, Prioritizing Validation Over Sophistication

Dropbox's former top engineer argues that designing for simplicity, validation, and understandability is more valuable long-term than creating intellectually complex systems. A simple system is more maintainable and its failure modes are easier to grasp, which is crucial for reliability.

Dropbox’s Former Most Senior Eng: Building Great Systems and Advice for the AI Era | James Cowling

The Peterman Pod·2 months ago

Production Async Systems Require Runtime Contracts, Not Ad-Hoc Promise Chains

Simple concurrency helpers or custom promise chains fail in production. Robust systems need a "runtime contract" that enforces strict rules like concurrency limits, retry policies with backoff, and automatic cancellation of related tasks. This ensures predictable behavior and prevents cascading failures.

The Hidden Cost of Promise.race in Production AI Workloads

Machine Learning Tech Brief By HackerNoon·2 months ago
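
As a rough illustration of the "runtime contract" idea in the insight above, the sketch below puts a concurrency limit, retries with exponential backoff, and cancellation of sibling tasks behind one entry point. It uses Python's asyncio rather than JavaScript promises, and the name run_with_contract and all of its parameters are hypothetical, not taken from the episode or from any library.

```python
import asyncio
import random


async def run_with_contract(jobs, *, max_concurrency=2, max_retries=3, base_delay=0.01):
    """Run zero-argument coroutine functions under one shared set of rules.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    limit = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        for attempt in range(max_retries + 1):
            try:
                # Concurrency limit: at most max_concurrency jobs run at once.
                async with limit:
                    return await job()
            except Exception:
                # Cancellation is not an Exception subclass, so it passes through.
                if attempt == max_retries:
                    raise
                # Exponential backoff with jitter; the slot is released while waiting.
                await asyncio.sleep(base_delay * 2 ** attempt * (1 + random.random()))

    tasks = [asyncio.create_task(guarded(job)) for job in jobs]
    try:
        return await asyncio.gather(*tasks)
    except BaseException:
        # One permanent failure cancels every related task still in flight.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

Callers would pass a list of coroutine functions, for example asyncio.run(run_with_contract([fetch_a, fetch_b])), where fetch_a and fetch_b are placeholders. Keeping the rules in one place means each call site cannot quietly skip the limit or the retry policy.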

Massive AI Scale Exposes Classic Systems Problems, Not Just Novel LLM Issues

When running AI inference at extreme scale, the most surprising and difficult challenges are often not unique to LLMs. Instead, they are classic distributed systems problems—like kernel panics caused by logging overload—that only manifest under immense load. The immaturity of runtimes exacerbates these issues.

Baseten CEO Tuhin Srivastava on the AI Inference Crunch, Custom Models, and Building the Inference Cloud

No Priors: Artificial Intelligence | Technology | Startups·2 months ago

Mozilla Created Rust to Solve C++'s Memory Safety and Concurrency Flaws

The creation of the Rust programming language was a direct response to fundamental weaknesses in C++. Mozilla needed a way to eliminate entire classes of security vulnerabilities (memory safety) and safely leverage multi-core processors (concurrency), which were intractable problems in its massive C++ codebase.

Mozilla Firefox CTO on Browser War Stories and the Path to Distinguished Engineer

The Peterman Pod·9 months ago

Horizontally Scalable Systems' Ultimate Bottleneck Is the Central Storage Layer

In systems like Kubernetes, most components like API servers and schedulers can be scaled out by adding more instances. The true bottleneck preventing an order-of-magnitude scale increase is the consistent storage layer (e.g., etcd). All major scaling efforts eventually focus on optimizing or replacing this single, critical component.

The Co-Creator of Kubernetes On Convincing Google, Building It, and Scaling for LLMs

The Peterman Pod·4 months ago

Revel Built a Crash-Proof Programming Language to Control Critical Hardware Systems

To mitigate the risk of expensive physical failures, hardware control software company Revel developed its own programming language. A core feature is that if code compiles successfully, it is guaranteed not to crash at runtime. This design choice eliminates a common source of catastrophic errors in hardware operation.

NVIDIA Earnings, Nano Banana 2, Block Cuts 20% of Workforce | Kenn Ricci, Howard Marks, Yash Patil, Scott Morton, Fan-Yun Sun, Adam Draper & Doug Bernauer, Sammy Azdoufal

TBPN·5 months ago

Dropbox Used AWS S3 as an Elastic "Escape Hatch" to Maximize Its Own Data Center Capacity

Instead of overprovisioning their physical hardware for worst-case scenarios, Dropbox built a system to temporarily offload writes to S3 when capacity got tight. This "escape hatch" allowed them to operate their own infrastructure with much higher utilization, providing the benefits of on-prem cost savings with cloud-like elasticity.

Dropbox’s Former Most Senior Eng: Building Great Systems and Advice for the AI Era | James Cowling

The Peterman Pod·2 months ago
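
The "escape hatch" described above can be pictured as a write router that sends data to an overflow store once the primary store passes a utilization threshold. The sketch below is a toy, in-memory model of that idea only: Store, WriteRouter, and the 0.9 high watermark are hypothetical names and values, and nothing here reflects how Dropbox's system is actually built.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy in-memory blob store with a fixed capacity (illustrative only)."""
    capacity_bytes: int
    blobs: dict = field(default_factory=dict)

    @property
    def used_bytes(self):
        return sum(len(data) for data in self.blobs.values())


@dataclass
class WriteRouter:
    primary: Store                # stands in for the on-prem storage fleet
    overflow: Store               # stands in for an elastic store such as S3
    high_watermark: float = 0.9   # assumed threshold, not a figure from the source

    def put(self, key, data):
        """Write to primary unless that would push it past the watermark."""
        projected = (self.primary.used_bytes + len(data)) / self.primary.capacity_bytes
        if projected > self.high_watermark:
            self.overflow.blobs[key] = data
            return "overflow"
        self.primary.blobs[key] = data
        return "primary"
```

Under this model the primary store can be sized for typical load rather than peak load, since writes that would exceed the watermark land in the overflow store instead of failing.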

Replace an Operating System's Guts With a Database for Superior Performance and Reliability

The DBOS project, co-founded by Stonebraker, argues operating systems primarily manage data at scale. Replacing core OS components (like the file system and scheduler) with a database engine can lead to faster performance, built-in high availability, and transactional guarantees for system operations, with "really no downside."

Turing Award Winner: Postgres, Disagreeing with Google, Future Problems | Mike Stonebraker

The Peterman Pod·3 months ago

In Massive GPU Clusters, the Probability of All Components Working is Zero; Design For Failure

When building systems with hundreds of thousands of GPUs and millions of components, it's a statistical certainty that something is always broken. Therefore, hardware and software must be architected from the ground up to handle constant, inevitable failures while maintaining performance and service availability.

Nvidia CTO Michael Kagan: Scaling Beyond Moore's Law to Million-GPU Clusters

Training Data·9 months ago
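
The "statistical certainty" in the insight above follows from simple arithmetic: if each of N components independently fails with probability p, the chance that all of them are healthy at once is (1 - p)^N. The sketch below works this out. The independence assumption and the example figures used with it (one million components, p = 0.0001) are illustrative, not numbers from the episode.

```python
import math


def p_all_healthy(n_components, p_fail):
    """Probability that every one of n independent components is healthy.

    Computes (1 - p_fail) ** n_components via logarithms for numerical stability.
    """
    return math.exp(n_components * math.log1p(-p_fail))


def expected_failures(n_components, p_fail):
    """Expected number of components that are broken at any given moment."""
    return n_components * p_fail
```

With the assumed figures, expected_failures(1_000_000, 1e-4) gives about 100 broken components at any moment, and p_all_healthy(1_000_000, 1e-4) is roughly e^-100, which is zero for any practical purpose.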

Dropbox Beat AWS S3 on Cost by Optimizing for Its Specific Workload Patterns

While generally not recommended, Dropbox successfully migrated off S3 by building a storage system tailored to its unique file access patterns. This hyper-specialization, combined with using cutting-edge hardware, allowed them to achieve cost efficiencies that a general-purpose service like S3 couldn't match.

Dropbox’s Former Most Senior Eng: Building Great Systems and Advice for the AI Era | James Cowling

The Peterman Pod·2 months ago
