Traditional ML serving used "micro-batching," normalizing inputs to the same size so a batch could be processed as one unit. LLMs break this model due to variable input and output lengths. The core innovation is continuous batching: scheduling generation one token step at a time across all active requests, which creates complex scheduling and memory-management challenges addressed by techniques like PagedAttention.
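To make that concrete, here is a minimal sketch of a continuous-batching loop; `Request` and the stubbed `decode_step` are hypothetical stand-ins for a real engine's state and forward pass, not any particular serving framework's API.

```python
from collections import deque

class Request:
    def __init__(self, prompt_tokens, max_new_tokens):
        self.tokens = list(prompt_tokens)
        self.remaining = max_new_tokens

def decode_step(batch):
    """One forward pass: emit one new token per active request (stubbed)."""
    return [0 for _ in batch]  # placeholder token ids

def serve(waiting: deque, max_batch_size: int = 8):
    active = []
    while waiting or active:
        # Admit new requests at every step, not only at batch boundaries.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        for req, tok in zip(active, decode_step(active)):
            req.tokens.append(tok)
            req.remaining -= 1
        # Finished requests leave immediately, freeing their slot (and, in a
        # real engine, the KV-cache blocks that PagedAttention manages).
        active = [r for r in active if r.remaining > 0]
```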
The "Bitter Lesson" is not just about using more compute, but leveraging it scalably. Current LLMs are inefficient because they only learn during a discrete training phase, not during deployment where most computation occurs. This reliance on a special, data-intensive training period is not a scalable use of computational resources.
Making an API usable for an LLM is a novel design challenge, analogous to creating an ergonomic SDK for a human developer. It's not just about technical implementation; it requires a deep understanding of how the model "thinks," which is a difficult new research area.
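As an illustration of what "ergonomic for a model" can mean, here is a hedged sketch of one common convention, a JSON-Schema-style tool definition with self-describing names, explicit formats, and bounded parameters so the model can infer correct usage from the spec alone; the tool and its fields are hypothetical.

```python
# Hypothetical tool spec: descriptive naming and constrained types do for the
# model what good docs and types do for a human developer.
search_flights_tool = {
    "name": "search_flights",
    "description": (
        "Find one-way flights. Dates are ISO 8601 (YYYY-MM-DD). "
        "Returns at most `limit` results, sorted by price."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "origin":      {"type": "string", "description": "IATA code, e.g. 'SFO'"},
            "destination": {"type": "string", "description": "IATA code, e.g. 'JFK'"},
            "date":        {"type": "string", "description": "Departure date, YYYY-MM-DD"},
            "limit":       {"type": "integer", "minimum": 1, "maximum": 20, "default": 5},
        },
        "required": ["origin", "destination", "date"],
    },
}
```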
The current limitation of LLMs is their stateless nature; they reset with each new chat. The next major advancement will be models that can learn from interactions and accumulate skills over time, evolving from a static tool into a continuously improving digital colleague.
Unlike most LLM-based research assistants, which handle one deep-research task at a time, Manus can run multiple searches in parallel. A user can, for example, generate detailed reports on many distinct topics simultaneously, which makes it well suited to large-scale analysis.
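The underlying pattern is a simple concurrent fan-out. This sketch shows the shape of it with `asyncio` and a stubbed `research` task; it is an assumption about the pattern, not Manus's actual implementation.

```python
import asyncio

async def research(topic: str) -> str:
    """Stand-in for a long-running search-and-synthesize task."""
    await asyncio.sleep(1)  # placeholder for the actual agent work
    return f"Report on {topic}"

async def main():
    topics = ["GPU scheduling", "KV-cache paging", "token efficiency"]
    # Launch all research tasks concurrently and collect their reports.
    reports = await asyncio.gather(*(research(t) for t in topics))
    for report in reports:
        print(report)

asyncio.run(main())
```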
Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters to match how GPUs split up workloads; for example, a hidden dimension divisible by large powers of two keeps matrix multiplications aligned with GPU tile sizes, maximizing efficiency from day one.
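A toy illustration of that sizing step, assuming a 128-wide alignment target; the actual tile constant depends on the hardware and kernels, and is my assumption rather than Zyphra's published rule.

```python
def round_to_multiple(value: int, multiple: int = 128) -> int:
    """Round `value` to the nearest multiple of `multiple` (at least one)."""
    return max(multiple, round(value / multiple) * multiple)

suggested = 3100                  # dimension a scaling law might suggest
hidden_dim = round_to_multiple(suggested)
print(hidden_dim)                 # 3072 = 24 * 128, so GEMMs tile cleanly
```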
Contrary to the idea that infrastructure problems get commoditized, AI inference is growing more complex. This is driven by three factors: (1) increasing model scale (multi-trillion parameters), (2) greater diversity in model architectures and hardware, and (3) the shift to agentic systems that require managing long-lived, unpredictable state.
The binary distinction between "reasoning" and "non-reasoning" models is becoming obsolete. The more critical metric is now "token efficiency"—a model's ability to use more tokens only when a task's difficulty requires it. This dynamic token usage is a key differentiator for cost and performance.
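A toy way to quantify this (my framing, not a standard benchmark): track output tokens per solved task and check whether spend scales with difficulty. Here model "A" ramps tokens up only on the hard task, while model "B" burns tokens everywhere.

```python
runs = [
    # (model, difficulty, output_tokens, solved)
    ("A", "easy",  120, True), ("A", "hard", 2400, True),
    ("B", "easy", 1800, True), ("B", "hard", 2600, True),
]

for model in ("A", "B"):
    tokens = sum(t for m, _, t, ok in runs if m == model and ok)
    solved = sum(1 for m, _, _, ok in runs if m == model and ok)
    print(f"model {model}: {tokens / solved:.0f} tokens per solved task")
```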
Developing LLM applications requires searching three effectively unbounded design spaces: how information is represented, which tools the model can access, and the prompt itself. This makes the process less like engineering and more like an art, where intuition guides you to a local maximum rather than a single optimal solution.
Setting an LLM's temperature to zero should make its output deterministic, but in practice it doesn't. Floating-point addition is non-associative, so when work is parallelized across a GPU, the order in which partial sums are combined (which can vary with batch composition and kernel scheduling) produces tiny numerical differences in the logits, and those differences can flip the selected token.
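The non-associativity itself is easy to demonstrate on a CPU. This snippet shows both the grouping effect and the order dependence that parallel reductions inherit.

```python
import random

# Grouping changes the result: float addition is not associative.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)   # 0.0  (0.1 is absorbed by 1e20's magnitude)
print(a + (b + c))   # 0.1

# Summation order matters too; a parallel GPU reduction combines partial
# sums in a hardware-dependent order, so logits can differ in the last bits.
random.seed(0)
values = [random.gauss(0, 1) for _ in range(100_000)]
print(sum(values) == sum(sorted(values)))  # typically False
```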
Unlike traditional computing workloads, where inputs were standardized, LLMs handle requests of varying lengths and produce outputs whose length is not known in advance. This unpredictability creates major scheduling and memory-management challenges on GPUs, which were not designed for such chaotic, real-time workloads.