While prompt optimization is theoretically appealing, OpenPipe's team evaluated methods like JEPA and found they provided only minor boosts. Their RL fine-tuning methods delivered vastly superior results (96% vs 56% on a benchmark), suggesting weight updates still trump prompt engineering for complex tasks.
People struggle with AI prompts because the model lacks background on their goals and progress. The solution is 'Context Engineering': creating an environment where the AI continuously accumulates user-specific information, materials, and intent, reducing the need for constant prompt tweaking.
With models like Gemini 3, the key skill is shifting from crafting hyper-specific, constrained prompts to making ambitious, multi-faceted requests. Users trained on older models tend to pare down their asks, but the latest AIs are 'pent up with creative capability' and yield better results from bigger challenges.
Training AI agents to execute multi-step business workflows demands a new data paradigm. Companies create reinforcement learning (RL) environments—mini world models of business processes—where agents learn by attempting tasks, a more advanced method than simple prompt-completion training (SFT/RLHF).
Instead of manually refining a complex prompt, create a process where an AI agent evaluates its own output. By providing a framework for self-critique, including quantitative scores and qualitative reasoning, the AI can iteratively enhance its own system instructions and achieve a much stronger result.
Beyond supervised fine-tuning (SFT) and human feedback (RLHF), reinforcement learning (RL) in simulated environments is the next evolution. These "playgrounds" teach models to handle messy, multi-step, real-world tasks where current models often fail catastrophically.
OpenAI favors "zero gradient" prompt optimization because serving thousands of unique, fine-tuned model snapshots is operationally very difficult. Prompt-based adjustments allow performance gains without the immense infrastructure burden, making it a more practical and scalable approach for both OpenAI and developers.
The early focus on crafting the perfect prompt is obsolete. Sophisticated AI interaction is now about 'context engineering': architecting the entire environment by providing models with the right tools, data, and retrieval mechanisms to guide their reasoning process effectively.
Basic supervised fine-tuning (SFT) only adjusts a model's style. The real unlock for enterprises is reinforcement fine-tuning (RFT), which leverages proprietary datasets to create state-of-the-art models for specific, high-value tasks, moving beyond mere 'tone improvements.'
Fine-tuning an AI model is most effective when you use high-signal data. The best source for this is the set of difficult examples where your system consistently fails. The processes of error analysis and evaluation naturally curate this valuable dataset, making fine-tuning a logical and powerful next step after prompt engineering.
Good Star Labs found GPT-5's performance in their Diplomacy game skyrocketed with optimized prompts, moving it from the bottom to the top. This shows a model's inherent capability can be masked or revealed by its prompt, making "best model" a context-dependent title rather than an absolute one.