Many people struggle to define what 'good' looks like. Building an evaluation (eval) for an AI system requires you to codify your quality standards, forcing a level of clarity and commitment that improves your own process and the AI's output.
If you struggle to see your work in terms of 'workflows,' try this: at the end of each day, tell an AI like Codex what you did. After a week, ask it to analyze the transcripts and suggest the most repetitive, time-consuming tasks to automate first.
AI agents can overcomplicate instructions and create 'AI sprop' (slop/propaganda). To combat this, build a dedicated 'skill editor' skill that runs on other skills to make them more concise, remove repetitive instructions, and maintain clarity in your automations.
AI agents struggle to reliably distinguish adjacent scores such as '3 out of 5' and '4 out of 5.' For effective self-correction in automated workflows, structure your evaluations (evals) as a series of unambiguous, binary pass/fail checks.
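As a rough illustration of that idea, here is a minimal Python sketch of an eval built from binary checks. Everything in it is an assumption made for the example: the `Check` type, the `run_eval` and `failed_checks` helpers, and the three criteria (a first line of 60 characters or fewer, 150 words or fewer, ending with a question) are hypothetical stand-ins for whatever 'good' means in your own workflow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One binary criterion: a name plus a function that returns True or False."""
    name: str
    passes: Callable[[str], bool]


# Hypothetical criteria for a short social post; replace with your own standards.
CHECKS = [
    Check(
        "first_line_60_chars_or_fewer",
        lambda d: bool(d.strip()) and len(d.strip().splitlines()[0]) <= 60,
    ),
    Check("150_words_or_fewer", lambda d: len(d.split()) <= 150),
    Check("ends_with_question", lambda d: d.rstrip().endswith("?")),
]


def run_eval(draft: str, checks: list[Check] = CHECKS) -> dict[str, bool]:
    """Run every check and return a pass/fail result per check name."""
    return {check.name: check.passes(draft) for check in checks}


def failed_checks(results: dict[str, bool]) -> list[str]:
    """List the names of the checks that failed, in the order they were run."""
    return [name for name, ok in results.items() if not ok]


if __name__ == "__main__":
    draft = "Evals force clarity\n\nWriting checks makes you define 'good'. What would yours be?"
    results = run_eval(draft)
    print(results)
    print("Needs revision:", failed_checks(results))
```

Because each result is a plain True or False, the list of failed check names can be handed back to the agent as a concrete revision instruction, which is harder to do with a score like '3 out of 5.'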
The hardest part of AI automation is codifying what 'good' looks like. Creators possess deep knowledge of winning formulas for platforms like YouTube or LinkedIn. Businesses can hire them to create these evaluation systems (evals), rapidly up-leveling their in-house teams.
Like a product requirements document (PRD), an AI skill and its evaluation (eval) are never 'done.' As you use the system, you'll learn new things. Continually ask the AI to update its own instructions to build increasingly effective automations over time.
Constantly using AI for initial drafts can erode your ability to start from a blank page. Your brain's 'first-principles' problem-solving muscle weakens, and you risk becoming merely an editor of AI output rather than a true originator of ideas.
