A common hiring mistake is prioritizing a conversational 'vibe check' over assessing actual skills. A much better approach is to give candidates a project that simulates the job's core responsibilities, providing a direct and clean signal of their capabilities.
AI models will quickly automate the routine majority of expert work but will struggle with the final, most complex 25%. For a long time, human expertise will be essential for this 'last mile,' making it the ultimate bottleneck and source of economic value.
The emerging job of training AI agents will be accessible to non-technical experts. The only critical skill will be leveraging deep domain knowledge to identify where a model makes a mistake, opening a new career path for most knowledge workers.
AI struggles with long-horizon tasks not just due to technical limits, but because we lack good ways to measure performance. Once effective evaluations (evals) for these capabilities exist, researchers can rapidly optimize models against them, accelerating progress significantly.
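To make the "optimize against an eval" loop concrete, here is a minimal sketch of a long-horizon eval harness. All names (`run_agent`, `score_attempt`, the checkpoint scheme) are hypothetical illustrations, not any particular lab's methodology: each task lists milestones the agent's transcript should hit, and the average checkpoint score becomes the single number researchers can drive upward.

```python
def score_attempt(transcript, checkpoints):
    """Fraction of required milestones found in the agent's transcript."""
    hit = sum(1 for c in checkpoints if c in transcript)
    return hit / len(checkpoints)

def evaluate(run_agent, tasks):
    """Average checkpoint score across all tasks: the target to optimize.

    run_agent: callable taking a prompt and returning a transcript string.
    tasks: list of {"prompt": str, "checkpoints": [str, ...]}.
    """
    scores = [score_attempt(run_agent(t["prompt"]), t["checkpoints"])
              for t in tasks]
    return sum(scores) / len(scores)
```

The point of the sketch is that once such a harness exists, "progress on long-horizon work" stops being a vague judgment and becomes a measurable quantity that training runs can climb.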
The correlation between dyslexia and entrepreneurship may exist because the condition forces individuals to master delegation from a young age. Developing this crucial leadership skill early provides an advantage over equally competent peers, who often learn it much later in their careers.
Data that measures success, like a grading rubric, is far more valuable for AI training than simple raw output. This 'second kind of data' enables iterative learning by allowing models to attempt a problem, receive a score, and learn from the feedback.
Knowledge work will shift from performing repetitive tasks to teaching AI agents how to do them. Workers will identify agent mistakes and turn them into reinforcement learning (RL) environments, creating a high-leverage, fixed-cost asset similar to software.
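A logged agent mistake can be packaged as a tiny, reusable RL-style environment. The sketch below uses a gym-like `reset`/`step` interface with entirely hypothetical names: the prompt, the wrong answer the agent gave, and the expert-verified correct answer become a fixed-cost asset that any future model can be trained and re-tested against.

```python
class MistakeEnv:
    """Single-step RL environment built from one logged agent mistake."""

    def __init__(self, prompt, wrong_answer, correct_answer):
        self.prompt = prompt
        self.wrong = wrong_answer      # what the agent originally produced
        self.correct = correct_answer  # what the expert says it should be

    def reset(self):
        """Start an episode by replaying the original prompt."""
        return self.prompt

    def step(self, action):
        """Reward the verified answer, penalize repeating the old mistake."""
        if action == self.correct:
            reward = 1.0
        elif action == self.wrong:
            reward = -1.0
        else:
            reward = 0.0
        return reward, True  # episode ends after one step
```

Like software, the environment is written once and pays off every time a model trains against it.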
The most significant gap in AI research is its focus on academic evaluations instead of tasks customers value, like medical diagnosis or legal drafting. The solution is using real-world experts to define benchmarks that measure performance on economically relevant work.
To teach AI subjective skills like poetry, a group of experts with some disagreement is better than one with full consensus. This approach captures diverse tastes and edge cases, which is more valuable for creating a robust model than achieving perfect agreement.
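One simple way to capture that disagreement, rather than erase it, is to train on the full distribution of expert labels instead of a majority vote. The helper below is an illustrative sketch, not a specific lab's pipeline:

```python
from collections import Counter

def label_distribution(ratings):
    """Return each label's share of expert votes, preserving disagreement.

    A majority vote would collapse ["good", "good", "bad"] to "good";
    the distribution keeps the minority taste as a soft training signal.
    """
    counts = Counter(ratings)
    n = len(ratings)
    return {label: c / n for label, c in counts.items()}
```

A model trained on these soft labels learns that a poem can be 75% "good" and 25% "bad," which is exactly the texture a fully unanimous panel would have thrown away.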
AI performs poorly in areas where expertise is based on unwritten 'taste' or intuition rather than documented knowledge. If the correct approach doesn't exist in training data or isn't explicitly provided by human trainers, models will inevitably struggle with that particular problem.
Rather than creating assessments that prohibit AI use, hiring managers should embrace it. A candidate's ability to leverage tools like ChatGPT to complete a project is a more accurate predictor of their future impact than their ability to perform tasks without them.
To measure an AI model's economic value, survey domain experts on how they allocate their time across various tasks. This time-allocation data serves as a proxy for the economic weight of each task, against which the model's performance can be scored.
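The weighting described above reduces to a short computation. In this sketch (task names and numbers are invented for illustration), each task's model score is weighted by the fraction of expert time that task consumes, yielding one economically weighted number:

```python
def economic_score(time_shares, task_scores):
    """Weight per-task model scores by experts' reported time allocation.

    time_shares: {task: fraction of expert time spent on it}
    task_scores: {task: model performance on that task, 0..1}
    Tasks the model was never scored on contribute 0.
    """
    total = sum(time_shares.values())  # normalize in case shares don't sum to 1
    return sum(time_shares[t] / total * task_scores.get(t, 0.0)
               for t in time_shares)
```

So a model that excels at note-taking but fails at diagnosis scores poorly if experts report spending most of their time diagnosing, which is precisely the economic signal an academic benchmark misses.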
