While AI can inherit biases from training data, those datasets can be audited, benchmarked, and corrected. In contrast, uncovering and remedying the complex cognitive biases of a human judge is far more difficult and less systematic, making algorithmic fairness a potentially more solvable problem.
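A minimal sketch of what such a dataset audit can look like, assuming a hypothetical pandas DataFrame with `group` and `outcome` columns; real audits would use richer fairness metrics, but the idea of measuring disparities directly is the same.

```python
import pandas as pd

def audit_outcome_rates(df: pd.DataFrame, group_col: str, outcome_col: str) -> pd.Series:
    """Compare positive-outcome rates across groups to surface disparities."""
    rates = df.groupby(group_col)[outcome_col].mean()
    # A large gap between the highest and lowest rate flags a disparity worth investigating.
    print(f"Max disparity across groups: {rates.max() - rates.min():.3f}")
    return rates

# Purely illustrative toy data:
df = pd.DataFrame({
    "group":   ["A", "A", "B", "B", "B"],
    "outcome": [1, 0, 0, 0, 1],
})
print(audit_outcome_rates(df, "group", "outcome"))
```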
Even if roles such as judge are legally protected from direct AI replacement, they can be automated de facto. If every judge uses the same AI model for decision support, the result is systemic homogenization of judgment, creating a centralized point of failure without any formal automation.
When discussing AI risks like hallucinations, former Chief Justice McCormack argues the proper comparison isn't a perfect system, but the existing human one. Humans get tired, carry biases, and make mistakes. The question isn't whether AI is flawless, but whether it's an improvement over the error-prone reality.
Treating ethical considerations as a post-launch fix creates massive "technical debt" that is nearly impossible to resolve. Just as an AI trained to detect melanoma on one skin tone fails on others, solutions built on biased data are fundamentally flawed. Ethics must be baked into the initial design and data-gathering process.
Unlike a human judge, whose mental process is hidden, an AI dispute resolution system can be designed to provide a full audit trail. It can be required to 'show its work,' explaining its step-by-step reasoning, potentially offering more accountability than the current system allows.
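A minimal sketch of the "show its work" idea, assuming a hypothetical decision-support wrapper that records every intermediate reasoning step to an audit trail that reviewers or appellate parties could inspect later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditTrail:
    case_id: str
    steps: list[dict] = field(default_factory=list)

    def record(self, rule: str, evidence: str, conclusion: str) -> None:
        # Each step is stored with a timestamp so the full chain of reasoning
        # can be reconstructed and reviewed after the decision is made.
        self.steps.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "rule": rule,
            "evidence": evidence,
            "conclusion": conclusion,
        })

trail = AuditTrail(case_id="2024-CV-0001")
trail.record(rule="late-fee clause 4.2",
             evidence="payment logged 12 days after the due date",
             conclusion="late fee applies")
for step in trail.steps:
    print(step)
```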
The belief that simply 'hiring the best person' ensures fairness is flawed because human bias is unavoidable. A true merit-based system requires actively engineering bias out of processes through structured interviews, clear job descriptions, and intentionally sourcing from diverse talent pools.
A key challenge in AI adoption is not technological limitation but human over-reliance. 'Automation bias' occurs when people accept AI outputs without critical evaluation. This failure to scrutinize AI suggestions can lead to significant errors that a human check would have caught, making user training and verification processes essential.
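One way to operationalize that verification step, sketched below with a hypothetical suggestion object carrying a confidence score: rather than trusting every output, route low-confidence or high-impact suggestions to mandatory human review.

```python
def requires_human_review(suggestion: dict, threshold: float = 0.9) -> bool:
    """Flag AI suggestions that must be independently verified before acceptance."""
    # Low-confidence or high-impact suggestions are never auto-accepted.
    return suggestion["confidence"] < threshold or suggestion.get("high_impact", False)

suggestion = {"text": "Deny claim under section 7", "confidence": 0.82, "high_impact": True}
if requires_human_review(suggestion):
    print("Route to a human reviewer with the source documents attached.")
else:
    print("Accept, subject to random spot-check sampling.")
```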
National tests in Sweden revealed human evaluators for oral exams were shockingly inconsistent, sometimes performing worse than random chance. While AI grading has its own biases, they can be identified and systematically adjusted, unlike hidden human subjectivity.
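A minimal sketch of what "identified and systematically adjusted" can mean in practice, assuming hypothetical AI scores alongside a small set of human-moderated anchor exams used to estimate and correct a systematic offset.

```python
import statistics

ai_scores     = [72, 65, 80, 58, 90]   # AI-assigned grades on anchor exams
anchor_scores = [75, 70, 82, 63, 91]   # consensus grades from a moderation panel

# Estimate the systematic offset and apply it as a calibration correction.
offset = statistics.mean(a - s for a, s in zip(anchor_scores, ai_scores))
calibrated = [round(s + offset, 1) for s in ai_scores]
print(f"Estimated systematic bias: {offset:+.1f} points")
print("Calibrated scores:", calibrated)
```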
Citing high rates of appellate court reversals and a 3-5% error rate in criminal convictions revealed by DNA evidence, former Chief Justice McCormack argues the human-led justice system is not as reliable as it is perceived to be. This fallibility creates a clear opening for AI to improve accuracy and consistency.
A comprehensive approach to mitigating AI bias requires addressing three separate components. First, de-bias the training data before it's ingested. Second, audit and correct biases inherent in pre-trained models. Third, implement human-centered feedback loops during deployment to allow the system to self-correct based on real-world usage and outcomes.
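A skeleton of that three-stage structure is sketched below; the function bodies are placeholders and the names are hypothetical, standing in for whatever de-biasing, auditing, and feedback tooling a real deployment would use.

```python
def debias_training_data(records: list[dict]) -> list[dict]:
    # Stage 1: clean, rebalance, or reweight records before ingestion (placeholder logic).
    return [r for r in records if r.get("label") is not None]

def audit_pretrained_model(model, probe_set: list[dict]) -> dict:
    # Stage 2: run bias probes against the pre-trained model and report disparities (placeholder).
    return {"probes_run": len(probe_set), "flagged": 0}

def record_feedback(prediction: dict, human_outcome: dict, log: list[dict]) -> None:
    # Stage 3: capture real-world corrections during deployment so the system
    # can be recalibrated or retrained against them.
    log.append({"prediction": prediction, "outcome": human_outcome})

# Wiring the stages together:
data = debias_training_data([{"label": 1}, {"label": None}])
report = audit_pretrained_model(model=None, probe_set=[{"prompt": "..."}])
corrections: list[dict] = []
record_feedback({"decision": "approve"}, {"decision": "deny"}, corrections)
print(len(data), report, len(corrections))
```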
Using an LLM to grade another model's output is more reliable when the evaluation process is fundamentally different from the task itself. For agentic tasks, the performer uses tools like code interpreters to produce a result, while the grader analyzes only the static output against fixed criteria, which reduces self-preference bias.
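A minimal sketch of that separation: the grader sees only the finished transcript and a rubric, never the performer's tools. `call_llm` is a hypothetical stand-in for whatever model client is actually in use.

```python
RUBRIC = [
    "Did the output satisfy the stated task requirements?",
    "Are all factual claims supported by the provided context?",
    "Is the final answer internally consistent?",
]

def grade_transcript(call_llm, transcript: str) -> str:
    criteria = "\n".join(f"- {c}" for c in RUBRIC)
    prompt = (
        "You are grading a completed task transcript. Do not redo the task.\n"
        f"Evaluate the transcript against these criteria:\n{criteria}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Return a score from 1 to 5 for each criterion with a one-line justification."
    )
    return call_llm(prompt)

# Usage with a dummy client that just echoes a stub response:
print(grade_transcript(lambda p: "3, 4, 5 (stub response)", "agent ran code, returned 42"))
```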