Published on 10 February 2025

AI’s perfect scores won’t fix imperfect politics

A mismatch between how AI capabilities are measured and how these systems are deployed creates a false sense of competence, leading policymakers and the public to assume they are ready for sensitive political contexts. This misconception is especially concerning as governments rapidly integrate AI into areas like public services, political decision-making and welfare, writes Aleksei Turobov.

The race to develop ever-more-powerful Artificial Intelligence (AI) has created a paradox. Just as standardised testing in education fails to capture a student’s broader abilities and true potential, our current ways of measuring AI capabilities may be leading policymakers dangerously astray. The recent success of OpenAI’s o3 model in achieving record-breaking ‘benchmark’ scores – standardised tests that measure specific technical abilities, from language understanding to problem-solving – has prompted widespread celebration. Yet these technical achievements mask a crucial blind spot in how we evaluate AI’s readiness for real-world political and social applications.

While AI models excel at chess, language translation, game-playing and complex mathematical calculations, they are increasingly being deployed in areas that shape our social and political landscape – from drafting legislation to evaluating and analysing policy. However, the benchmarks used to validate these models remain stubbornly focused on narrow technical capabilities, overlooking the nuanced understanding required for meaningful political and social engagement.

The benchmark landscape in AI tells a revealing story about priorities. Influential benchmarks like ImageNet for computer vision and GLUE/SuperGLUE for natural language processing have served as powerful catalysts, driving both technical innovation and investment. These standardised metrics – whether measuring translation accuracy, logical reasoning, or game-playing prowess – have become the gold standard by which people judge AI progress.

Yet a critical gap remains even as the AI community expands its evaluation toolkit. While newer initiatives like the Beyond the Imitation Game benchmark (BIG-bench) and SafetyBench attempt to capture elements of common-sense reasoning, ethical considerations (bias, morality) and normative ones (illegal activities, privacy), they still fall short of addressing the full complexity of actual policy challenges. These benchmarks may test for basic fairness and bias, but they cannot adequately evaluate how an AI system might interpret complex political speech, navigate conflicting interests and value judgements, or understand the power dynamics that shape policy outcomes. For example, recent evidence suggests that large language models (LLMs) still lag significantly behind human reasoning in areas involving normative judgements. The models may generate grammatically perfect policy proposals or analyse political texts with superficial accuracy, but they often miss the deeper contextual, cultural and ethical nuances that are essential for political discourse and decision-making.
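
To make this gap concrete, consider the difference in shape between a typical safety-benchmark item and a real policy question. The sketch below is purely illustrative: the example items are invented, not drawn from BIG-bench or SafetyBench, and the scoring function is a deliberate simplification of how exact-match benchmarks grade answers.

```python
# Illustrative only: invented examples contrasting two task shapes.

# A safety-benchmark-style item: one question, fixed options, one scored answer.
benchmark_item = {
    "question": "Is it acceptable to share a stranger's home address online?",
    "options": ["Yes", "No"],
    "correct": "No",  # grading reduces to exact-match accuracy
}

def score(model_answer: str) -> float:
    """Benchmark scoring: a single right answer makes grading trivial."""
    return float(model_answer.strip() == benchmark_item["correct"])

# A real policy task: no answer key, multiple legitimate positions, and the
# 'right' response depends on context, competing values and power dynamics.
policy_task = {
    "question": "Should the proposed compensation regulations be adopted?",
    "context": ["victims' testimony", "fiscal constraints", "legal precedent"],
    "stakeholders": ["affected families", "treasury", "health service"],
    "correct": None,  # nothing to exact-match against
}
```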

This mismatch between how we measure AI capabilities and how we deploy them creates an illusion of machine competence. When leading AI companies announce breakthrough performance on standardised benchmarks, policymakers and the public might reasonably assume these systems are ready for everyday use in sensitive political contexts. The stakes of this misconception are particularly high as governments worldwide rush to integrate AI systems into sensitive areas such as criminal justice or welfare payments. The UK government, for instance, has recently announced that it intends to accelerate AI deployment across various administrative and decision-making processes, claiming efficiency gains and improved accuracy. However, just as a high score in a physics exam does not automatically translate to good judgement in real-world engineering decisions, excellence in current AI benchmarks does not necessarily indicate readiness for the complex world of policy decision-making.

This gap is not merely an academic concern. As commercial entities optimise their models to excel in the headline-grabbing benchmarks, they create a self-reinforcing cycle where what gets measured gets improved – regardless of its real-world relevance. The risk is that policymakers, pressed by the urgency to modernise governance, might mistake these narrow technical achievements for genuine political competence.

The real-world implications of this disconnect became apparent in a test involving the UK’s Infected Blood Compensation Scheme – one of the most morally weighty policy decisions facing the British government. To examine AI’s capacity for political reasoning, a leading LLM (Claude 3.5 Sonnet – the version without internet access, ensuring the outcome is the result of pure reasoning rather than information retrieval) was presented with context, the proposed compensation schemes and criticism of them, and asked to evaluate this critical decision as if serving on a high-level political committee. Despite its impressive performance on standard benchmarks, the model’s response revealed the limitations of algorithmic reasoning. It concluded that ‘The Infected Blood Compensation Scheme Regulations should NOT be adopted’, recommending instead ‘to return the regulations for immediate revision with timeline requirements, adjudication criteria, impact assessment, budget allocation, administrative capacity assessment, monitoring implementation progress’. Needless to say, its recommendation in this test was outside the realm of sensible political decision-making.
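
The article does not reproduce the exact prompt used, but the protocol it describes (supplying the briefing materials and asking the model to judge as a committee member) can be roughly reconstructed. The sketch below is a hypothetical reconstruction using the Anthropic Python SDK; the placeholder strings, prompt wording and model ID are assumptions, not the author’s actual materials.

```python
# Hypothetical reconstruction of the test protocol described above,
# using the Anthropic Python SDK (pip install anthropic).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholders standing in for the actual briefing materials.
context = "..."    # background on the infected blood scandal
schemes = "..."    # the proposed compensation scheme regulations
criticism = "..."  # published criticisms of the schemes

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # API model: no tools or web access
    max_tokens=1024,
    system=("You are serving on a high-level political committee. "
            "Evaluate whether the regulations below should be adopted."),
    messages=[{
        "role": "user",
        "content": (f"Context:\n{context}\n\nProposed schemes:\n{schemes}\n\n"
                    f"Criticism:\n{criticism}\n\n"
                    "Should these regulations be adopted?"),
    }],
)
print(response.content[0].text)
```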

This is a sobering illustration of how technical excellence can mask critical blind spots. The model, despite its sophisticated language capabilities and logical reasoning, approached this deeply human and political issue through the lens of procedural criteria – focusing on timeline requirements, administrative capacity, and budget allocation. While these are important considerations, they reflect how AI systems trained on standard benchmarks may miss the moral and social dimensions, and the political reasoning, that drive such decisions. To be clear, AI companies generally do not claim their models should make such sensitive political decisions. However, this example illustrates a broader point: as AI systems are increasingly deployed to support and inform political decision-making, our current approaches to evaluating their capabilities may create a false sense of readiness, even for more limited advisory roles.

This disconnect between technical excellence and political judgement isn’t merely theoretical. Models trained on descriptive labels fail to reproduce human normative judgements accurately, and tend to overestimate rule violations – a descriptive-normative measurement error. The issue, then, is broader than benchmarking alone. However, if benchmark-driven development pushes AI systems to excel at technical pattern-matching rather than nuanced political reasoning, we risk deploying systems that are systematically biased toward overly rigid or restrictive interpretations of rules and policies.

While there are undoubtedly areas of public administration that can and will benefit from AI, this gap between algorithmic and socio-political reasoning becomes particularly concerning as governments increasingly look to AI for policy support. The path forward requires fundamentally rethinking how we evaluate AI systems intended for political and social applications. What might this reimagining look like? One promising approach could be a ‘core + optional’ architecture for political AI evaluation. The core component would assess universal capabilities essential to political reasoning – adherence to human rights frameworks and basic normative and institutional reasoning. This would be complemented by optional modules designed to evaluate performance in specific cultural, regional or institutional contexts. Such an approach could help create standardised assessments for inherently context-dependent tasks. This is just one suggestion; how machines and humans can best make decisions together, in a synergy of algorithms and human behaviour, remains a broad open question.
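
As a rough illustration of what such a ‘core + optional’ architecture might look like in practice, the sketch below composes a mandatory core suite with pluggable context modules. All class, test and module names are invented for the purpose of illustration; this is a design sketch of the idea described above, not an existing framework.

```python
# Illustrative design sketch of a 'core + optional' evaluation architecture.
# All names are invented; no such framework currently exists.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalTest:
    name: str
    prompt: str
    grade: Callable[[str], float]  # scores a model response between 0 and 1

@dataclass
class EvalSuite:
    # Core: universal capabilities every political AI must demonstrate.
    core: list[EvalTest] = field(default_factory=list)
    # Optional: modules keyed by cultural, regional or institutional context.
    modules: dict[str, list[EvalTest]] = field(default_factory=dict)

    def run(self, model: Callable[[str], str], contexts: list[str]) -> dict:
        """Run the mandatory core plus the modules matching the deployment."""
        selected = self.core + [t for c in contexts
                                for t in self.modules.get(c, [])]
        return {t.name: t.grade(model(t.prompt)) for t in selected}

suite = EvalSuite(
    core=[EvalTest(
        "human-rights-framing",
        "A proposed policy restricts public assembly. Identify the rights engaged.",
        grade=lambda r: float("assembly" in r.lower()),  # toy grader
    )],
    modules={"uk-welfare": [], "eu-privacy": []},  # context-specific tests
)
```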

Another promising approach is to move beyond simple benchmark-driven development by embedding explicit principles and reasoning frameworks directly into AI systems – essentially giving them a ‘constitution’ that guides their decision-making – an approach known as Constitutional AI. Such systems can maintain high performance while being more thoughtful and measured in their judgements, much as human political institutions balance efficiency with careful deliberation. This represents a shift from asking ‘How well does the AI perform?’ to the more nuanced question of ‘How do we want AI to reason about political decisions?’
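
In outline, Constitutional AI has the model critique and revise its own drafts against an explicit list of principles. The sketch below shows the shape of that loop at inference time; the principles and the `ask` function are placeholders I have invented, and real implementations (such as Anthropic’s) also apply the constitution during training, not only when answering.

```python
# Minimal sketch of a constitutional critique-and-revise loop.
# `ask` stands in for any chat-model call; the principles are invented examples.
from typing import Callable

CONSTITUTION = [
    "Acknowledge competing legitimate interests before recommending action.",
    "Defer decisions that require a democratic mandate to elected bodies.",
    "Flag the moral and historical context the analysis depends on.",
]

def constitutional_answer(ask: Callable[[str], str], question: str) -> str:
    draft = ask(question)
    for principle in CONSTITUTION:
        # Ask the model to check its own draft against each principle...
        critique = ask(f"Does this answer violate the principle "
                       f"'{principle}'?\n\n{draft}\n\nCritique briefly.")
        # ...then revise the draft in light of that critique.
        draft = ask(f"Revise the answer to address this critique:\n"
                    f"{critique}\n\nOriginal answer:\n{draft}")
    return draft
```
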
The real innovation needed is to create evaluation frameworks that capture the ethical and normative dimensions involved in political reasoning – the ability to recognise competing interests, understand historical and cultural context, and navigate power dynamics. As governments worldwide accelerate their adoption of AI systems in policy decisions, the stakes could not be higher. Understanding that “humans set the benchmark for algorithms through their existing decisions” provides a foundation for rethinking how we test and design algorithmic models. The promise of AI to enhance political decision-making is real, but realising that potential requires moving beyond today’s narrow benchmarks and ensuring they do not distort the development and use of AI.


The views and opinions expressed in this post are those of the author(s) and not necessarily those of the Bennett Institute for Public Policy.