
AI agents are getting more capable, but reliability is lagging—and that’s a problem

Source: FortuneView Original
Business | March 25, 2026

Hello and welcome to Eye on AI. In this edition…AI’s reliability problem…Trump sends an AI legislation blueprint to Congress…OpenAI consolidates products into a super app and hires up…AI agents that can improve how they improve…and does your AI model experience emotional distress?


Like many of you, I’ve started playing around with AI agents. I often use them for research, where they work pretty well and save me substantial amounts of time. But so-called “deep research” agents have been available for over a year now, which makes them a relatively mature product in the AI world. I’ve also started trying the new crop of computer-using agents for other tasks. And here, my experience so far is that these agents are highly inconsistent.

For instance, Perplexity’s Computer, which is an agentic harness that works in a virtual machine with access to lots of tools, did a great job booking me a drop-off slot at my local recycling center. (It used Anthropic’s Claude Sonnet 4.6 as the underlying reasoning engine.) But when I asked it to investigate flight options for an upcoming business trip, it failed to complete the task—even though travel booking is one of those canonical use cases the AI companies are always talking about. What the agent did manage to do was eat up a lot of tokens over the course of 45 minutes of trying.

Last week, at an AI agent demo event Anthropic hosted for government and tech policy folks in London, I watched Claude Cowork initially struggle to run a fairly simple data-sorting exercise in an Excel spreadsheet, even as it later created a sophisticated budget forecasting model with seemingly no problems. I also watched Claude Code spin up a simple, text-based business strategy game I asked it to create that looked great on the surface, but whose underlying game logic didn’t make any sense.

Assessing AI agents’ reliability

Unreliability is a major drawback of current AI agents. It’s a point that Princeton University’s Sayash Kapoor and Arvind Narayanan, who cowrote the book AI Snake Oil and now cowrite the “AI as Normal Technology” blog, frequently make. And a few weeks ago they published a research paper, co-authored with four other computer scientists, that tries to think systematically about AI agent reliability and to benchmark leading AI models.

The paper, titled “Towards a Science of AI Agent Reliability,” notes that most AI models are benchmarked on their average accuracy on tasks, a metric that allows for wildly unreliable performance. Instead, the authors assess reliability across four dimensions: consistency (if asked to perform the same task in the same way, do they always perform the same?); robustness (can they function even when conditions aren’t ideal?); calibration (do they give users an accurate sense of their certainty?); and safety (when they do mess up, how catastrophic are those mistakes likely to be?).
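The consistency dimension, at least as described above, lends itself to a simple measurement: run the identical task several times and check how often the outcomes agree. Here is a minimal sketch of that idea in Python; the function name and scoring rule are my own illustration, not the paper’s actual metric.

```python
from collections import Counter

def consistency(outcomes):
    """Fraction of repeated runs that agree with the most common outcome.

    `outcomes` is a list of results from running the *same* task the
    same way several times: 1.0 means the agent behaved identically on
    every run, lower values mean run-to-run variance.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes)

# Ten runs of one task: seven succeed, three fail in different ways.
runs = ["ok"] * 7 + ["wrong_file", "timeout", "wrong_file"]
print(consistency(runs))  # 0.7
```

An agent could ace a benchmark on average and still score poorly here, which is exactly the gap between accuracy and reliability the paper is probing.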

They further broke these four areas into 14 specific metrics and tested a number of models released in the 18 months prior to late November 2025 (so OpenAI’s GPT-5.2, Anthropic’s Claude Opus 4.5, and Google’s Gemini 3 Pro were the most advanced models tested). They tested the models on two different benchmark tests, one of which is a general benchmark for agentic tasks while the other simulates customer-support queries and tasks. They found that while reliability improved with each successive model release, it did not improve nearly as much as average accuracy figures. In fact, on the general agentic benchmark the rate of improvement in reliability was half that of accuracy, while on the customer service benchmark it was one-seventh!
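One way to see how average accuracy and reliability can diverge this sharply: two hypothetical agents with identical average accuracy can have wildly different worst-case behavior. A toy illustration with made-up numbers (nothing below comes from the paper’s benchmarks):

```python
import random

random.seed(0)

TASKS = range(100)
TRIALS = 50

# Agent A: succeeds on any given attempt with probability 0.8.
def flaky_agent(task):
    return random.random() < 0.8

# Agent B: deterministically solves 80 of the 100 tasks, never the rest.
def split_agent(task):
    return task % 10 < 8

def average_accuracy(agent):
    return sum(agent(t) for t in TASKS for _ in range(TRIALS)) / (len(TASKS) * TRIALS)

def tasks_that_always_pass(agent):
    # Reliability view: on what fraction of tasks does the agent
    # succeed on every one of the repeated attempts?
    return sum(all(agent(t) for _ in range(TRIALS)) for t in TASKS) / len(TASKS)

# Both agents score about 80% on average accuracy, but the flaky agent
# reliably completes almost no task (0.8 ** 50 is roughly 1 in 70,000),
# while the split agent reliably completes 80% of them.
print(tasks_that_always_pass(split_agent))  # 0.8
```

Average-accuracy leaderboards treat these two agents as equals; a reliability lens does not, which is the paper’s core argument in miniature.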

Reliability metrics depend on the task at hand

Across the four areas of reliability the paper examined, Claude Opus 4.5 and Gemini 3 Pro scored the best, both with an overall reliability of 85%. But if you look at the 14 sub-metrics, there was still plenty of reason for concern. Gemini 3 Pro, for example, was poor at judging when its answers were likely accurate, scoring just 52%, and terrible at avoiding potentially catastrophic mistakes, at just 25%. Claude Opus 4.5 was the most consistent in its outcomes, but its consistency score was still only 73%. (I would urge you to check out and play around with the dashboard the researchers created to show the results across all the different metrics.)

Kapoor, Narayanan, and their co-authors are also sophisticated enough to know that reliability is not a one-size-fits-all metric. They note that if AI is being used to augment humans, as opposed to fully automating tasks, it might be OK for the AI to be less consistent and robust, since the human can act as a backstop. But “for automation, reliability is a hard prerequisite for deployment: an agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system,” they write. They also note that different kinds of consistency matter in different settings. “Trajectory consistency
