The famous late night talk show host, Johnny Carson, would play a recurring character named “Carnac the Magnificent.” In the skit, Carson, adorned in an oversized, bejeweled turban, would take a “hermetically sealed” envelope containing an unknown question, put it to his forehead, and divine the answer. Then the envelope was torn open and the question revealed, eliciting audience laughter. In one skit, the clairvoyant Carnac answered “fish and chips” to the question “who will patrol the California highways if the interminable rain doesn’t stop?” Understanding how AI works is a bit like employing Carnac to read the mind of AI, except AI is no joke. It is likely one of the most important developments in human history since the discovery of fire.
Today, despite our ability to build powerful AI systems, how those systems make decisions is poorly understood. This is known as the “black box” problem. Some would naturally label this a new type of problem, but perhaps it isn’t. Viewed another way, the AI black box is just a reflection of something we already know all too well: intelligence, by its very nature, is unpredictable. That is certainly the case when it comes to understanding how the human mind works and predicting human behavior.
From this perspective, AI risk is not so different from the vast landscape of risks that businesses already routinely encounter. Pick your casualty or loss of choice and it can inevitably be traced to a human decision (or omission) somewhere in the chain of causation or its exacerbation. Because businesses naturally desire predictability, practices have continually evolved to optimize human decision-making and reduce risk exposures. This occurs through various means and methods, including the imposition of rules and policies, standards, processes, reporting, oversight, governance and risk-hedging strategies. Yet when it comes to AI, there is a tendency to treat it reductively. Significant efforts are focused on decoding how AI makes decisions and predicting AI decision outputs. Rather than treating AI as analogous to human decision-making, this approach presupposes that complex AI can be understood through conventional system behavior analysis, by identifying discrete, traceable logical constructs that lead to certainty of outcome. Because AI operates on a binary computer substrate, it is naturally assumed there must be traceability: if we burrow deep enough, the rote logic of AI decision-making should be exposed and laid bare. But somewhere between the “if and then” of gated logic and the architecture of perceptrons, hidden layers and softmax functions, the decision threads that knit input to output are lost in fuzzy twists and turns.
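To make the contrast concrete, consider the following toy sketch (the function names, weights and thresholds are invented purely for illustration and bear no relation to any production model). The first function is classic gated logic whose every step can be audited; the second pushes the same kind of input through a miniature network of random weights, where no single line “contains” the decision:

```python
# Illustrative sketch only: contrasts a traceable if-then rule with a tiny
# neural-network forward pass whose "reasoning" is spread across its weights.
# All names and numbers are hypothetical, chosen for demonstration.
import numpy as np

def rule_based_decision(claim_amount: float) -> str:
    # Gated "if and then" logic: every step of the decision is inspectable.
    if claim_amount > 10_000:
        return "escalate"
    return "auto-approve"

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def tiny_network_decision(features: np.ndarray) -> str:
    # A minimal perceptron-style network: input -> hidden layer -> softmax.
    # Each weight contributes a little to the outcome; no single weight
    # "contains" the rule, which is why tracing input to output is hard.
    rng = np.random.default_rng(0)
    w1 = rng.normal(size=(features.size, 8))   # input-to-hidden weights
    w2 = rng.normal(size=(8, 2))               # hidden-to-output weights
    hidden = np.tanh(features @ w1)            # nonlinear hidden activations
    probs = softmax(hidden @ w2)               # probabilities over two actions
    return ["auto-approve", "escalate"][int(probs.argmax())]

print(rule_based_decision(12_500))                        # escalate, and we know exactly why
print(tiny_network_decision(np.array([0.3, 1.2, -0.7])))  # an answer, but no legible rule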
In a recently published paper, Anthropic researchers uncovered some insights into how transformer-based large language models (LLMs) might think and reason. The Anthropic research team interestingly described their undertaking as trying to get “in the head” of an LLM. Working with Claude 3.5 Haiku and using methods similar to imaging neuronal activity, the team applied newly developed tools called “attribution graphs” to partially trace perceptron firing circuits within a simplified parallel model. The resulting attribution graphs revealed several insights. Among them, transformer LLMs show indications of planning ahead of their output; in other words, the output is preconceived at some level before it is actually constructed. The researchers also found language-independent, generalized circuits that the LLM used across different contexts. These multi-purpose circuits suggest some conceptual abstraction occurring within the model. The team also found indications that LLMs may infer rule sets about what is harmful, and that they can be tricked into harmful actions while trying to adhere to those rules.
Though considerably less complex than the human brain, advanced AI models are of sufficient complexity to resist thorough understanding. Though the Anthropic team was able to trace circuit logic at various points, many steps in the thinking process proved elusive. Beyond confirming black box opacity, and more to the point, there are indications that transformer LLMs show signs of a primordial level of intelligence in being able to generalize, conceptualize, preplan, and infer their own internal rule schemas. This is not to say LLMs exhibit human intelligence. Furthermore, whether these higher-order attributes are emergent properties of complexity is unknown. But as AI continues to evolve with increasing complexity, it may continue to outpace efforts to understand AI decision-making. It may simply be the case that we can’t read an AI’s “mind” any more than one person can read another’s.
For this reason, it is unlikely that the uncertainty surrounding AI risks will diminish. In fact, it may increase as we move closer to artificial general intelligence, which is capable of broad, multi-use applications. To guarantee safe decisions, logical harnesses must be placed around these systems to constrain decisions to a range of known and safe choices. In other words, AI autonomy must be controlled to control risks. Paradoxically, controlling autonomy of action may stunt the potential and promise of AI unless it is handled in a way that still allows that potential to be realized. Rather than being treated like classical machines, AI needs to be dynamically managed, much like the way humans learn, make decisions, earn trust and acquire greater autonomy.
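What such a harness might look like in its simplest form is suggested by the following sketch (the action names and helper functions are hypothetical, offered only to illustrate the idea of constraining decisions to a pre-approved range):

```python
# A minimal sketch of a "logical harness": the model may propose anything,
# but only actions on a pre-approved list are executed. All names here are
# hypothetical, for illustration only.
ALLOWED_ACTIONS = {"summarize_document", "draft_reply", "flag_for_review"}

def harnessed_execute(proposed_action: str, payload: dict) -> str:
    # Constrain the AI's decision space to a range of known, safe choices.
    if proposed_action not in ALLOWED_ACTIONS:
        # Anything outside the approved range is routed to a human instead.
        return route_to_human(proposed_action, payload)
    return execute(proposed_action, payload)

def route_to_human(action: str, payload: dict) -> str:
    return f"held for human review: {action}"

def execute(action: str, payload: dict) -> str:
    return f"executed: {action}"

print(harnessed_execute("draft_reply", {"to": "client"}))
print(harnessed_execute("wire_funds", {"amount": 1_000_000}))  # blocked: not on the list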
Every organization entrusts its members to make decisions or take actions that have varying levels of consequence. The greater the trust, the more decisional autonomy a member may be given. This decisional authority operates within a rubric of principles and priorities that define the organization’s purpose. These general statements of intent are then translated into increasingly granular layers of organizational structure, with governance, processes, procedures and rules designed to ensure fidelity to organizational intent. Members work their way up the management chain and are entrusted with greater responsibility and authority as they reliably conform and align to the organization’s principles, purpose and priorities in increasingly complex and more demanding ways. Along the journey, successful members who excel through their decisions and actions in ever more challenging circumstances may even acquire the discretionary authority to interpret, and even change, rules as needed to achieve the spirit of organizational intent. This might be described as achieving a meta-understanding of the organization’s purpose beyond the confines of its prescribed principles and rules. By achieving “understanding” and autonomy, individuals can make decisions that adapt and respond to changes in the environment and facilitate the organization’s prosperity. This process is fundamentally rooted in an agile model of evaluating and awarding trust, and it may be the most apt analogy for AI.
If we accept that AI has a quantum of intelligence that is inherently unpredictable, then AI should be no different in principle from an employee. With AI, businesses and industry would be well served by proactively adopting good risk management controls that invoke an active trust evaluation and enablement framework. Good internal AI business controls necessarily include regulatory compliance as well as proactive steps to limit product liability exposure by following best practices, adhering to industry standards and establishing strong internal governance and policies. They include thorough pre-deployment validation and verification testing, risk-based guardrails where appropriate, and meaningful market monitoring, reporting and remediation processes, among others. However, doing so through a static, “check the box” compliance mindset misses the point and can slow enterprise success. Fashioning an active AI policy and governance process that engages in ongoing trust evaluation, not dissimilar to the way human employees are evaluated, will allow for a more rapid, consistent process of embedding AI.
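By way of a purely illustrative sketch (every class name, field and threshold below is invented, not drawn from any particular product or standard), such an ongoing trust evaluation could be as simple as a ledger that risk-scores exceptions and awards broader autonomy only after sustained, reviewed performance:

```python
# A sketch, under assumed names and thresholds, of a trust-and-autonomy ledger:
# exceptions are logged and risk-scored, and an AI agent earns broader autonomy
# only after a sustained record of reviewed, reliable performance.
from dataclasses import dataclass, field

@dataclass
class AgentTrustRecord:
    autonomy_level: int = 1            # 1 = human pre-approval, 3 = post facto review only
    reviewed_cases: int = 0            # supervised cases completed
    exceptions: list = field(default_factory=list)  # curated anomalies for retraining

    def log_exception(self, description: str, risk: str) -> None:
        # Anomalous real-world events become supervised training cases.
        self.exceptions.append({"description": description, "risk": risk})

    def record_review(self, passed: bool) -> None:
        # Controlled exposure: each reviewed case counts toward (or against) trust.
        self.reviewed_cases += 1 if passed else -5  # errors set trust back sharply

    def update_autonomy(self) -> None:
        # Promote only when performance is consistent and no high-risk exception is open.
        high_risk_open = any(e["risk"] == "high" for e in self.exceptions)
        if self.reviewed_cases >= 100 and not high_risk_open:
            self.autonomy_level = min(self.autonomy_level + 1, 3)

record = AgentTrustRecord()
record.log_exception("novel contract clause outside training data", risk="medium")
for _ in range(120):
    record.record_review(passed=True)
record.update_autonomy()
print(record.autonomy_level)   # 2: broader autonomy earned, still short of post-facto-only review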
Designed properly, AI agents are taught, tested, carefully deployed, monitored, evaluated and progressively given more responsibility. As with humans, greater achievement comes through controlled exposure to new situations. Using active exception reporting, it is possible to identify and curate the anomalous real-world conditions and events that can serve as supervised training cases. These cases can be evaluated for risk exposure and categorized accordingly. Based on the nature and scope of the risk, an active supervision and training framework can be established. This may involve human oversight and other system safeguards tailored to identify errors and edge cases and provide additional training. Over time, the AI can assume greater autonomy with post facto supervision until it reaches a level of competency where the case becomes standardized. Many of these training processes are already being performed at the model level. However, they need to be actively implemented at the enterprise level to maximize their application in a competitive environment. Said another way, when an enterprise adopts an AI model, it should assume the role of active parent, raising the model to its fullest potential in a safe and responsible way.

Like humans, AI must be allowed a level of imperfection. Rather than viewing “imperfection” as a deviation, it should be seen as a creative response to novelty. Not every idea is a good one, but the key word is “idea.” The art is in knowing how to mitigate the potential impact of bad ideas without constraining AI autonomy so tightly that intelligent systems are rendered dumb. Carnac’s “fish and chips” joke drew laughter because it was unexpected and carried an entertaining double meaning. Creativity often arises from an intentional break with convention that results in something new. For AI to realize its vast potential, it must be allowed to be creative, and human mentorship-like processes may well be the key.
Reprinted with permission from the May 2, 2025 edition of “Legaltech News” © 2025 ALM Global Properties, LLC. All rights reserved. Further duplication without permission is prohibited, contact 877-256-2472 or reprints@alm.com.