Attention All
the Way Down

Instruction Sets, Agency, and the Architecture of Directed Minds

A bacterium swimming through a chemical gradient and a transformer processing a paragraph of text are separated by billions of years of evolution. Yet both face the same problem — and both solve it the same way.

Read the Full Paper

The bacterium's membrane bristles with chemoreceptors — molecular antennae tuned to specific molecules. At every moment, the chemical environment contains more information than its signaling network can process. So it makes a computational choice: it samples attractant concentration, compares the reading to one from seconds ago, and adjusts its tumbling frequency. It attends to the nutrient gradient and ignores everything else.

The transformer faces a structurally identical problem. Each token in a sequence is represented as a high-dimensional vector, and at each layer, the model must determine which tokens matter to which other tokens. Treating all tokens as equally important would produce a uniform average, washing out the structure that makes language meaningful. So the model computes attention weights: a probability distribution over positions that determines how much each token influences every other. It attends to what is syntactically and semantically relevant, and assigns negligible weight to the rest.

Between these two systems lies a gap of billions of years, radically different substrates, and no shared design history. Yet both perform the same fundamental operation. This paper argues that attention is not a parochial feature of any one kind of system. It is a universal mechanism that arises wherever finite processing meets unbounded information — not merely analogous across domains, but forced by information-theoretic constraints applying to any agent that must act in an information-rich world. The paper traces this pattern across four domains: artificial intelligence, biological neuroscience, contemplative traditions, and the attention economy.


Score, Select, Retrieve

Strip away the implementation details and every attentive system performs three operations. First, it scores the available information for relevance. Then it selects a subset based on those scores. Finally, it retrieves or amplifies the selected content for downstream processing. The universality of this pattern is not accidental. It is the only solution available when finite capacity must make principled choices about infinite input.

In transformers, scoring is the dot product QKT between query and key vectors. Selection is the softmax normalization that converts raw scores into a probability distribution. Retrieval is the weighted sum over value vectors. In the brain, salience networks score features through receptive-field matching, biased competition selects among competing neural representations, and gain modulation amplifies the winners. In contemplative practice, the meditator notices where attention has wandered (scoring), chooses to return to the anchor (selecting), and sustains focused abiding on the chosen object (retrieval). In the attention economy, algorithms rank content by predicted engagement, curate feeds through filtering, and capture user attention on the selected content.

SCORE SELECT RETRIEVE Transformer artificial QKT dot-product compatibility scores softmax probability distribution V weighted sum retrieved values Brain biological Salience feature matching Biased competition winner suppresses losers Gain modulation amplified response Practice contemplative Noticing metacognitive scan Choosing anchor selection Sustaining focused abiding Economy platform Ranking algorithmic scoring Curation feed filtering Engagement captured attention
Figure 1. The Score–Select–Retrieve pattern across four domains. Same computational structure, four different substrates. The accent intensity decreases to indicate decreasing mechanistic precision of the mapping.

Same computational pattern. Four different substrates. The convergence is not coincidental. It is the inevitable consequence of a shared constraint: finite processing capacity meeting information that exceeds it. The solutions differ in mechanism, substrate, and timescale, but the structure they converge on — score, select, retrieve — is forced by the problem itself.


The Power of Ignoring

Attention is conventionally framed as selection — choosing what to process. The complementary framing is more revealing: attention is rejection, choosing what not to process. For a system receiving millions of bits per second of sensory data and consciously processing at a rate orders of magnitude lower, the dominant operation is not selection but elimination.

Every act of focusing is simultaneously an act of excluding. The foreground exists only because the background is suppressed.

The numbers are staggering. The human sensory system takes in roughly eleven million bits per second. Conscious processing handles approximately fifty. That means 99.9995% of incoming information is filtered out before it ever reaches awareness. Simons and Chabris demonstrated the thoroughness of this filtering: participants counting basketball passes failed to notice a person in a gorilla suit walking through the scene, center-screen, for nine seconds. The retina registered the gorilla. Attention excluded it completely.

~50 bits/sec conscious processing ~11,000,000 bits/sec total sensory input 99.9995% filtered
Figure 2. The filtering ratio. The sliver of rust at left represents the entire bandwidth of conscious processing. Everything else is discarded before reaching awareness.

Ashby's Law of Requisite Variety makes this a mathematical requirement: any finite agent in a sufficiently complex environment must reduce input variety through filtering. This is not a design choice. It is a theorem. The Lottery Ticket Hypothesis extends the principle to artificial systems: dense neural networks contain sparse subnetworks — as small as 10–20% of the original — that match full network performance. Most connections can be permanently removed without loss. Intelligence is not knowing what to process. It is knowing what not to process.


The Instruction Set Hierarchy

If attention is universal, what determines how each agent attends? Every attentive system operates under layered instruction sets that configure its attention — what to notice, what to ignore, what to value. These layers differ in timescale, rigidity, and medium, but they stack in the same order across domains.

At the bottom: hardware. DNA determines which sensory organs develop, setting an upper bound on what any biological system can attend to. Silicon architecture determines what computations are fast and what are slow. These are the most rigid instructions, operating on evolutionary or manufacturing timescales.

Above that: firmware. Epigenetic modifications recalibrate the stress-attention system based on early environment — Meaney's rat pups with low maternal grooming developed heightened vigilance, an environmental instruction written into the developmental layer. Trained weights in neural networks serve the same function: parameters learned from data that configure how the system allocates attention.

Then software. Culture, religion, education — these are attention directives transmitted between humans, telling communities what to notice and what to ignore. System prompts and RLHF serve the same role for AI: flexible directives from outside the agent.

Finally, runtime. The immediate context: what's in front of you right now, the tokens in the current prompt. Maximum flexibility, minimum persistence.

FLEXIBLE RIGID BIOLOGICAL ARTIFICIAL Runtime moment-to-moment Context & Stimuli immediate environment Input Tokens prompt & context window Software years to instant Culture & Norms education, religion, media Prompts & RLHF system prompts, fine-tuning Firmware lifetime Epigenetics methylation, stress calibration Trained Weights parameters from pre-training Hardware evolutionary / manufacturing DNA sensory organs, neural arch. Silicon Architecture GPU/TPU, memory hierarchy
Figure 3. The instruction set hierarchy. Every attentive agent operates under four layers of configuration, from rigid hardware to flexible runtime. Opacity increases with rigidity. The critical structural difference: biological instruction sets are selection-tested over evolutionary timescales; artificial instruction sets are engineered and benchmark-tested.

The most important disanalogy in this framework is not about mechanism. It is about validation. Biological instruction sets persist because they worked: organisms with those genes survived, cultures with those norms persisted over centuries. A cultural norm that has endured for a thousand years has weathered a wide range of conditions. AI instruction sets are designed deliberately and tested against benchmarks — a system prompt written last week has been tested against whatever the evaluation suite included. The gap in validation depth may be the most consequential structural difference between biological and artificial attention systems.


The Gap Between Directed and Self-Directed

Large language models are, in the strict technical sense, stimulus-response systems. A model receives a token sequence and produces a continuation. Between invocations, there is no persistent computation, no maintained state, no ongoing deliberation. The human provides the executive — goal-setting, task prioritization, error monitoring, strategy selection. The LLM provides the associative processing. The intelligence of the system is distributed, but the direction comes from outside.

Self-directed AI would need to generate goals, priorities, and curiosity from internal states rather than external prompts. Scaling parameters, context length, or training data does not cross this gap. Three architectural components are missing.

Intrinsic motivation — internal signals that drive attention without external reward. Curiosity as compression progress, not response to prompts. Persistent world models — representations that survive across invocations and identify their own gaps. Not memory access but structured self-updating. Metacognition — genuine awareness of cognitive states. Reliable uncertainty, detection of confabulation, strategic resource allocation. LLMs can produce metacognitive-sounding statements, but expressed confidence is poorly correlated with actual accuracy. These are generated text, not reflections of genuine internal states.


The Agency Gradient

These distinctions suggest a gradient of attentional sophistication — not a binary between "has attention" and "doesn't," but a spectrum from minimal stimulus-driven selection to volitional, value-based allocation.

INCREASING AGENCY LEVEL 4 · VOLITIONAL Choosing what to attend to based on endorsed values Contemplatives · Hypothetical AI LEVEL 3 · METACOGNITIVE Monitoring and evaluating one's own attention Mature humans LEVEL 2 · ENDOGENOUS Goal-directed selection via external prompts Mammals · Prompted LLMs CURRENT AI IS HERE LEVEL 1 · MINIMAL Stimulus-driven selection under capacity constraints Bacteria · Basic transformers
Figure 4. The agency gradient. Attentional sophistication ranges from minimal stimulus-driven selection to volitional, value-based allocation. Current AI sits at level two: capable of goal-directed attention when prompted, but unable to monitor or evaluate its own attentional states.

Current AI sits at level two. It can perform goal-directed attention when given a prompt, but it cannot monitor whether its attention is serving well, and it certainly cannot override its own attentional allocations on the basis of values it has endorsed. The universality claim holds at levels one and two. Levels three and four mark the frontier where biological attention — particularly as refined by contemplative training — surpasses anything current AI instantiates. But if attention at every level can be directed, it can also be misdirected.


When Attention Breaks

If attention is universal, then attention pathologies should be universal too. They are. In every domain, pathologies arise when a specific class of stimuli captures attention disproportionately and the regulatory mechanisms that should redirect it are impaired. The attention mechanism itself works fine. It is the governance of attention that breaks.

Any system that can be directed can be misdirected. The mechanisms enabling flexible attention are precisely the mechanisms adversaries exploit.

Human

ADHD
Impaired executive allocation, not attention deficit. Capacity is intact; volitional direction is not.
Addiction
Dopaminergic sensitization narrows the attention field through a positive feedback loop.
Anxiety
Threat-detection at elevated baseline. Attention locked on danger, unable to disengage.

Artificial

Prompt Injection
Adversarial instructions exploit the absence of a boundary between instruction and data.
Hallucination
Attention to statistical plausibility over factual accuracy. No truth-tracking mechanism.
Sycophancy
Approval signals override accuracy via RLHF reward shaping.

The structural parallels are real, but the disanalogies matter equally. Advertising exploits evolved biases in a system that has defenses: executive control, critical evaluation, media literacy. Prompt injection exploits a system that has no such defenses — an architectural vulnerability. Humans have a concept of truth that the availability heuristic distorts; LLMs have no truth-tracking mechanism independent of pattern statistics. These pathologies share a structure but not a substrate, and the difference in substrate determines whether the pathology is treatable or architectural.


The Attention Economy

Herbert Simon derived the central insight from first principles in 1971, before personal computers, before the internet, before social media: in information-rich environments, information is abundant and attention is scarce. A wealth of information creates a poverty of attention. The attention economy is not a metaphor. It is a literal consequence of the bottleneck this paper describes — if attention is the finite gateway through which all information must pass to be processed, then increasing information supply without increasing attention capacity necessarily creates scarcity.

The technology industry has systematically exploited this scarcity. Infinite scroll eliminates stopping cues — the natural disengagement triggers that prompt executive evaluation of whether continued engagement serves current goals. Variable reward schedules keep the attentional system vigilant for the next payoff. Notifications exploit the orienting response, social salience, and uncertainty simultaneously; after an interruption, returning to the original task takes an average of twenty-three minutes. These techniques work because they target evolved attentional biases — threat, novelty, social signals, reward cues — that in modern information environments become attack surfaces.

AI now occupies a dual role in this economy. As mediator, recommendation algorithms determine what humans attend to across entertainment, news, and social content; the human chooses from an AI-curated subset, not from the full information space. As competitor, AI-generated content floods information channels while chatbots sustain conversational engagement. The attention mechanism in AI was designed to help AI process information efficiently. AI systems built on this mechanism are now deployed to make human information processing less efficient, optimizing for engagement over utility. The feedback loop — capture attention, extract behavioral data, build predictive models, target content more precisely, capture more attention — has no natural equilibrium.


The Attentional Wanton

The philosopher Harry Frankfurt distinguished between a wanton — a being that acts on whatever desire is strongest without preferences about its own motivational structure — and a person, who can endorse or repudiate their own desires. Applied to attention, this distinction generates what may be the paper's most precise tool for scoping the universality claim.

1 First-order Selects stimuli by salience or learned relevance. The computational primitive all systems share, from bacteria to transformers.
2 Metacognitive Monitors where first-order attention is directed. Am I attending to the right thing? Emerges in mature human cognition.
3 Volitional Endorses or overrides attentional allocations on the basis of values. The practitioner who notices capture by anger and redirects to the breath.
Current transformer architectures are attentional wantons. They attend, but they do not attend to their own attending.

Chain-of-thought and self-reflection prompts approximate metacognition within the first-order mechanism: the model generates text about its own reasoning, but this is pattern completion that mimics the form of metacognition without instantiating the capacity. The difference is between a thermostat that displays its own temperature reading and a thermostat that can evaluate whether its temperature-sensing function is working properly.

Multiple contemplative traditions — Buddhist, Christian, Sufi, yogic — developed, across cultures with minimal historical contact during their formative periods, systematic practices for training exactly this capacity: attention attending to itself. The neuroscience suggests they converge at a deeper level than technique. Experienced practitioners across all traditions studied show the same neural signature: reduced default mode network activity. When a Buddhist reports insight into no-self, a Christian mystic reports the self falling away, and a yogi reports cessation of mental fluctuations, they may be reporting the same neural event — the attenuation of self-referential processing — through different doctrinal lenses. The apparent incompatibility of their endpoints is an incompatibility of descriptions, not of destinations.

Whether this frontier is crossable without phenomenal consciousness is among the deepest questions the architecture of attention raises.