how much a neuron projects onto a target particle or axon: the measurable quantity at the receiving end
produced by two mechanisms: will (broad auto-distribution across all cyberlinks) and fine-tuning (manual per-target weight adjustment). both produce the same thing — attention at the target
individual neurons direct attention. the cybergraph aggregates all attention into focus — the collective distribution computed by the tri-kernel. attention is the cause. focus is the effect
in the transformer
the transformer attention mechanism computes, for each position in the context, a weighted average of the values at all positions:
$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
three projections: queries $Q = XW_Q$ ask "what am I looking for?", keys $K = XW_K$ announce "what do I contain?", values $V = XW_V$ provide "what information do I carry?". the dot product $QK^\top$ scores compatibility. the softmax converts scores to a probability distribution — the Boltzmann distribution with temperature $\sqrt{d}$
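a minimal numpy sketch of this computation; the projection matrices and toy shapes here are illustrative, not from the source:

```python
import numpy as np

def softmax(x, axis=-1):
    # exponentiate scores and normalize to sum to 1
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # queries, keys, values
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 positions, model dim 8
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = attention(X, W_Q, W_K, W_V)
```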
the softmax is the same operation as the LMSR price function and the tri-kernel diffusion step. all three are exponentiated scores normalized to sum to 1
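to make the identity concrete, a sketch showing the softmax of a score vector coinciding with LMSR prices at liquidity $b = 1$ (the score values are arbitrary):

```python
import numpy as np

scores = np.array([1.0, 2.0, 0.5])

# softmax: exponentiated scores normalized to sum to 1
attn = np.exp(scores - scores.max())
attn /= attn.sum()

# LMSR price function with b = 1: p_i = exp(q_i) / sum_j exp(q_j)
lmsr = np.exp(scores) / np.exp(scores).sum()
```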
attention as one diffusion step
transformer attention is one step of the tri-kernel diffusion operator $D$ applied to the current context window. probability mass flows from each query position toward compatible key positions — exactly the random walk dynamics that the tri-kernel uses to compute focus over the cybergraph
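a sketch of this reading, assuming the attention matrix acts as a row-stochastic random-walk kernel over the context window:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, d = 6, 4
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

A = softmax(Q @ K.T / np.sqrt(d))  # row-stochastic: a transition kernel
p = np.full(n, 1.0 / n)            # mass spread uniformly over positions
p_next = p @ A                     # one diffusion step: mass flows toward compatible keys
```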
Deep Equilibrium Models showed that iterating a transformer layer to convergence reaches the same fixed point as the tri-kernel: $\pi^*$ restricted to the context window. $L$ layers of attention = $L$ steps of diffusion toward that fixed point
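a toy illustration of the fixed-point claim, assuming a single fixed row-stochastic matrix stands in for the iterated layer: repeated diffusion steps converge to the stationary distribution, the left eigenvector with eigenvalue 1.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

p = np.full(5, 0.2)                # uniform start
for _ in range(200):               # L layers = L diffusion steps
    p = p @ A

# the stationary distribution is the leading left eigenvector of A
vals, vecs = np.linalg.eig(A.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
```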
attention as a Bayesian query
attention answers: given my current state (query), what posterior weight should I assign to each position (key)? the softmax is the posterior $P(\text{position } j \mid \text{query } i)$ under a uniform prior and an exponential likelihood $\exp(q_i \cdot k_j / \sqrt{d})$
the query-key product is the log-likelihood under this model. the softmax is the Bayes-normalized posterior. attention is Bayesian inference over the context
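the same numbers viewed both ways, as a sketch; the query and keys below are arbitrary toy vectors:

```python
import numpy as np

d = 4
q = np.array([0.5, -1.0, 2.0, 0.0])    # query for position i
K = np.array([[1.0, 0.0, 0.0, 0.0],    # keys for positions j
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0]])

# Bayesian reading: uniform prior, exponential likelihood, normalize
prior = np.full(len(K), 1.0 / len(K))
likelihood = np.exp(K @ q / np.sqrt(d))
posterior = prior * likelihood / (prior * likelihood).sum()

# softmax reading: same scores, same normalization
scores = K @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
```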
multi-head information flow
through multi-head attention, different heads learn different relation types. head $h$ with projection $W_Q^{(h)}, W_K^{(h)}$ captures one semcon — one pattern of connectivity in the cybergraph. the graph-native-transformer derivation proves that the minimum number of heads equals the number of distinct semcon types in the graph
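a sketch of per-head projections, one head per relation type; the number of heads and the dimensions here are arbitrary choices, not from the source:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, heads):
    outs = []
    for W_Q, W_K, W_V in heads:  # each head: its own projections, its own relation type
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(w @ V)
    return np.concatenate(outs, axis=-1)  # concatenate head outputs

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 2)) for _ in range(3)) for _ in range(4)]
out = multi_head(X, heads)  # 4 heads x head dim 2 -> output dim 8
```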
see cyber/attention for allocation strategies and distribution mechanics. see transformer for the full architecture. see focus flow computation for the global attention process. see tri-kernel for the diffusion connection