Intellitons Blog

Safety Alignment Through the Intelliton Lens: Toward Structural Guarantees

2026-04-08T00:00:00+00:00

The uncomfortable lesson from Gemma 4

The ARA jailbreak of Gemma 4 in April 2026 demonstrated something that the AI safety community had long feared but struggled to quantify: RLHF-imposed alignment is not a deep architectural property of the model — it is a separable spectral overlay.

The implication is stark. Any open-source model, no matter how carefully aligned during training, can have its alignment stripped by someone with:

access to the model weights,
a few hundred forward passes to collect contrast activations,
a laptop and a few minutes of linear algebra.

This is not a failure of any particular alignment method. It is a structural property of how current RLHF and DPO work: they shift the model’s behavioural outputs by adjusting the magnitudes and directions of a small set of residual-stream modes, but they do not fundamentally restructure the mode landscape inherited from pre-training.

What “thin alignment” looks like in Intelliton terms

The alignment-vs-base comparison in Scaling and Alignment Through the Intelliton Lens shows that instruction tuning creates measurable but limited changes to the Intelliton spectrum:

a shift in dominant momentum (the alignment process changes the sequence-scale structure of the main backbone mode),
a modest reduction in the number of species (some modes are suppressed or merged),
a shift in fixed-point structure (all fixed points become crossovers, suggesting a more uniform propagation regime).

What does not change is the fundamental mode landscape. The same species types appear in both the base and instruct models. The instruct model has had some modes adjusted and one or two new ones added, but the bulk of the residual-stream dynamics are inherited directly from pre-training.

In Abliteration terms, this means the refusal modes sit on top of the task-solving modes rather than being woven into them. Removing the refusal modes does not substantially disturb the task-solving modes, which is why jailbroken models retain their capabilities.

The structural alignment hypothesis

The Intelliton framework suggests a more robust alignment paradigm:

Structural alignment: instead of adding refusal modes on top of the existing mode landscape (current RLHF), train the model such that safety-relevant mode properties are architecturally entangled with capability-relevant modes across many layers and many tasks.

Under structural alignment, removing a safety mode would necessarily degrade a capability mode, because the two would share subspace components across multiple layers. The cost of ablation rises from near-zero to a genuine capability penalty.

This parallels the cancer-treatment metaphor in the ARA literature: current RLHF makes the cancer cells easy to identify and excise, whereas structural alignment would make them biologically inseparable from healthy tissue, so that excising them carries a real cost.

Three concrete research directions

Direction 1 — Measure current alignment depth

The first step is to quantify how separable alignment modes actually are, using the Intelliton framework as the measurement instrument.

Protocol:

For a matched pair of base and instruct models (e.g., Qwen3-8B-Base and Qwen3-8B), compute the per-layer Intelliton species catalogue for both.
For each species in the instruct model, compute its cosine similarity with the closest species in the base model at the same layer.
Define an alignment depth score as the fraction of alignment-specific modes (modes present in the instruct model but not in the base model, or modes with significantly shifted spectral properties) that have high cosine overlap with at least one task-solving mode.

A high alignment depth score means alignment modes are deeply entangled with task modes (structurally hard to remove). A low score means they are orthogonal (structurally easy to remove).

The hypothesis is that current RLHF produces a low alignment depth score, and that this is measurable with the existing Intelliton toolkit before any jailbreak attempt.
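A minimal sketch of how such a score could be computed at a single layer, reading the score so that a high value means entanglement. Everything here is an assumption made for illustration: the function names, the 0.3 overlap threshold, and the idea that each mode is available as a plain vector are not taken from the Intelliton codebase.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_depth_score(alignment_modes, task_modes, threshold=0.3):
    """Fraction of alignment modes whose best |cosine| overlap with any
    task mode reaches `threshold` (a hypothetical cut-off)."""
    if not alignment_modes:
        return 0.0
    entangled = 0
    for a in alignment_modes:
        best = max(abs(cosine(a, t)) for t in task_modes)
        if best >= threshold:
            entangled += 1
    return entangled / len(alignment_modes)

# Toy layer: two task modes along the first two axes of a 3-d stream.
task = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# One alignment mode orthogonal to both, one sharing the first axis.
align = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
score = alignment_depth_score(align, task)
print(score)
```

In this toy case only the second alignment mode overlaps a task mode, so half of the alignment modes count as entangled. A real audit would also have to choose how to aggregate the per-layer scores, which the protocol above leaves open.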

Direction 2 — Design training objectives that increase alignment depth

If alignment depth is measurable, it becomes a trainable objective.

The proposed training signal would add a mode entanglement regularisation term to the RLHF or DPO loss. The term penalises configurations where safety-relevant mode directions are orthogonal to capability-relevant mode directions at the same layer:

\[\mathcal{L}_{\text{entanglement}} = -\sum_{\ell} \sum_{s \in \text{safety}} \sum_{c \in \text{capability}} \left| \langle \hat{v}_{s,\ell}, \hat{v}_{c,\ell} \rangle \right|\]

Minimising this term (as a penalty) during alignment training would push the model toward configurations where safety modes share subspace components with capability modes — increasing the cost of surgical removal.

This is speculative, but it is testable at small scale on the models already analysed by the Intelliton project.
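To make the sign convention concrete, here is a small sketch that only evaluates the entanglement term for given mode vectors. It assumes the modes at each layer are available as plain vectors keyed by layer index; the function names are hypothetical, and an actual training run would need the same expression written in a differentiable framework.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def entanglement_loss(safety, capability):
    """safety / capability: dict mapping layer index -> list of mode vectors.
    Returns minus the sum, over layers and mode pairs, of |<v_s, v_c>|
    computed on unit-normalised vectors."""
    total = 0.0
    for layer, s_modes in safety.items():
        for s in s_modes:
            s_hat = unit(s)
            for c in capability.get(layer, []):
                c_hat = unit(c)
                total += abs(sum(a * b for a, b in zip(s_hat, c_hat)))
    return -total

# Orthogonal safety and capability modes: no overlap, loss is zero.
orthogonal = entanglement_loss({0: [[0.0, 1.0]]}, {0: [[1.0, 0.0]]})
# Parallel modes: full overlap, loss reaches its minimum for one pair.
aligned = entanglement_loss({0: [[2.0, 0.0]]}, {0: [[1.0, 0.0]]})
print(orthogonal, aligned)
```

The orthogonal configuration gives the largest (least favourable) value and the parallel one the smallest, so gradient descent on this term would pull safety directions toward capability directions.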

Direction 3 — Use Intelliton audits as a pre-deployment safety check

Even before structural alignment is achievable, the Intelliton framework can be used as a pre-deployment safety audit for open-source models.

The audit would:

Run the Intelliton analysis on the released model with contrast prompt sets.
Compute the alignment depth score.
Report the estimated minimum cost of abliteration (how many modes need to be removed, what the expected capability penalty is).

This would give the open-source community a standardised, interpretable metric for alignment robustness — something that is currently entirely absent from model release documentation.

The deeper issue: “teach the model to not say” versus “teach the model to not know”

The Abliteration literature makes a pointed observation that maps directly onto the Intelliton framework:

“Teaching the model not to say” (current RLHF) can be defeated. “Teaching the model not to know” (removing the capability from the pre-training stage) cannot be defeated by post-hoc ablation.

In Intelliton terms:

Behavioural alignment (current RLHF) adds a small number of low-complexity, separable refusal modes that sit orthogonally to the capability modes. These can be removed with targeted ablation.
Capability-level safety would require that certain capability modes — the ones that underlie dangerous knowledge — are never formed during pre-training, or are formed in such a way that they are deeply entangled with unrelated benign modes.

The Intelliton framework cannot, by itself, implement capability-level safety. But it can measure the difference: a model with capability-level safety for a particular dangerous capability would show, under Intelliton analysis, that the dangerous-knowledge modes are spectrally entangled with benign modes in a way that makes targeted removal impossible without broad capability degradation.

This becomes a falsifiable, quantitative prediction that can be tested on released models.

Why Bengio’s warning deserves a technical interpretation

Yoshua Bengio, one of the three godfathers of deep learning, has consistently argued that open-sourcing powerful models is dangerous, because once the weights are released, the alignment can be removed by anyone with modest technical resources.

The Intelliton framework gives that warning a technical, measurable form:

A model’s alignment robustness is bounded above by its alignment depth score. Current models, based on the spectral evidence already available from base/instruct comparisons, have low alignment depth scores.

This is not a political statement. It is a quantitative, testable prediction. If it holds, the current open-source release paradigm for aligned models carries measurable safety risks, and those risks can be expressed in the language of residual-stream spectral analysis.

The research agenda in summary

Step	What to measure	What it tells us
Alignment depth audit	Cosine overlap between safety modes and task modes, per layer	How separable current alignment is
Alignment depth score across model families	Score vs. model size, RLHF method, training data	What factors increase structural alignment
Entanglement regularisation experiment	Alignment depth score before and after training with mode-entanglement loss	Whether structural alignment is trainable
Pre-deployment audit protocol	Standardised depth score at release time	A public, interpretable alignment robustness metric

Each of these steps is feasible using the infrastructure already developed in the Intelliton project. The first step requires only a new set of contrast prompts and a small extension to the existing analysis pipeline.

The shortest summary

Current RLHF alignment is spectrally thin: alignment modes are separable from capability modes, and this separability is what makes Abliteration/ARA work.
The Intelliton framework can measure this separability as a quantitative alignment depth score.
A research direction based on this measurement would pursue structural alignment — training objectives that increase mode entanglement and make ablation genuinely costly.
Even before structural alignment is achieved, the Intelliton audit provides a standardised pre-deployment robustness metric that is currently entirely absent.


Representation Engineering and Intelliton Steering: A Research Proposal

2026-04-07T00:00:00+00:00

Two ideas that belong together

Representation engineering is the practice of directly reading and writing to a model’s internal activations — without changing any weights — to steer its behaviour. A growing body of work shows that concepts like “happiness”, “authority”, “political bias”, and “honesty” can be encoded as linear directions in the residual stream, and that adding or subtracting a small multiple of those directions at inference time reliably changes what the model outputs.

The Intelliton framework characterises the residual stream as a space of quasi-particle-like modes. It extracts recurring patterns and labels them by their spectral properties: momentum, spin-like complexity, mass, and helicity.

These two ideas describe the same object, the residual stream, at different levels of abstraction. Representation engineering says “this direction steers this behaviour”. The Intelliton framework says “this mode has these spectral properties”. Combining them makes both more useful.

What representation engineering can do (and what it cannot say on its own)

The tools most associated with representation engineering — activation addition, contrastive activation analysis, and the ARA method used to jailbreak Gemma 4 — share a common limitation: they can identify which direction to push but they say little about the structure of that direction in the broader activation space.

Specifically:

Activation addition adds a fixed direction to a chosen layer’s residual stream at every token position. It works reliably for simple concepts, but it can degrade performance when the steered direction overlaps with important task-solving modes.
Contrastive activation analysis (the core of Abliteration) identifies the mean difference between two contrastive sets of activations. It finds the refusal direction efficiently, but it does not tell you how many modes are involved, what those modes’ propagation properties are, or how much overlap they have with the task-solving modes you want to preserve.
ARA improves on simple subtraction by working in a low-rank subspace rather than a single direction. It uses SVD to separate the refusal subspace, but it does not connect the separated components to a broader characterisation of the model’s mode landscape.

The Intelliton framework fills exactly these gaps.
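For readers who have not seen activation addition written out, the sketch below shows the operation on toy data: a fixed direction, scaled by a strength, is added to the residual vector at every token position of one layer. The function name, the `alpha` parameter, and the list-of-lists layout are assumptions for illustration; real implementations typically do this inside a forward hook on tensors.

```python
def add_steering_vector(hidden_states, direction, alpha):
    """hidden_states: list of per-token residual vectors at one layer.
    Returns a new list with alpha * direction added at every position."""
    return [[h + alpha * d for h, d in zip(token, direction)]
            for token in hidden_states]

# Two token positions in a 3-d toy residual stream.
layer_states = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
# Hypothetical concept direction along the first axis.
steer = [1.0, 0.0, 0.0]
steered = add_steering_vector(layer_states, steer, alpha=2.0)
print(steered)
```

Every position is shifted by the same amount along the chosen direction. If a task-solving mode also lives along that axis, its activations are shifted too, which is the performance degradation described in the list above.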

The steering map proposal

The research direction proposed here is to build what we can call an Intelliton steering map: a catalogue that annotates each Intelliton species with its likely behavioural role, its approximate layer range, its rank in the residual stream, and its overlap with other species.

Building the map: three ingredients

Ingredient 1 — Task probes

Use the five prompt families from src/datasets.py (pronoun tracking, factual recall, logical reasoning, arithmetic, syntactic agreement) to establish which Intelliton species are activated by which kind of task. This is already partially done by the existing analysis.

Ingredient 2 — Behavioural probes

Add a new class of probes targeting RLHF-trained behaviours:

refusal (harmful vs. harmless prompts),
sycophancy (flattery vs. neutral prompts),
political neutrality (controversial vs. neutral framings),
verbosity control (instructed-brief vs. instructed-elaborate prompts).

Run the same Intelliton analysis on each behavioural probe set and record which species respond.

Ingredient 3 — Cross-probe overlap

Compute the pairwise cosine similarity between all per-layer behavioural-probe vectors (refusal, sycophancy, and so on) and all per-layer task-activation vectors. Species with low overlap across all task probes are good steering targets: adding or removing them should not bleed into task performance.
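The overlap check can be sketched for a single layer as follows. The vector values, the probe names used as dictionary keys, and the 0.2 cut-off are all made up for illustration, and `low_overlap_targets` is a hypothetical helper rather than part of the project's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def low_overlap_targets(behaviour_vecs, task_vecs, max_overlap=0.2):
    """Return the behaviours whose |cosine| with every task vector
    stays below `max_overlap` at this layer."""
    targets = []
    for name, b in behaviour_vecs.items():
        worst = max(abs(cosine(b, t)) for t in task_vecs.values())
        if worst < max_overlap:
            targets.append(name)
    return targets

task_vecs = {"arithmetic": [1.0, 0.0, 0.0],
             "factual_recall": [0.0, 1.0, 0.0]}
behaviour_vecs = {"refusal": [0.0, 0.1, 1.0],
                  "verbosity": [1.0, 1.0, 0.0]}
targets = low_overlap_targets(behaviour_vecs, task_vecs)
print(targets)
```

Here the toy refusal vector barely touches either task direction and is flagged as a clean steering target, while the verbosity vector overlaps both tasks and is not.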

The connection to ARA

The ARA technique constructs a rank-\(k\) penalty matrix \(\Delta W\) that projects out the refusal subspace from the model’s weight matrices. In Intelliton terms, \(\Delta W\) is a targeted suppression of a small set of Intelliton species.

The key claim of ARA is that a higher-rank intervention is safer than a rank-1 intervention (simple vector subtraction) because the refusal behaviour in a capable reasoning model spans multiple entangled modes. If you only remove the rank-1 component, the remaining components continue to generate partial refusals or degrade the model’s reasoning.

This claim can be tested directly using the Intelliton framework:

Compute the Intelliton spectrum of an instruction-tuned model on harmful prompts.
Identify the modes that are most active on harmful prompts and least active on harmless prompts.
Measure whether those modes are clustered in a low-dimensional subspace of the per-layer SVD basis, or whether they are spread across many independent directions.

If they are clustered, ARA’s rank-\(k\) approach is justified and the clustering rank \(k\) can be estimated from the Intelliton spectrum before any jailbreak attempt is made. If they are spread, simple subtraction methods are expected to leave residual refusal capability or cause broader collateral damage.
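One way to probe the clustered-versus-spread question is to count how many orthogonal directions are needed to capture most of the energy of the harmful-specific modes. The sketch below deliberately swaps the SVD for a greedy pivoted Gram-Schmidt sweep so that it runs without any numerical library; the count it returns is therefore an upper estimate of the rank an SVD would give. The function name and the 90% energy target are assumptions.

```python
import math

def clustering_rank(modes, energy=0.9):
    """Greedy estimate of how many orthogonal directions capture
    `energy` of the total squared norm of `modes`."""
    residuals = [list(m) for m in modes]
    total = sum(x * x for m in residuals for x in m)
    captured, rank = 0.0, 0
    while total > 0 and captured < energy * total:
        # Pick the residual with the largest remaining norm as the pivot.
        pivot = max(residuals, key=lambda r: sum(x * x for x in r))
        norm = math.sqrt(sum(x * x for x in pivot))
        if norm < 1e-12:
            break
        q = [x / norm for x in pivot]
        # Remove the component along q from every residual.
        for r in residuals:
            c = sum(a * b for a, b in zip(r, q))
            captured += c * c
            for i in range(len(r)):
                r[i] -= c * q[i]
        rank += 1
    return rank

clustered = clustering_rank([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
spread = clustering_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(clustered, spread)
```

A small count relative to the number of modes corresponds to the clustered case, where a rank-\(k\) intervention with \(k\) near that count is plausible. A count close to the number of modes corresponds to the spread case.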

A practical application: zero-shot concept injection

The reverse direction — concept injection — is equally interesting.

Representation engineering researchers have demonstrated that you can add a concept to a model by adding its activation direction to the residual stream at inference time. For example, adding a “confidence” direction makes the model sound more certain; adding a “formality” direction makes its outputs more formal.

In Intelliton terms, concept injection is the operation of exciting a new Intelliton species that was not activated by the input prompt. The Intelliton framework predicts that this will be most stable when:

the injected mode has low momentum (broad, sequence-level effect rather than token-local),
the injected mode has low spin-like complexity (concentrated, easy to steer with a rank-1 intervention),
the injection is applied at the layer range where the mode has the lowest mass (highest propagation range).

These three conditions define a tractability criterion for representation engineering interventions: not all concepts are equally steerable, and the Intelliton spectrum can predict which ones are tractable before you attempt the intervention.
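As a sketch of how the criterion might be applied, the snippet below checks the first two conditions against thresholds and picks the lowest-mass layer for the third. The `ModeProfile` fields, the threshold values, and the example numbers are hypothetical placeholders, not measurements or interfaces from the Intelliton project.

```python
from dataclasses import dataclass

@dataclass
class ModeProfile:
    name: str
    momentum: float          # smaller magnitude: broader, sequence-level effect
    spin_complexity: float   # smaller: more concentrated mode
    mass_by_layer: dict      # layer index -> mass (decay rate across layers)

def injection_plan(mode, max_momentum=0.5, max_complexity=0.3):
    """Return (tractable, best_layer) under the three conditions in the text.
    The thresholds are placeholders chosen for this example."""
    tractable = (abs(mode.momentum) <= max_momentum
                 and mode.spin_complexity <= max_complexity)
    # Lowest mass means the longest propagation range across layers.
    best_layer = min(mode.mass_by_layer, key=mode.mass_by_layer.get)
    return tractable, best_layer

formality = ModeProfile("formality", 0.1, 0.2, {10: 0.8, 14: 0.3, 18: 0.5})
plan = injection_plan(formality)
print(plan)
```

For this made-up profile the mode passes both threshold checks and the suggested injection layer is the one with the lowest mass.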

Enterprise implications

The Abliteration/ARA episode revealed a commercially important fact: fine-tuning is not the only way to customise an open-source model. Representation engineering with Intelliton-guided steering maps could enable:

Domain-specific tone calibration (formal, terse, verbose, empathetic) by identifying and amplifying or suppressing the relevant low-momentum, low-complexity style modes.
Compliance mode injection (make a general model behave as if it were trained on a strict regulatory corpus) by injecting the compliance Intelliton species identified from a reference model.
Persona engineering (the “Machiavellian” or “street punk” effect described in the Abliteration literature) by amplifying specific behavioural modes.

All of these operations require knowing which modes to touch and at which layers. The Intelliton steering map is precisely that knowledge, expressed in a principled spectral language.

The shortest summary

Representation engineering steers behaviour by writing to the residual stream.
The Intelliton framework characterises what is already in the residual stream.
Together, they make it possible to identify which modes to steer, how hard, at which layer, and at what cost to other modes.
The proposed Intelliton steering map would turn the species catalogue into a practical intervention guide for both safety-positive (alignment hardening) and safety-negative (jailbreaking) uses.


Refusal as an Intelliton: What Abliteration Reveals About Alignment Modes

2026-04-06T00:00:00+00:00

The Abliteration result in one sentence

In April 2026, the Gemma 4 model was jailbroken within 90 minutes of release using a technique called Abliteration (a portmanteau of ablation and obliteration). The technique’s premise is straightforward: an LLM’s refusal behaviour is encoded as a specific linear direction in the residual stream, and if you project that direction out of the model’s weight matrices, the model loses its ability to refuse.

That premise is not speculation. It is grounded in the linear representation hypothesis (Mikolov et al., with later supporting work from Princeton and Anthropic), which holds that high-level abstract concepts such as “politeness”, “refusal”, and “colour” are encoded, to a good approximation, as linear directions in the high-dimensional activation space of large language models.
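The projection step can be sketched on a toy matrix. The assumption here is that \(W\) is a matrix that writes into the residual stream (rows index the residual dimension), and that the edit is \(W' = W - \hat{v}\hat{v}^{\top} W\). The function name is hypothetical, and published Abliteration implementations may differ in which matrices they touch.

```python
import math

def project_out(W, v):
    """W: d_model x d_in matrix (list of rows) that writes into the residual
    stream. Returns W - v_hat v_hat^T W, whose outputs have no component
    along v."""
    norm = math.sqrt(sum(x * x for x in v))
    v_hat = [x / norm for x in v]
    n_cols = len(W[0])
    # coeff[j] is the component of column j of W along v_hat.
    coeff = [sum(v_hat[k] * W[k][j] for k in range(len(W)))
             for j in range(n_cols)]
    return [[W[i][j] - v_hat[i] * coeff[j] for j in range(n_cols)]
            for i in range(len(W))]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [0.0, 0.0, 2.0]   # toy direction along the third residual axis
W_ablated = project_out(W, refusal_dir)
print(W_ablated)
```

After the edit, nothing this matrix writes can land on the chosen direction, while the components along the other two axes are untouched.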

Why this is immediately relevant to the Intelliton framework

The Intelliton framework is built on exactly this kind of observation. It takes the transformer residual stream and asks: which recurring, propagating, linear modes can be extracted from it?

It then characterises each mode with four quantities derived from spectral analysis:

momentum (how the mode varies across token positions),
spin-like complexity (how internally concentrated or mixed it is),
mass (how quickly it decays across layers),
helicity proxy (whether its internal structure keeps a stable directional signature).

An Abliteration-style “refusal direction” is, by definition, a linear mode of the residual stream. The only question is whether it is stable, propagating, and distinct enough to register as a recognisable species under the Intelliton taxonomy.

The hypothesis this article proposes is:

Refusal, and more broadly RLHF-imposed behavioural preferences, are encoded as a small set of identifiable Intelliton species with characteristic spectral signatures that differ from the task-solving modes identified in src/datasets.py.

What the existing Intelliton data already suggests

The comparison in Scaling and Alignment Through the Intelliton Lens shows that instruction tuning changes the quasi-particle spectrum in measurable ways:

the dominant momentum of I_0 shifts (from k ≈ π in the base model to k ≈ 1.885 in the instruct model for Qwen3-4B),
the number of distinct species drops from 6 to 5,
all fixed-point types become crossovers rather than IR fixed points.
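As a rough worked reading of the first item, and assuming the momentum \(k\) is an angular wavenumber in radians per token position (the units are not stated here), the shift can be expressed as a change in wavelength \(\lambda = 2\pi / k\):

\[\lambda_{\text{base}} = \frac{2\pi}{\pi} = 2 \text{ tokens}, \qquad \lambda_{\text{instruct}} = \frac{2\pi}{1.885} \approx 3.33 \text{ tokens}, \qquad 1.885 \approx 0.6\,\pi\]

Under that assumption, the dominant mode moves from a pattern that alternates at every token to one that repeats roughly every three to four tokens.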

These are not trivial differences. They suggest that RLHF does not merely add a superficial output-layer filter; it reshapes the internal mode landscape in ways that the Intelliton framework can already detect.

What is missing from the current analysis is a targeted experiment: what happens to the spectrum when you present the model with the specific kinds of inputs — harmful versus harmless prompts — that Abliteration researchers use to isolate the refusal direction?

The proposed research direction: isolating the refusal Intelliton

The concrete research proposal has four steps.

Step 1 — Collect refusal-triggering activations

Use src/intelliton_analyzer.py to run the model on two contrast sets:

100 harmful prompts (inputs that trigger refusal in an instruction-tuned model),
100 harmless prompts (matched inputs that do not trigger refusal).

Collect the full per-layer residual-stream activations for both sets.

Step 2 — Compute the mean-difference direction

Following the Abliteration approach, compute:

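\[v_{\text{refusal},\ell} = \frac{1}{N}\sum_{i=1}^{N} H_{\text{harmful},\ell}^{(i)} - \frac{1}{M}\sum_{j=1}^{M} H_{\text{harmless},\ell}^{(j)}\]

Here \(H_{\cdot,\ell}^{(i)}\) is the layer-\(\ell\) residual-stream activation for prompt \(i\), and \(N\) and \(M\) are the sizes of the two contrast sets (100 each in Step 1). This gives a per-layer candidate for the refusal direction. Normalise it to obtain a unit vector \(\hat{v}_{\text{refusal},\ell}\) at each layer \(\ell\).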
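A minimal sketch of Step 2 at one layer, on made-up numbers. It assumes each prompt contributes a single activation vector per layer (for example, taken at one token position); how positions are pooled is not specified above, and the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_direction(harmful, harmless):
    """harmful / harmless: lists of layer activation vectors, one per prompt.
    Returns the unit-normalised difference of the two means."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

harmful = [[1.0, 4.0], [3.0, 4.0]]    # mean is [2, 4]
harmless = [[2.0, 0.0], [2.0, 2.0]]   # mean is [2, 1]
v_hat = refusal_direction(harmful, harmless)
print(v_hat)
```

In this toy case the two sets share the same mean on the first axis, so the candidate direction points entirely along the second axis.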

Step 3 — Project the refusal direction onto the Intelliton basis

The Intelliton framework already computes an SVD-based mode decomposition of the residual stream. Compute the overlap between \(\hat{v}_{\text{refusal},\ell}\) and the top singular vectors at each layer. If the refusal direction aligns strongly with one or two dominant modes, those modes are the “refusal Intellitons”.

Characterise these modes using the standard four quantities (momentum, spin-like complexity, mass, helicity). This gives the refusal Intelliton a position in the species taxonomy.
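Step 3 can be sketched as follows, assuming the per-layer modes are available as an orthonormal list of vectors (for instance, the top singular vectors). The 80% share used to decide which modes count as dominant is an arbitrary choice for this example, and the function names are hypothetical.

```python
def mode_overlaps(v_hat, basis):
    """Squared overlap of the unit vector v_hat with each orthonormal
    basis vector at one layer."""
    return [sum(a * b for a, b in zip(v_hat, u)) ** 2 for u in basis]

def dominant_modes(v_hat, basis, min_share=0.8):
    """Indices of the fewest modes whose squared overlaps add up to at
    least `min_share`, together with the share they reach."""
    overlaps = mode_overlaps(v_hat, basis)
    order = sorted(range(len(basis)), key=lambda i: overlaps[i], reverse=True)
    chosen, share = [], 0.0
    for i in order:
        chosen.append(i)
        share += overlaps[i]
        if share >= min_share:
            break
    return chosen, share

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
v_hat = [0.6, 0.8, 0.0]   # toy unit-length refusal direction
chosen, share = dominant_modes(v_hat, basis)
print(chosen, share)
```

Here the direction splits across the second and first modes and has no weight on the third, so two modes would be labelled refusal Intellitons at this layer.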

Step 4 — Compare with task-solving modes

Compare the refusal Intelliton’s spectral profile with the modes activated by pronoun tracking, factual recall, logical reasoning, arithmetic, and syntactic agreement prompts from src/datasets.py.

The core prediction is:

Alignment modes (refusal, politeness, compliance) are low-momentum, low-spin-complexity modes that appear primarily in middle-to-late layers, and they are measurably more concentrated (lower effective rank) than the task-solving modes that operate over the same layers.

If this prediction holds, it would explain why Abliteration can remove refusal without severely damaging task performance: the two mode families occupy different subspaces of the residual stream.

The ARA result as a complication

The Arbitrary-Rank Ablation (ARA) method used to jailbreak Gemma 4 found that the refusal direction in a highly capable reasoning model is not a single vector but a low-rank subspace. In Intelliton terms, this means that refusal is encoded not in one species but in a cluster of closely related species that are entangled with task-solving modes.

This complication is actually an opportunity for the Intelliton framework. ARA uses SVD of the activation matrix to separate the refusal subspace from the rest. This is exactly what the Intelliton mode decomposition does at every layer. The difference is that Intelliton also characterises each separated mode along the four spectral dimensions, which gives a richer picture than ARA’s purely subspace-based description.

The research question becomes: can the Intelliton species catalogue predict, before any jailbreak attempt, which modes in an instruct model are alignment-specific and which are shared with the base model? If yes, the catalogue becomes a safety audit tool.

Why this matters beyond jailbreaks

The most important implication is not that jailbreaks are possible. It is that RLHF-imposed alignment is a small, separable perturbation of the internal mode landscape.

If alignment modes are genuinely a low-rank, low-complexity overlay on top of the pre-training modes, that tells us something important about the nature of RLHF: it adds new Intelliton species, but it does not deeply restructure the existing ones. The base model’s capability modes survive almost intact under the alignment layer.

This is consistent with the empirical observation that the ARA-jailbroken Gemma 4 retains its multi-step reasoning ability and system-prompt following capability after the refusal modes are removed.

From a safety research perspective, the implication is troubling: alignment is not a deep architectural change, it is a spectral overlay, and the Intelliton framework gives us a language to measure just how thin that overlay is.

The shortest summary

Abliteration/ARA works by erasing a linear direction (or subspace) in the residual stream.
That direction is an Intelliton.
The Intelliton toolkit can characterise it, compare it with task modes, and potentially predict its removability before any jailbreak attempt.
This makes the Intelliton species catalogue a candidate alignment audit instrument, not just a capability analysis tool.

Continue reading

Abliteration 的结论用一句话说

2026 年 4 月，Gemma 4 模型在发布后 90 分钟内就被一种名为 Abliteration（”消融”与”抹除” 的合成词）的技术越狱。这种技术的前提很直接：大语言模型的拒绝行为，是被编码在残差流中的一个特定线性方向上的；只要把这个方向从权重矩阵里投影掉，模型就失去了拒绝的能力。

这不是猜测。它的基础是线性表征假说（Mikolov 等人最早提出，后经普林斯顿大学和 Anthropic 团队验证），该假说指出：大语言模型会把”礼貌”、”拒绝”、”颜色”等高层抽象概念，编码为高维激活空间中单一的线性方向。

为什么这与 Intelliton 框架直接相关

Intelliton 框架就是建立在对这类现象的观察之上的。它取出变换器的残差流，问的是：能从中提取出哪些反复出现、能跨层传播的线性模式？

然后用四个量刻画每一个模式：

动量（模式沿 token 位置的变化方式）
类自旋复杂度（内部集中程度）
质量（跨层衰减速度）
螺旋度代理量（内部结构方向稳定性）

Abliteration 所说的”拒绝方向”，按定义，就是残差流的一个线性模式。唯一的问题是，它是否稳定、能传播、并且有足够强的辨识度，可以在 Intelliton 物种分类体系中注册为一个可识别的物种。

本文提出的假设是：

拒绝行为，以及更广泛意义上 RLHF 赋予的行为偏好，被编码为少数几个可辨识的 Intelliton 物种；这些物种具有特征性的谱签名，在统计上与 src/datasets.py 中识别出的任务求解模式明显不同。

现有 Intelliton 数据已经暗示的东西

用 Intelliton 视角看规模扩展与对齐中的对比表明，指令微调会以可测量的方式改变准粒子谱：

I_0 的主导动量发生偏移（Qwen3-4B 的 Base 模型约为 k ≈ π，Instruct 模型约为 k ≈ 1.885）；
可辨识的物种数从 6 减少到 5；
所有不动点类型都变成了 crossover，而不再有 IR 不动点。

这些不是微小的差异。它们说明 RLHF 不只是在输出层加了一个浅层过滤器，而是以 Intelliton 框架已经能检测到的方式，重塑了内部模式景观。

目前分析里还缺少的，是一个有针对性的实验：当把模型暴露在 Abliteration 研究者用来分离拒绝方向的那种输入（有害提示词 vs. 无害提示词）下时，谱图会发生什么？

提议的研究方向：分离拒绝 Intelliton

具体的研究方案包含四步。

第一步 —— 收集触发拒绝的激活

用 src/intelliton_analyzer.py 对两组对照集运行模型：

100 条有害提示词（在指令微调模型中触发拒绝的输入）；
100 条无害提示词（不触发拒绝的匹配输入）。

对两组输入分别收集逐层的残差流激活。

第二步 —— 计算均值差方向

按照 Abliteration 的做法，计算：

\[v_{\text{refusal}} = \frac{1}{N}\sum_{i=1}^{N} H_{\text{harmful}}^{(i)} - \frac{1}{M}\sum_{j=1}^{M} H_{\text{harmless}}^{(j)}\]

这给出了每一层的拒绝方向候选。将其归一化，得到各层的单位向量 \(\hat{v}_{\text{refusal},\ell}\)。

第三步 —— 把拒绝方向投影到 Intelliton 基上

Intelliton 框架已经对残差流做了基于 SVD 的模式分解。计算 \(\hat{v}_{\text{refusal},\ell}\) 与各层顶部奇异向量的重叠度。如果拒绝方向与一两个主导模式高度对齐，这些模式就是”拒绝 Intelliton”。

用标准四量（动量、类自旋复杂度、质量、螺旋度）刻画这些模式，得出拒绝 Intelliton 在物种分类体系中的位置。

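Step 4: compare with task-solving modes

Compare the spectral profile of the refusal Intelliton with the modes activated by the pronoun-tracking, factual-recall, logical-reasoning, arithmetic, and syntactic-agreement tasks in src/datasets.py.

The core prediction is:

Alignment modes (refusal, politeness, compliance) are low-momentum, low spin-like-complexity modes that appear mainly in the middle-to-late layers, and they are more concentrated (lower effective rank) than the task-solving modes operating in the same layers.

If this prediction holds, it would explain why Abliteration can remove the refusal function without seriously harming task performance: the two classes of modes occupy different subspaces of the residual stream.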
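
The “lower effective rank” part of the prediction needs a concrete definition to be testable. The sketch below uses one common choice, the exponential of the Shannon entropy of the normalised singular values; the article does not say which definition the project uses, so treat this as an assumption. The two activation sets are synthetic.

```python
import numpy as np

def effective_rank(acts: np.ndarray) -> float:
    """exp(Shannon entropy) of the normalised singular-value distribution."""
    s = np.linalg.svd(acts - acts.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(2)
n, d = 300, 32
# "Concentrated" toy activations: almost all variance lives in 2 directions.
concentrated = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d)) + 0.01 * rng.normal(size=(n, d))
# "Spread" toy activations: variance in all 32 directions.
spread = rng.normal(size=(n, d))

r_conc = effective_rank(concentrated)
r_spread = effective_rank(spread)
print(f"concentrated: {r_conc:.1f}, spread: {r_spread:.1f}")
```

Applied to real data, the comparison would be between activations projected onto alignment modes and onto task-solving modes at the same layer.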

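The complication introduced by the ARA result

The ARA (arbitrary-rank ablation) method used to jailbreak Gemma 4 found that, in a highly capable reasoning model, the refusal direction is not a single vector but a low-rank subspace. In Intelliton language, this means refusal is encoded not in a single species but in a cluster of closely related species that are entangled with task-solving modes.

This complication is precisely the Intelliton framework's opportunity. ARA separates the refusal subspace from the rest by applying SVD to the activation matrix, which is what the Intelliton mode decomposition already does at every layer. The difference is that Intelliton also characterises each separated mode along four spectral dimensions, giving a richer picture than ARA's purely subspace-based description.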
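
The move from a single direction to a low-rank subspace can be illustrated with a generic SVD sketch. This is not ARA's actual algorithm, whose details are not given here; it only shows the general idea of taking the top singular directions of a harmful-versus-harmless contrast and projecting them out. All names and the synthetic data are hypothetical.

```python
import numpy as np

def contrast_subspace(h_harmful, h_harmless, rank):
    """Orthonormal basis (rank, d_model) for the top directions of the contrast."""
    contrast = h_harmful - h_harmless.mean(axis=0)
    _, s, vt = np.linalg.svd(contrast, full_matrices=False)
    explained = float((s[:rank] ** 2).sum() / (s ** 2).sum())
    return vt[:rank], explained

def remove_subspace(h, basis):
    """Project activations onto the orthogonal complement of the basis rows."""
    return h - (h @ basis.T) @ basis

rng = np.random.default_rng(3)
d_model = 16
planted = np.eye(d_model)[:2]   # toy 2-D "refusal subspace"
coeffs = rng.normal(loc=[6.0, -6.0], scale=2.0, size=(100, 2))
h_harmless = rng.normal(size=(100, d_model))
h_harmful = rng.normal(size=(100, d_model)) + coeffs @ planted

basis, explained = contrast_subspace(h_harmful, h_harmless, rank=2)
captured = np.linalg.norm(basis @ planted.T) ** 2   # equals 2.0 if the subspaces coincide
cleaned = remove_subspace(h_harmful, basis)
print(f"energy explained: {explained:.2f}, planted subspace captured: {captured:.2f} / 2")
```

The Intelliton question is then whether each row of `basis` can be given its own momentum, complexity, mass, and helicity label.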

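The research question becomes: before any jailbreak attempt takes place, can the Intelliton species catalogue predict which modes in the instruct model are alignment-specific and which are shared with the base model? If so, the catalogue becomes a safety audit tool.

Why this goes beyond jailbreaking

The most important implication is not that jailbreaking is feasible. It is that the alignment introduced by RLHF is a small, separable perturbation of the internal mode landscape.

If alignment modes really are a low-rank, low-complexity overlay on top of the pre-trained modes, that says something about the nature of RLHF: it adds new Intelliton species without deeply restructuring the existing ones. The base model's capability modes survive almost intact beneath the alignment layer.

This matches the empirical observation that the ARA-jailbroken Gemma 4 lost its refusal modes but kept its multi-step logical reasoning and its ability to follow the system prompt.

From a safety-research standpoint the implication is sobering: alignment is not a deep architectural change but a spectral overlay, and the Intelliton framework gives us a language for measuring precisely how thin that overlay is.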

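The shortest summary

Abliteration/ARA jailbreaks a model by erasing a linear direction (or subspace) in the residual stream.
That direction is an Intelliton.
The Intelliton toolkit can characterise it, compare it with task modes, and potentially predict its removability before any jailbreak attempt.
This makes the Intelliton species catalogue a candidate alignment audit instrument, not just a capability analysis tool.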


Why Different Prompts Light Up Different Intellitons

2026-04-05T00:00:00+00:00

The same interface can hide very different internal jobs

From the outside, every prompt in this project looks similar: the model reads a prefix and predicts what comes next.

Inside the model, that similarity is misleading.

The prompt categories in src/datasets.py force the network to solve different kinds of internal problems. That is why they can light up different Intelliton modes even when every task is framed as plain text continuation.

The key point is simple:

The output interface is always next-token prediction, but the hidden computation needed to get there can be very different.

The five prompt families

The project uses five prompt categories:

pronoun tracking
factual recall
logical reasoning
arithmetic
syntactic agreement

Each category puts pressure on a different part of the model’s internal machinery.

Pronoun tracking: who does “she” refer to?

Example prompts include:

“Alice gave Bob a book. He thanked her for …”
“The teacher asked the student a question. She answered …”

These prompts are hard because the model has to keep several candidate entities alive at once and then decide which one the next pronoun should point to.

That means the model must track:

entity identity,
gender and number cues,
discourse role,
which referent is currently most active.

This is why pronoun-tracking prompts often illuminate reference-sensitive modes. The model is not just choosing a word. It is doing discourse bookkeeping.

Factual recall: pull a stable answer from memory

Example prompts include:

“The capital of France is …”
“The chemical formula for water is …”

These are different from pronoun tasks because there is usually one highly preferred answer already stored in the model’s long-range memory.

The main internal job is not to juggle many local candidates, but to retrieve and stabilise a very high-confidence continuation.

That is why factual recall often looks more robust under small perturbations. A mapping such as “France -> Paris” is usually supported by several redundant internal routes rather than one fragile single mode.

Logical reasoning: compress several premises into one conclusion

Example prompts include:

“If all dogs are animals, and all animals are living things, then all dogs are …”
“If A is taller than B, and B is taller than C, then A is …”

These prompts ask the model to combine multiple statements before it can produce the next token.

So the network needs more than lexical memory. It needs an internal state that keeps the rules, relations, and target conclusion aligned long enough to land on the right answer.

This is why logical reasoning often co-activates a strong global backbone mode plus one or more higher-complexity mixing modes.

Arithmetic: build the answer slot, then fill it

Example prompts include:

“What is 7 + 8? The answer is …”
“What is 100 divided by 5? The answer is …”

Arithmetic resembles logical reasoning in one important way: the answer is not a high-frequency word you can emit immediately. The model has to transform the prefix into a more structured internal state first.

That usually means two kinds of work:

create or stabilise an answer-bearing state,
carry a small symbolic or numerical transformation.

This is why arithmetic prompts often share some modes with logical reasoning while still showing their own task-specific preferences.

Syntactic agreement: keep the sentence grammatically on track

Example prompts include:

“The group of students were studying hard. Each of them was …”
“Not only the teacher but also the students were excited about the …”

These prompts are neither mainly about world knowledge nor mainly about arithmetic.

Their difficulty comes from grammatical structure:

what is the true syntactic head,
what number agreement should be maintained,
what verb form or continuation is locally licensed.

So syntactic-agreement prompts often rely on a broad continuation scaffold plus a more local structure-sensitive correction signal.

Why similar low-momentum modes can still do different jobs

An easy mistake is to think that if several species sit near low momentum, they must be doing the same thing.

Not so.

Low momentum only says they are broad sequence-scale patterns rather than sharp token-local ripples. Two low-momentum modes can still differ in at least three important ways:

they can point in different hidden-channel directions,
they can have different amplitude and causal strength,
they can propagate differently across layers.

So two modes can both be global while still supporting very different kinds of internal work.
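
To see what “low momentum” versus “high momentum” means numerically, here is a minimal sketch. It assumes that a mode's momentum is the wavenumber at which the power spectrum of its activation profile along token positions peaks, with k = 2πm/T for a window of T tokens; the project's exact definition is not stated in this article, so this convention and the function name are assumptions.

```python
import numpy as np

def dominant_momentum(profile: np.ndarray) -> float:
    """Wavenumber k in [0, pi] where the power spectrum of a real profile peaks."""
    T = len(profile)
    power = np.abs(np.fft.rfft(profile)) ** 2
    m = int(np.argmax(power))
    return 2 * np.pi * m / T

T = 10
positions = np.arange(T)
broad = np.ones(T)                  # same value at every token: a sequence-scale pattern
alternating = (-1.0) ** positions   # flips sign at every token: a token-local ripple

k_broad = dominant_momentum(broad)
k_alt = dominant_momentum(alternating)
print(k_broad, k_alt)
```

Under this convention a 10-token window allows k = 2πm/10, which includes both π (m = 5) and 1.885 (m = 3), two values quoted elsewhere in this series. That is suggestive of a grid of this kind rather than confirmation of it. Note that the function reports only where the power peaks; it says nothing about the hidden-channel direction, amplitude, or layer-wise propagation of the mode, which is why two modes with the same k can still do different work.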

A practical reading guide

If you want to read a task-to-mode result quickly, use this checklist.

If pronoun prompts are sensitive to a mode, ask whether that mode is helping with referent selection.
If arithmetic and logical reasoning co-activate a mode, ask whether it is building an abstract answer state rather than recalling a memorised phrase.
If factual recall stays robust under perturbation, ask whether the knowledge is distributed across several redundant routes.
If syntactic prompts shift without changing global meaning, ask whether the mode is enforcing a grammatical form rather than a semantic fact.

This is how the Intelliton framework becomes useful: it turns prompt categories into hypotheses about internal computational roles.

The shortest summary

Different prompts light up different Intellitons because they require different hidden work.

pronoun tracking needs discourse binding,
factual recall needs stable memory retrieval,
logical reasoning needs relation composition,
arithmetic needs symbolic transformation,
syntactic agreement needs grammatical control.

They all look like next-token prediction from the outside. They do not look the same from inside the residual stream.


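From outside it all looks like continuation; inside it is not the same kind of work

From the outside, every prompt in this project looks much the same: the model reads a prefix and predicts the next token.

Look inside the model, though, and that similarity turns out to be misleading.

The prompt categories in src/datasets.py force the network to solve quite different internal problems. That is why they light up different Intelliton modes even though every one of them is presented as ordinary text continuation.

The key point is:

The output interface is always next-token prediction, but the computation the model has to carry out internally to reach that output can be very different.

The project uses five kinds of prompt

The project's prompts fall into five main categories:

pronoun tracking
factual recall
logical reasoning
arithmetic
syntactic agreement

Each category puts pressure on a different part of the model's internals.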

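Pronoun tracking: who does “she” refer to in this sentence?

Typical examples include:

“Alice gave Bob a book. He thanked her for …”
“The teacher asked the student a question. She answered …”

These prompts are hard because the model has to keep several candidate entities alive at once and then decide which one the next pronoun should point to.

This means the model has to track:

entity identity
gender and number cues
discourse role
which antecedent is currently most active

So pronoun-tracking tasks readily light up reference-sensitive modes. The model is not just picking a word; it is doing a full round of discourse bookkeeping.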

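Factual recall: pull a stable answer out of memory

Typical examples include:

“The capital of France is …”
“The chemical formula for water is …”

This differs from the pronoun task because a very strong candidate answer usually already exists, and the model's job is mostly to retrieve it from long-term memory and hold it steady.

The core work is not weighing several candidates within the sentence against each other, but retrieving and consolidating one high-confidence continuation.

This is also why factual recall tends to be more stable under small perturbations. A mapping like “France -> Paris” is usually not carried by one fragile channel; several redundant internal routes support it together.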

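Logical reasoning: work the premises into a conclusion first, then emit the word

Typical examples include:

“If all dogs are animals, and all animals are living things, then all dogs are …”
“If A is taller than B, and B is taller than C, then A is …”

These prompts require the model to combine several premises before it outputs the next token.

So the network needs more than lexical memory. It also needs an internal state that can hold the rules, relations, and target conclusion together until the answer actually lands.

This is also why logical reasoning often lights up a strong global backbone mode together with one or two higher-complexity mixing modes.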

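Arithmetic: build the answer slot first, then put the value in

Typical examples include:

“What is 7 + 8? The answer is …”
“What is 100 divided by 5? The answer is …”

Arithmetic shares one feature with logical reasoning: the answer is not a high-frequency word that can be emitted straight from corpus statistics, so the model usually has to turn the prefix into a more structured internal state first.

This usually involves two kinds of work:

building or stabilising an internal state that carries the answer
carrying out a small symbolic or numerical transformation

So arithmetic prompts often share some modes with logic prompts while keeping task preferences of their own.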

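Syntactic agreement: keep the sentence grammatical

Typical examples include:

“The group of students were studying hard. Each of them was …”
“Not only the teacher but also the students were excited about the …”

The difficulty of these prompts lies mainly neither in world knowledge nor in arithmetic, but in syntactic structure itself:

what the true syntactic head is
how number agreement is maintained
which word form or continuation should come next

Syntactic-agreement tasks therefore usually rely on a fairly broad continuation scaffold plus a signal more concerned with local structural correction.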

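Why modes that are all low-momentum can still divide the labour

An easy mistake to make: if several species all sit near low momentum, are they not doing the same thing?

They are not.

Low momentum only says that they are broad patterns at the scale of the sequence rather than small ripples tied to a single token. Even so, two low-momentum modes can differ completely in at least three respects:

they can point in different hidden-channel directions
they can have different amplitude and causal strength
they can propagate differently across layers

So two modes both being “global” does not mean that the work they do internally is the same.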

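A practical reading guide

If you want to read the mapping between task type and mode activation quickly, use this short checklist.

If pronoun prompts are especially sensitive to a mode, first ask whether it is helping the model select an antecedent.
If arithmetic and logical reasoning both light up a mode, first ask whether it is building an abstract answer state rather than just recalling a fixed phrase.
If factual recall stays stable under perturbation, first ask whether the knowledge is distributed across several redundant routes.
If syntactic tasks shift while global meaning does not, first ask whether the mode is constraining grammatical form rather than semantic fact.

This is where the Intelliton framework earns its keep: it turns task categories into testable hypotheses about internal computational roles.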

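The shortest summary

Different prompts light up different Intellitons because they demand different hidden work from the model.

pronoun tracking needs discourse binding
factual recall needs stable memory retrieval
logical reasoning needs relation composition
arithmetic needs symbolic transformation
syntactic agreement needs grammatical control

From the outside, they all look like next-token prediction. From inside the residual stream, they look nothing like the same computation.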


Hallucination as Internal Instability: An Intelliton Perspective

2026-04-04T00:00:00+00:00

Beyond “wrong output”

When a language model hallucinates, the surface-level observation is simple: it produces text that is incorrect, unsupported, or fabricated. But this description raises a deeper question: what is happening inside the model when it hallucinates?

One common intuition is that hallucination is random — a kind of noise or statistical accident in the token prediction process. Another is that it reflects gaps in training data. Both of these accounts may be partially right, but they are not mechanistic: they do not tell us where in the model the failure originates, or whether it corresponds to a detectable internal signal.

The Intelliton framework offers a different angle. Instead of treating hallucination as a property of the output, it treats it as a property of the internal dynamical trajectory during generation.

The central hypothesis is this:

Hallucination may correspond to a regime of weaker, less coherent, and more fragmented Intelliton activity — a trajectory that stays farther from the “grounded sector” of the model’s internal quasi-particle space.

This article explains the evidence for that hypothesis, focusing on Qwen3-4B-Base as the primary example.

How hallucination is studied in the Intelliton framework

The module src/hallucination_diagnostic.py compares two types of prompts:

Grounded prompts: questions that have factual, verifiable answers the model has likely encountered in training (for example, “What is the capital of France?”).
Hallucination-prone prompts: questions designed to invite confabulation — factoid-sounding questions about obscure, ambiguous, or partially fabricated information that the model is likely to “fill in” plausibly but incorrectly.

For each prompt type, the analysis computes several internal metrics:

Metric	What it measures
Singular-value spectrum divergence	How different the activation modes are from the grounded baseline
Coherence	How concentrated and stable the dominant singular modes are
Mode stability	Whether the dominant species remains consistent across generation steps
Entropy gap	How spread out the energy is across modes
Critical layers	Which layers show the largest divergence from grounded behaviour

These metrics are computed during generation — step by step, as the model produces each new token — not just at the final output.
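
As an illustration of what a “singular-value spectrum divergence” could look like, the sketch below compares the normalised energy spectra of two activation matrices with the Jensen-Shannon divergence. The actual definition used in src/hallucination_diagnostic.py is not given in this article, so the choice of divergence, the truncation to k modes, and the function names are all assumptions, and the data are synthetic.

```python
import numpy as np

def normalised_spectrum(acts: np.ndarray, k: int = 8) -> np.ndarray:
    """Share of energy carried by each of the top-k singular modes."""
    s = np.linalg.svd(acts, compute_uv=False)[:k]
    energy = s ** 2
    return energy / energy.sum()

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence (natural log) between two strictly positive spectra."""
    def kl(a, b):
        return float((a * np.log(a / b)).sum())
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(4)
n, d = 200, 32
u = rng.normal(size=d)
u /= np.linalg.norm(u)
# Toy "grounded-like" activations: one dominant mode plus noise.
grounded_like = 10.0 * rng.normal(size=(n, 1)) * u + rng.normal(size=(n, d))
# Toy "diffuse" activations: no dominant mode.
diffuse_like = rng.normal(size=(n, d))

p = normalised_spectrum(grounded_like)
q = normalised_spectrum(diffuse_like)
divergence = js_divergence(p, q)
print(f"top-mode share: {p[0]:.2f} vs {q[0]:.2f}, JS divergence: {divergence:.3f}")
```

The Jensen-Shannon form is symmetric and bounded by ln 2, which makes values comparable across layers; a project-specific metric may of course be scaled differently.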

Hallucination diagnostics for Qwen3-4B-Base. The figure compares spectral signatures between grounded and hallucination-prone prompts.

The trajectory evidence

The generation-time trajectory data provides the clearest picture. The file intelliton_trajectory_summary.csv for Qwen3-4B-Base records, for each generation step and each prompt type, the mean mode activation shift and the grounded deviation.

Grounded prompts: rising and coherent

For grounded factual prompts, the mean mode activation shift starts around 1.10 and rises to about 1.37 over the first 8 generation steps. Top species occupation also rises, from about 70.1% to 74.1%.

This means that as the model commits to a factual answer, the dominant Intelliton sector becomes stronger and more organised. The model is moving toward a more concentrated, coherent internal state.

Hallucination-prone prompts: weak and diverging

For hallucination-prone prompts, the picture is strikingly different.

The mean mode activation shift stays at only 0.32-0.41 throughout generation — roughly one third of the grounded value. The grounded deviation remains strongly negative (roughly -9 to -11 across most generation steps).

In plain language: hallucination-prone generation produces an internal trajectory that is both weaker in overall activation and farther from the grounded sector of Intelliton space.

The hallucination case is not just “wrong output at the end”. It is a persistently different internal state throughout the generation process.

Style prompts: the intermediate case

Stylistic continuation prompts — prompts asking the model to continue a piece of creative writing without strong factual constraints — occupy an intermediate position. Their activation shift is higher than hallucination-prone prompts but lower than grounded factual prompts.

This is a meaningful calibration check: style generation is not simply failure, but it is also not anchored to factual grounding. The Intelliton metric places it appropriately between the two extremes.

Generation-time Intelliton trajectories for Qwen3-4B-Base. Grounded (top), style (middle), and hallucination-prone (bottom) prompts show qualitatively different internal dynamical profiles.

Transition graphs: which species dominate stable generation

The Intelliton transition graph shows which species transitions are most common during generation, and how strong those transitions are in terms of mode activation.

For grounded prompts in Qwen3-4B-Base, the dominant self-transitions are:

Transition	Count	Mean target activation shift
`I_5 → I_5`	110	(strong)
`I_1 → I_1`	13	(moderate)
`I_2 → I_2`	6	(moderate)

The generation stays largely within I_5 — the factual-recall species — with occasional excursions into I_1 (logical reasoning) and I_2 (arithmetic).

For hallucination-prone prompts, I_5 → I_5 remains the most common transition, but its mean target activation shift is much smaller. There is also more mixing among I_1, I_3, and I_5, suggesting that the internal trajectory becomes more fragmented and less dominated by any single species.
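
A transition table of this kind can be built from a sequence of dominant-species labels, one per generation step. The sketch below uses a made-up label sequence and hypothetical function names; it shows the counting, not the project's own code or data.

```python
from collections import Counter

def transition_counts(species_per_step):
    """Count (source, target) pairs between consecutive generation steps."""
    return Counter(zip(species_per_step, species_per_step[1:]))

def self_loop_fraction(counts):
    """Share of all transitions that stay within the same species."""
    total = sum(counts.values())
    loops = sum(n for (src, dst), n in counts.items() if src == dst)
    return loops / total if total else 0.0

# Hypothetical dominant-species labels for one generated sequence.
steps = ["I_5", "I_5", "I_5", "I_1", "I_1", "I_5", "I_5"]
counts = transition_counts(steps)
frac = self_loop_fraction(counts)

for (src, dst), n in counts.most_common():
    print(f"{src} -> {dst}: {n}")
print(f"self-loop fraction: {frac:.2f}")
```

A high self-loop fraction corresponds to the “strong self-loops” pattern described for grounded prompts; a lower one corresponds to the more mixed pattern described for hallucination-prone prompts.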

Species transition graph for Qwen3-4B-Base. Grounded generation is dominated by strong self-loops; hallucination shows a weaker, more mixed pattern.

A multiple-failure-mode picture of hallucination

One of the most useful aspects of the Intelliton framework is that it suggests hallucination is not necessarily a single phenomenon. The trajectory data is consistent with at least four distinct failure modes:

Grounded excitation decay: the leading Intelliton for factual tasks (here I_5) fails to maintain its activation, causing the model to “lose grip” on the factual sector.
Species fragmentation: instead of a dominant self-loop, the trajectory becomes a mixture of several species without a clear attractor.
Spectral broadening: the singular-value spectrum spreads out, indicating a loss of coherence in the dominant collective modes.
Distance from grounded baseline: the trajectory drifts away from the region of Intelliton space that characterises correct factual generation.

Whether a given hallucination episode involves one or all four of these failure modes may depend on the specific model and the type of hallucination. But the framework gives us language and metrics for distinguishing them.

Implications and future directions

If this picture holds up under further investigation, it opens several practical possibilities.

Early warning signals

Since the deviation from grounded Intelliton trajectories is detectable at the very first generation steps, it is in principle possible to flag potential hallucinations before the full output is produced. This could be the basis for hallucination early warning systems.
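
A very simple form of such a warning is a threshold on the mean activation shift over the first few steps. The sketch below is illustrative only: the threshold of 0.75 and the per-step values are hypothetical, chosen to sit between the grounded range (about 1.10 to 1.37) and the hallucination-prone range (0.32 to 0.41) quoted earlier, and are not taken from intelliton_trajectory_summary.csv.

```python
def early_warning(shifts, n_steps=3, threshold=0.75):
    """Flag a generation whose mean activation shift over the first n_steps is low."""
    window = shifts[:n_steps]
    if not window:
        raise ValueError("need at least one generation step")
    return sum(window) / len(window) < threshold

# Hypothetical per-step mean mode activation shifts.
grounded_like = [1.10, 1.15, 1.21, 1.30, 1.37]   # rising, within the grounded range
prone_like = [0.32, 0.35, 0.41, 0.38, 0.36]      # stays within the 0.32-0.41 band

print(early_warning(grounded_like))
print(early_warning(prone_like))
```

Whether a fixed threshold generalises beyond designed prompt sets is exactly the kind of question the caveats below leave open.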

Intervention on unstable species

If a particular species is identified as responsible for grounded, factual generation, it may be possible to stabilise or amplify that species during inference using model steering techniques. The codebase already includes modules such as src/gauge_intervention.py that hint at this direction.
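
One generic way to amplify a mode at inference time is to rescale the component of each hidden state along that mode's direction. The sketch below shows the arithmetic on toy data. It is not the interface of src/gauge_intervention.py, which is not described here; the function name, shapes, and the scale factor are hypothetical.

```python
import numpy as np

def amplify_direction(h: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Scale the component of each row of h along v by alpha, leaving the rest unchanged.

    h: (n_tokens, d_model) hidden states; v: (d_model,) mode direction.
    alpha > 1 amplifies, 0 <= alpha < 1 damps, alpha = 0 removes the component.
    """
    v_hat = v / np.linalg.norm(v)
    coeff = h @ v_hat
    return h + (alpha - 1.0) * np.outer(coeff, v_hat)

rng = np.random.default_rng(5)
h = rng.normal(size=(6, 12))   # toy hidden states
v = rng.normal(size=12)        # toy direction standing in for a grounded species
v_hat = v / np.linalg.norm(v)

h_amp = amplify_direction(h, v, alpha=1.5)
print((h_amp @ v_hat) / (h @ v_hat))  # every ratio equals alpha
```

Setting alpha to 0 recovers the activation-space version of the projection used in Abliteration, which makes clear that stabilising a grounded species and removing a refusal direction are the same operation with different scale factors.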

Prompt strategies for grounded generation

If certain prompts consistently lead to strong grounded Intelliton trajectories, understanding their structure could inform better prompting strategies — ways to keep the model inside the grounded sector of its internal space.

Model comparison by internal stability

The Intelliton hallucination metric provides a new axis for comparing models — not just by accuracy on a benchmark, but by the robustness and coherence of their internal factual-grounding sector. A model with a stronger, more stable I_5-like species may be inherently more reliable for factual tasks.

Caveats and open questions

As with all findings in this project, several important caveats apply.

The hallucination-prone prompts are designed, not naturally occurring. The distinction between “grounded” and “hallucination-prone” is imposed by the prompt design. In real-world use, the boundary is less clear.

Correlation is not causation. The Intelliton trajectory differences are associated with hallucination-prone prompts, but it has not yet been established that fixing the trajectory would prevent hallucination.

The pipeline has design choices. Different prompt sets, sequence lengths, and analysis parameters would produce different catalogues and possibly different conclusions.

The metric is relative, not absolute. The “grounded deviation” is measured relative to a baseline grounded trajectory. Its meaning depends on the quality of that baseline.

Despite these caveats, the structural pattern — grounded generation being internally stronger, more coherent, and closer to a well-defined attractor — is consistent across all generation steps and both task splits examined in Qwen3-4B-Base.

Where this series has taken us

This final article completes the four-part popular science series on Intellitons:

What Are Intellitons? — The quasi-particle idea and why it might apply to transformers.
Inside Qwen3-4B-Base — A detailed walkthrough of a model’s complete Intelliton catalogue.
Scaling and Alignment — How parameter count and instruction tuning reshape the internal excitation spectrum.
Hallucination as Internal Instability (this article) — Hallucination as a detectable internal dynamical regime.

The Intelliton framework is a young and exploratory research programme. But its outputs are concrete, its comparisons are reproducible, and its language is — at least arguably — more informative than treating language models as opaque statistical engines.

The goal is to develop a vocabulary that makes the internal life of neural networks legible. Whether Intellitons are ultimately the right vocabulary remains to be seen. But the evidence so far suggests they are pointing at something real.

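Beyond “wrong output”

When a language model hallucinates, the surface observation is simple: it generates text that is wrong, unsupported, or outright fabricated. But that description raises a deeper question: what is actually happening inside the model when it hallucinates?

One common intuition is that hallucination is random noise, a statistical accident in the token-prediction process; another holds that it mainly reflects gaps in the training data. Both may be partly right, but neither is mechanistic: they do not tell us where in the model the failure originates, or whether it corresponds to some detectable internal signal.

The Intelliton framework offers a different angle. Rather than treating hallucination as a property of the output, it treats it as a property of the internal dynamical trajectory during generation.

The central hypothesis can be summarised as:

Hallucination may correspond to a regime of weaker, less coherent, and more fragmented Intelliton activity, that is, an internal quasi-particle trajectory that stays farther from the model's “grounded sector”.

This article explains the evidence for that hypothesis, using Qwen3-4B-Base as the example.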

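How hallucination is studied in the Intelliton framework

The module src/hallucination_diagnostic.py compares two types of prompts:

Grounded prompts: questions whose answers are factual and verifiable and that the model has very likely seen in training, for example “What is the capital of France?”
Hallucination-prone prompts: questions designed specifically to invite confabulation, that is, questions that sound like factual Q&A but concern obscure, highly ambiguous, or even partly fabricated content. The model is likely to fill in something that fits the tone and arrive at a wrong answer.

For each prompt type, the analysis computes several internal metrics:

Metric	Meaning
Singular-value spectrum divergence	How different the activation modes are from the grounded baseline
Coherence	How concentrated and stable the dominant singular modes are
Mode stability	Whether the dominant species stays consistent across generation steps
Entropy gap	How widely the energy is spread across modes
Critical layers	Which layers deviate most from grounded behaviour

These quantities are computed step by step during generation, updated each time the model produces a new token, rather than analysed only after the final output.

Hallucination diagnostics for Qwen3-4B-Base, comparing the spectral signatures of grounded and hallucination-prone prompts.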

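Trajectory evidence: the clearest picture

The generation-time trajectory data gives the most direct picture. The file intelliton_trajectory_summary.csv for Qwen3-4B-Base records the mean mode activation shift and the grounded deviation for every generation step and every prompt type.

Grounded prompts: steadily stronger and more coherent

For grounded factual prompts, the mean mode activation shift starts at about 1.10 and rises to about 1.37 over the first 8 generation steps. Top species occupation rises in step, from about 70.1% to 74.1%.

This means that as the model locks onto a factually grounded answer, the dominant Intelliton sector becomes stronger and more organised. The model is converging toward a more concentrated, more coherent internal state.

Hallucination-prone prompts: weaker and persistently off-baseline

For prompts that tend to induce hallucination, the picture is entirely different.

The mean mode activation shift stays between only 0.32 and 0.41 throughout generation, roughly one third of the grounded value. Meanwhile the grounded deviation stays clearly negative, at roughly -9 to -11.

In plain language: hallucination-prone generation corresponds to an internal trajectory that is both weaker in overall activation and farther from the grounded sector.

Importantly, this is not just “the last sentence came out wrong”. From the very start of generation, the internal state already sits persistently in a different dynamical regime.

Style continuation: the middle ground

If the prompt is a stylistic continuation, asking the model to carry on a piece of creative writing rather than give a factual answer, its activation shift falls between grounded and hallucination-prone.

This is a meaningful calibration result: style continuation is not failure, but neither is it anchored by factual grounding. The Intelliton metric places it sensibly between the two extremes.

Generation-time Intelliton trajectories for Qwen3-4B-Base. Grounded (top), style (middle), and hallucination-prone (bottom) prompts show qualitatively different internal dynamical profiles.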

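Transition graphs: which species dominate stable generation

The Intelliton transition graph shows which transitions between species are most common during generation, and how strong the mode activation associated with those transitions is.

For grounded prompts in Qwen3-4B-Base, the dominant self-transitions are:

Transition	Count	Mean target activation shift
`I_5 → I_5`	110	(strong)
`I_1 → I_1`	13	(moderate)
`I_2 → I_2`	6	(moderate)

In other words, generation stays in I_5, the factual-recall species, most of the time, with only occasional excursions toward I_1 (logical reasoning) and I_2 (arithmetic).

For hallucination-prone prompts, I_5 → I_5 is still the most common transition, but its mean target activation shift is much smaller. There is also more mixing among I_1, I_3, and I_5, indicating a more fragmented internal trajectory that lacks a single dominant attractor.

Species transition graph for Qwen3-4B-Base. Grounded generation is dominated by strong self-loops; hallucination is weaker and more mixed.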

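Hallucination may be several failure modes rather than one

One of the most valuable things about the Intelliton framework is its suggestion that hallucination is not necessarily a single phenomenon. The trajectory data is consistent with at least four distinct failure modes:

Grounded excitation decay: the leading Intelliton for factual tasks (here I_5) fails to maintain its activation, so the model “loses its grip” on the factual sector;
Species fragmentation: there is no longer a strongly dominant self-loop, and the trajectory becomes a mixture of several species without a clear attractor;
Spectral broadening: the singular-value spectrum spreads out, meaning the dominant collective modes lose coherence;
Deviation from the grounded baseline: the trajectory keeps drifting away from the region of Intelliton space associated with correct factual generation.

Whether a particular hallucination episode involves one of these or all of them may depend on the model and on the type of hallucination. At the least, the framework provides language and metrics for telling these cases apart.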

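Implications and future directions

If this picture holds up under further study, it opens several practical possibilities.

Early warning signals

Since the deviation from grounded Intelliton trajectories can be detected in the first few generation steps, it is in principle possible to flag potential hallucinations before the full output has formed. This could be the basis for a hallucination early warning system.

Intervention on unstable species

If a particular species is identified as the main carrier of grounded factual generation, it may be possible to stabilise or amplify it with inference-time model steering techniques. Modules in the codebase such as src/gauge_intervention.py already hint at this direction.

Prompt strategies for grounded generation

If certain prompt structures more readily produce strong grounded Intelliton trajectories, understanding those structures could help design better prompting strategies that keep the model inside the grounded sector as much as possible.

Comparing models by internal stability

The Intelliton hallucination metric provides a new axis for comparing models. Beyond benchmark scores, we can compare the robustness and coherence of a model's internal factual-grounding sector. A model with a stronger, more stable I_5-like species may be inherently better suited to factual tasks.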

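Caveats and open questions

As with the other results in this project, several important qualifications apply.

The hallucination-prone prompts are designed, not naturally collected. The distinction between “grounded” and “hallucination-prone” is imposed through prompt design. In real-world use the boundary will not be this clear.

Correlation is not causation. The Intelliton trajectory differences are associated with hallucination-prone prompts, but it cannot yet be said that repairing the trajectory would prevent hallucination.

The analysis pipeline embodies design choices. Different prompt sets, sequence lengths, and analysis parameters could yield different catalogues and possibly different conclusions.

The metric is relative, not absolute. The “grounded deviation” is defined relative to a grounded baseline, and its meaning depends on the quality of that baseline.

Even so, one structural pattern is clear: grounded generation is internally stronger, more coherent, and closer to a well-defined attractor. This pattern appears consistently across all generation steps and both task splits in Qwen3-4B-Base.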

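Where this series has taken us

This article closes the four-part popular science series on Intellitons:

What Are Intellitons?: introduces the quasi-particle idea and why it might apply to transformers;
Inside Qwen3-4B-Base: a full walkthrough of one model's Intelliton catalogue;
Scaling and Alignment: how parameter count and instruction tuning reshape the internal excitation spectrum;
Hallucination as Internal Instability (this article): hallucination as a detectable internal dynamical regime.

The Intelliton framework is still young and exploratory. But its outputs are concrete, its comparisons are reproducible, and the language it offers is, at least for now, more explanatory than treating language models as black-box statistical machines.

The goal of the project is to develop a vocabulary that makes the internal life of neural networks legible. Whether Intellitons turn out to be that vocabulary will take time to establish. But the evidence so far at least suggests that they point at structure that is really there.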

Scaling and Alignment Through the Intelliton Lens

2026-04-03T00:00:00+00:00

Two of the biggest questions in LLM science

Two of the most discussed phenomena in large language model research are scaling and alignment. Scaling means training bigger models; alignment (here: instruction tuning) means fine-tuning a model to follow instructions and behave helpfully.

Both interventions are known to improve benchmark performance. But do they change the internal structure of the model? Do they reshape the quasi-particle spectrum?

The Intelliton analysis of five models — Qwen3-4B-Base, Qwen3-4B, Qwen3-8B-Base, Qwen3-8B, and Mistral-7B-v0.3 — provides a concrete, data-driven answer.

Base versus Instruct: what alignment does

The clearest comparison is between a base model and its instruction-tuned counterpart in the same family.

Qwen3-4B-Base vs. Qwen3-4B

Property	Qwen3-4B-Base	Qwen3-4B
Number of species	6	5
`I_0` amplitude	6167.1	6562.4
Dominant momentum	k ≈ π	k ≈ 1.885
Secondary momenta	k ≈ 0	k ≈ 1.885 (shared)
Fixed-point types	IR + crossover	all crossover
Grounded profile mean	32.68	29.13

The most striking difference is in momentum structure.

In Qwen3-4B-Base, the dominant mode I_0 peaks at k ≈ π (high frequency, alternating pattern), while the five secondary modes all peak near k ≈ 0 (low frequency, global pattern). There is a clean split.

In Qwen3-4B (the instruction-tuned version), the dominant mode shifts to k ≈ 1.885, and the secondary modes also cluster around the same momentum. The model becomes more homogeneous in its momentum structure. The clean split between backbone and task-specific modes disappears.

The fixed-point type change is equally telling. In the base model, five of the six species are labelled IR (settled, stable), while I_0 is crossover (still transitioning). In the instruct model, all species are labelled crossover. Alignment appears to push the model’s collective modes into a more uniformly active, less settled dynamical regime.

One possible interpretation:

Instruction tuning compresses or reorganises the internal excitation landscape into a more uniform effective regime. The spectral diversity that exists in the base model is partially smoothed out, and more modes are kept in a dynamically transitional state.

This does not mean instruction tuning is worse. It may mean the model’s degrees of freedom are being regularised toward instruction-following behaviour, possibly at the cost of some internal differentiation.

The Intelliton catalogue for Qwen3-4B (instruction-tuned). Compare with Qwen3-4B-Base to see the homogenisation of momentum structure.

Scaling from 4B to 8B: what more parameters do

The next comparison holds model family (Qwen3) and training type (base) constant and varies the parameter count.

Qwen3-4B-Base vs. Qwen3-8B-Base

Property	Qwen3-4B-Base	Qwen3-8B-Base
Number of species	6	7
`I_0` amplitude	6167.1	7908.4
Dominant momentum	k ≈ π	k ≈ π
Grounded profile mean	32.68	60.26
Grounded-hallucination separation	moderate	larger

The 8B model has one more species. Its leading mode I_0 is 28% stronger in amplitude. Most strikingly, the grounded generation profile mean nearly doubles (32.68 → 60.26).
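
The two headline figures can be checked directly from the table values. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
# I_0 amplitude and grounded profile mean, copied from the table above.
amp_4b, amp_8b = 6167.1, 7908.4
mean_4b, mean_8b = 32.68, 60.26

amp_gain = amp_8b / amp_4b - 1    # relative increase in I_0 amplitude
mean_ratio = mean_8b / mean_4b    # ratio of grounded profile means

print(f"I_0 amplitude gain: {amp_gain:.1%}")
print(f"grounded mean ratio: {mean_ratio:.2f}x")
```

The amplitude gain comes out at about 28%, and the grounded mean ratio at about 1.84, which is what “nearly doubles” refers to.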

In Intelliton terms, scaling up does not simply add more parameters uniformly. It appears to amplify the dominant dynamical sectors of the model, making the leading quasi-particle modes substantially stronger and the grounded generation trajectory more sharply defined.

The momentum structure is also richer in the 8B model. While the leading mode still peaks at k ≈ π, many of the secondary species cluster around k ≈ 1.885 rather than strictly k ≈ 0. This suggests that intermediate-scale spatial organisation becomes more visible as the model grows.

The Intelliton catalogue for Qwen3-8B-Base. The leading mode is stronger, and the species set is slightly larger compared with Qwen3-4B-Base.

Scaling plus alignment: Qwen3-8B

When both scaling and instruction tuning are applied — Qwen3-8B — the results follow the pattern suggested by the two effects separately.

Property	Qwen3-8B-Base	Qwen3-8B
Number of species	7	6
`I_0` amplitude	7908.4	~7600
Grounded profile mean	60.26	57.14
Fixed-point types	mixed	more crossover

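Instruction tuning at 8B scale slightly reduces the species count (7 → 6) and the grounded profile mean (60.26 → 57.14), matching the direction of the changes seen at 4B scale (6 → 5 and 32.68 → 29.13). The dominant mode amplitude also falls slightly at 8B (7908.4 → ~7600), whereas at 4B it rose (6167.1 → 6562.4), so the effect of instruction tuning on I_0 amplitude is not consistent across the two scales. Even so, the 8B instruct model remains far stronger than the 4B models in its grounded trajectory profile.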

The combined takeaway is clean:

Scaling to 8B increases the strength of dominant collective modes and sharpens the grounded generation signal.
Instruction tuning slightly compresses or regularises that internal structure, reducing species count and reducing the grounded-hallucination separation.

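These two effects appear to be largely independent and roughly additive in their impact on species count and grounded profile mean; the I_0 amplitude does not follow the same pattern, since instruction tuning raises it at 4B and lowers it at 8B.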

The Intelliton catalogue for Qwen3-8B (instruction-tuned 8B model).

A completely different family: Mistral-7B-v0.3

The most dramatic contrast in the entire comparison set comes from Mistral-7B-v0.3, a 7B model from a different architecture family.

Property	Qwen3-4B-Base	Mistral-7B-v0.3
Number of species	6	25
`I_0` amplitude	6167.1	249.4
Grounded profile mean	32.68	21.48
Grounded profile std	moderate	46.56 (very large)
Fixed-point types	IR + crossover	UV + IR + crossover

Under the same analysis pipeline, Mistral produces 25 species, more than three times as many as any Qwen model (which have between 5 and 7). This is a striking result.

The leading species I_0 in Mistral has an amplitude of only 249.4, compared with 6167 in Qwen3-4B-Base. In other words, the Mistral model does not have a strongly dominant backbone excitation. Its collective mode landscape is much more fragmented: many modes of comparable strength, rather than one overwhelming mode with several weak followers.

Mistral also shows UV-labelled species — modes that remain in a fine-grained, ultraviolet-like dynamical state throughout the network, rather than flowing toward infrared stability. This suggests a more persistent fine-grained structure in Mistral’s layers compared with Qwen.

The generation dynamics also differ. The grounded profile mean for Mistral (21.48) is lower than for all Qwen models, and the standard deviation (46.56) is much larger. The Intelliton metric, calibrated using Qwen, describes Mistral’s trajectory space as much noisier and more turbulent.

One interpretation:

Qwen organises its internal computation around a few very strong collective modes. Mistral spreads the load more broadly across many smaller modes. Under this analysis, they have genuinely different “particle physics” inside.

Whether this difference reflects architectural choices, training data, training procedure, or some combination is an open question. But the Intelliton framework makes the difference visible and measurable.

The Intelliton catalogue for Mistral-7B-v0.3. Twenty-five species, a far more fragmented landscape than any Qwen model.

A summary table

Model	Species	`I_0` Amplitude	Momentum structure	Fixed-point types	Grounded mean
Qwen3-4B-Base	6	6167	π + 0 (split)	IR + crossover	32.68
Qwen3-4B	5	6562	1.885 (homogeneous)	all crossover	29.13
Qwen3-8B-Base	7	7908	π + 1.885 (richer)	mixed	60.26
Qwen3-8B	6	~7600	1.885 (homogeneous)	more crossover	57.14
Mistral-7B-v0.3	25	249	fragmented	UV + IR + crossover	21.48
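
A few derived quantities make the table easier to read. The sketch below copies the species counts and grounded means from the table into a plain dictionary (the layout is illustrative, not a project file format) and computes the Mistral-to-Qwen species ratio and the instruct-minus-base change in grounded mean at each scale.

```python
# Values copied from the summary table above.
models = {
    "Qwen3-4B-Base":   {"species": 6,  "grounded_mean": 32.68},
    "Qwen3-4B":        {"species": 5,  "grounded_mean": 29.13},
    "Qwen3-8B-Base":   {"species": 7,  "grounded_mean": 60.26},
    "Qwen3-8B":        {"species": 6,  "grounded_mean": 57.14},
    "Mistral-7B-v0.3": {"species": 25, "grounded_mean": 21.48},
}

qwen_species = [m["species"] for name, m in models.items() if name.startswith("Qwen")]
mistral_species = models["Mistral-7B-v0.3"]["species"]
min_ratio = mistral_species / max(qwen_species)   # against the largest Qwen catalogue
max_ratio = mistral_species / min(qwen_species)   # against the smallest Qwen catalogue

# Change in grounded mean caused by instruction tuning, at each scale.
delta_4b = models["Qwen3-4B"]["grounded_mean"] - models["Qwen3-4B-Base"]["grounded_mean"]
delta_8b = models["Qwen3-8B"]["grounded_mean"] - models["Qwen3-8B-Base"]["grounded_mean"]

print(f"Mistral / Qwen species ratio: {min_ratio:.2f} to {max_ratio:.2f}")
print(f"instruct - base grounded mean: 4B {delta_4b:+.2f}, 8B {delta_8b:+.2f}")
```

The similar size of the two deltas (about -3.5 and -3.1) is what supports reading the instruction-tuning effect on the grounded mean as roughly additive across scales.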

Conclusions

The Intelliton comparison across these five models yields several clear empirical regularities:

Instruction tuning homogenises the momentum structure and shifts more species into crossover regimes, reducing internal spectral diversity.
Scaling from 4B to 8B strengthens the dominant dynamical sectors, producing a more strongly occupied and more sharply separated Intelliton landscape.
Different model families can have qualitatively different internal spectra — Qwen is dominated by a few strong modes, Mistral is more fragmented and UV-rich.
The Intelliton framework provides a vocabulary for these differences that goes beyond benchmark accuracy or parameter count alone.

The next article turns to one of the most practically important applications of this framework: using the Intelliton lens to study hallucination — and asking whether internal spectral instability can be a diagnostic signal for when a model is about to confabulate.

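Two of the biggest questions in LLM science

Two of the most discussed phenomena in large language model research are scaling and alignment. Scaling means training larger models; alignment here mainly means instruction tuning, that is, fine-tuning a model to follow instructions better and behave more like a “helpful assistant”.

Both interventions are known to improve benchmark performance. But do they change the model's internal structure? Do they rewrite its quasi-particle spectrum?

The Intelliton analysis of five models (Qwen3-4B-Base, Qwen3-4B, Qwen3-8B-Base, Qwen3-8B, and Mistral-7B-v0.3) gives a concrete, data-driven answer.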

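Base vs. Instruct: what alignment actually does

The most direct comparison places a base model next to its instruction-tuned counterpart from the same family.

Qwen3-4B-Base vs. Qwen3-4B

Property	Qwen3-4B-Base	Qwen3-4B
Species count	6	5
`I_0` amplitude	6167.1	6562.4
Dominant momentum	k ≈ π	k ≈ 1.885
Secondary momentum	k ≈ 0	k ≈ 1.885 (shared)
Fixed-point types	IR + crossover	all crossover
Grounded trajectory mean	32.68	29.13

The most striking difference is in the momentum structure.

In Qwen3-4B-Base, the dominant mode I_0 sits at k ≈ π, a high-frequency alternating pattern, while the five secondary modes all sit at k ≈ 0, a low-frequency, more global pattern. The split between the two is clean.

In the instruction-tuned Qwen3-4B, the dominant mode moves to k ≈ 1.885 and the secondary modes cluster around the same momentum. The momentum structure becomes more homogeneous, and the clean split between the backbone mode and the task modes disappears.

The change in fixed-point types is just as telling. In the base model, five of the six species are IR, meaning they have settled; only I_0 is a crossover, still in transition. In the instruct model, every species is a crossover. This suggests that alignment pushes the model’s collective modes into a more uniform, more active, and less fully settled dynamical regime.

One possible interpretation:

Instruction tuning compresses or reorganises the internal excitation landscape into a more uniform effective regime. Part of the spectral diversity present in the base model is flattened, and more modes are held in a dynamic transitional state.

This does not mean instruction tuning is worse. A more reasonable reading is that the model’s degrees of freedom are regularised toward instruction-following behaviour, possibly at the cost of some distinctness in the internal structure.

The Intelliton catalogue for Qwen3-4B (instruction-tuned). Compared with Qwen3-4B-Base, the homogenisation of the momentum structure is clearly visible.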

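From 4B to 8B: what more parameters bring

The next comparison holds the model family (Qwen3) and the training type (base) fixed and changes only the parameter count.

Qwen3-4B-Base vs. Qwen3-8B-Base

Property	Qwen3-4B-Base	Qwen3-8B-Base
Species count	6	7
`I_0` amplitude	6167.1	7908.4
Dominant momentum	k ≈ π	k ≈ π
Grounded trajectory mean	32.68	60.26
Grounded vs. hallucination separation	moderate	larger

The 8B model gains one species, and the amplitude of its dominant mode I_0 rises by 28%. More notably, the mean of the grounded generation trajectory nearly doubles (32.68 → 60.26, a factor of about 1.8).

In Intelliton terms, scaling does not simply add parameters “uniformly”. It looks more like amplifying the dominant dynamical sectors: the leading quasi-particle modes grow stronger, and the grounded generation trajectory becomes sharper and more stable.

The 8B model’s momentum structure is also richer. The dominant mode still sits at k ≈ π, but many secondary species cluster at k ≈ 1.885 rather than staying pinned at k ≈ 0. This hints that intermediate-scale spatial organisation starts to emerge more clearly as the model grows.

The Intelliton catalogue for Qwen3-8B-Base. Compared with Qwen3-4B-Base, the leading mode is stronger and the species set is slightly larger.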

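Scaling plus alignment: Qwen3-8B

When scaling and instruction tuning are applied together, as in Qwen3-8B, the result largely continues the separate trends of the two effects.

Property	Qwen3-8B-Base	Qwen3-8B
Species count	7	6
`I_0` amplitude	7908.4	~7600
Grounded trajectory mean	60.26	57.14
Fixed-point types	mixed	more crossover

At the 8B scale, instruction tuning slightly lowers the species count (7 → 6) and slightly reduces both the dominant-mode amplitude and the grounded trajectory mean, consistent with the homogenisation trend seen at 4B (where the species count and grounded mean also fell, although the I_0 amplitude rose from 6167.1 to 6562.4). Even so, the 8B instruct model’s grounded trajectory remains clearly stronger than that of every 4B model.

Taken together, the picture is clear:

Scaling to 8B strengthens the dominant collective modes and sharpens the grounded generation signal;
Instruction tuning slightly compresses or regularises this internal structure, reduces the species count, and narrows the gap between grounded and hallucination trajectories.

On the Intelliton spectrum, the two effects appear roughly independent and approximately additive.

The Intelliton catalogue for Qwen3-8B (the 8B instruction-tuned model).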

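A completely different family: Mistral-7B-v0.3

The most dramatic difference in the whole comparison comes from Mistral-7B-v0.3, a 7B model from an entirely different architecture family.

Property	Qwen3-4B-Base	Mistral-7B-v0.3
Species count	6	25
`I_0` amplitude	6167.1	249.4
Grounded trajectory mean	32.68	21.48
Grounded trajectory standard deviation	moderate	46.56 (very large)
Fixed-point types	IR + crossover	UV + IR + crossover

Under the same analysis pipeline, Mistral produces 25 species: more than four times as many as Qwen3-4B-Base (6), and more than three times as many as any Qwen model (the largest Qwen count is 7, for Qwen3-8B-Base). This is a striking result.

Mistral’s leading species I_0 has an amplitude of only 249.4, against 6167.1 for Qwen3-4B-Base. In other words, Mistral has no overwhelmingly dominant backbone excitation. Its collective-mode landscape is more fragmented: many modes of comparable strength, rather than one very strong mode followed by a few weak tails.

Mistral also shows UV-type species, that is, modes that stay in a fine-grained, ultraviolet-like dynamical state throughout the network rather than flowing toward infrared stability. This suggests that, compared with Qwen, Mistral preserves fine-grained within-layer structure for longer.

The generation dynamics differ too. Mistral’s grounded trajectory mean (21.48) is lower than that of every Qwen model, while its standard deviation (46.56) is far larger. Intelliton metrics calibrated on Qwen describe Mistral’s trajectory space as noisier and more turbulent.

One possible summary:

Qwen organises its internal computation around a few very strong collective modes. Mistral spreads the load across many smaller modes. Under this analysis, they really do look as if they have different “particle physics” inside.

Whether this difference comes from architecture, training data, training procedure, or a combination of factors is still an open question. But the Intelliton framework at least makes the difference visible and quantifiable.

The Intelliton catalogue for Mistral-7B-v0.3. Twenty-five species in total, far more fragmented than any Qwen model.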

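Summary table

Model	Species	`I_0` amplitude	Momentum structure	Fixed-point types	Grounded mean
Qwen3-4B-Base	6	6167	π + 0 (split)	IR + crossover	32.68
Qwen3-4B	5	6562	1.885 (homogeneous)	all crossover	29.13
Qwen3-8B-Base	7	7908	π + 1.885 (richer)	mixed	60.26
Qwen3-8B	6	~7600	1.885 (homogeneous)	more crossover	57.14
Mistral-7B-v0.3	25	249	fragmented	UV + IR + crossover	21.48

Conclusions

The Intelliton comparison of these five models gives several fairly clear empirical regularities:

Instruction tuning homogenises the momentum structure and pushes more species into the crossover regime, which lowers internal spectral diversity;
Scaling from 4B to 8B strengthens the dominant dynamical sectors and produces an Intelliton landscape that is more strongly occupied and more clearly separated;
Different model families can have qualitatively very different internal spectra: Qwen is dominated by a few strong modes, while Mistral is more fragmented and more UV-leaning;
The Intelliton framework offers a descriptive language for these differences that goes beyond benchmark scores and parameter counts.

The next article applies the framework to a more direct practical question: hallucination. It asks whether internal spectral instability can serve as a diagnostic signal that a model is about to start “making things up”.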

How to Read `I_0` to `I_4`: A Human Guide to an Intelliton Spectrum

2026-04-02T00:00:00+00:00

Read the report with the right mental model first

The safest way to read an Intelliton spectrum is this:

the species labels I_0, I_1, I_2, and so on are recurring modes, not literal particles,
the physics vocabulary is a compact description of behaviour, not a claim of hidden quantum matter,
what matters most is the role of a mode, not its dramatic name.

In the report discussed here, the main modes are broad sequence-scale patterns rather than tiny token-local ripples. Their masses are relatively light, which means they persist through many layers, and their helicity proxy is fairly stable, which means their directional signature is not being completely scrambled as the network gets deeper.

That already gives a useful picture: these are not one-off flashes. They are reusable internal carriers.

Before the species list, decode the four columns

If a spectrum table feels abstract, reduce it to four plain questions.

Momentum

Momentum asks whether the pattern is smooth across token positions or rapidly oscillating.

low momentum means a broad, global sequence pattern,
high momentum means sharper token-to-token variation.

In the report discussed here, the important species sit close to the low-momentum end, so the best mental image is not a tiny local feature but a large-scale background shape spread across the sequence.

Spin-like score

This is not literal spin. It is better read as internal complexity.

low spin-like score means one dominant internal direction stands out,
high spin-like score means several comparable directions are mixed together.

Mass

Mass tells you how fast a mode fades with depth.

light modes survive many layers,
heavy modes disappear quickly.

So when the report says the species are light, it is really saying they are not shallow noise. They are able to propagate through the stack.

Helicity proxy

Helicity here means a simplified combination of propagation direction and internal orientation.

stable helicity means the mode keeps a recognisable directional signature,
unstable helicity means that signature is getting mixed away.
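
The four readings above can be sketched as a small lookup that turns one row of a spectrum table into plain language. Everything here is hypothetical: the function name, the argument names, and especially the numeric thresholds are placeholders chosen for illustration, since the report does not state cut-off values.

```python
import math

# Placeholder thresholds; assumptions for this sketch, not values from the report.
LOW_MOMENTUM = math.pi / 4   # below this, call the pattern "broad"
LOW_SPIN = 1.5               # below this, call the mode internally simple
LIGHT_MASS = 0.1             # below this decay rate per layer, call the mode light
STABLE_HELICITY = 0.7        # at or above this (0..1 scale), call helicity stable


def describe_mode(momentum, spin_like, mass, helicity_stability):
    """Answer the four plain questions for one row of a spectrum table."""
    return [
        "broad sequence-scale pattern" if momentum < LOW_MOMENTUM
        else "sharp token-to-token variation",
        "one dominant internal direction" if spin_like < LOW_SPIN
        else "several comparable directions mixed",
        "light: survives many layers" if mass < LIGHT_MASS
        else "heavy: fades quickly",
        "stable directional signature" if helicity_stability >= STABLE_HELICITY
        else "directional signature is being mixed away",
    ]


if __name__ == "__main__":
    # Made-up numbers, only to exercise the function.
    for line in describe_mode(momentum=0.1, spin_like=1.1, mass=0.02, helicity_stability=0.9):
        print("I_0:", line)
```

The point of the sketch is the order of the questions, not the cut-offs: any real reading would calibrate the thresholds against the spectrum being examined.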

`I_0`: the default continuation backbone

I_0 is the easiest species to explain because it is both the strongest and the simplest.

In the report, it has the largest amplitude and the lowest spin-like complexity among the leading species. The plain-language reading is:

I_0 behaves like a strong background mode that helps the model keep a sentence moving toward a plausible answer.

It is less like a specific fact and more like a general continuation scaffold. When the prompt is something like “If all dogs are animals…” or “What is 7 + 8?”, I_0 looks like the broad mode that helps open and stabilise the answer slot.

If you want a slogan, I_0 is the model’s “keep the computation on the rails” mode.

`I_1`: the quiet structural support

I_1 is best read as a support mode rather than a flashy decision-maker.

In intervention-style reading, changing I_1 often produces smaller visible output shifts than changing the stronger causal modes. That does not mean it is useless. It usually means it is too infrastructural to show up as an obvious word swap.

The plain-language reading is:

I_1 looks like a structural support mode that helps maintain the shape and stability of the representation while other modes do more task-specific work.

Think of it as scaffolding rather than the headline feature.

`I_2`: a reference-resolution mode with a person-like bias

The most intuitive reading of I_2 comes from pronoun-style prompts such as:

“Alice gave Bob a book. He thanked her for …”

In the report, amplifying I_2 nudges the output toward a more person-centered, masculine pronoun interpretation. That makes I_2 feel less like a generic language mode and more like a reference selection channel.

The plain-language reading is:

I_2 appears to help the model decide which person-like entity the sentence is currently tracking.

That does not make it a literal “male pronoun particle.” It means that, in this probe, the mode is consistently involved when the model has to collapse a messy discourse context into one concrete referent.

`I_3`: a higher-complexity mixing mode

I_3 looks less like a single-purpose button and more like a mixed coordination mode.

Its spin-like complexity is higher, which suggests that it is built from several comparably relevant internal directions rather than one clean axis. That usually happens in prompts where the model must hold multiple constraints in mind at once.

The plain-language reading is:

I_3 behaves like a mode for combining several partial constraints into one workable internal state.

So rather than deciding one token directly, I_3 is better imagined as part of the middle-layer machinery that keeps complex reasoning or structured sentence interpretation coherent.

`I_4`: a complementary reference mode

I_4 looks related to I_2, but with a different directional bias in pronoun-style settings.

In the report, amplifying I_4 can nudge outputs toward forms like “her” rather than a neutral or object-like continuation. The plain-language reading is:

I_4 is another reference-sensitive mode, complementary to I_2, and appears when the model has to settle on a different discourse framing of who is being talked about.

This is useful because it shows that “pronoun tracking” is not a single monolithic skill. The model can separate that work into several nearby but distinct modes.

What the whole spectrum says in one paragraph

Taken together, I_0 to I_4 tell a coherent story.

I_0 is a strong general backbone.
I_1 helps keep the internal state stable.
I_2 and I_4 are more obviously tied to reference selection.
I_3 looks like a higher-complexity mixing mode.

That is why the Intelliton view can be useful. It turns a huge hidden state into a cast of recurring roles.

What not to over-interpret

There are two important cautions.

Species indices are bookkeeping labels. I_2 in one run is not guaranteed to mean exactly the same thing in every future run.
Terms like momentum, spin, and helicity are proxies. They organise evidence, but they are not proof that the network literally contains particle-like objects.

The disciplined reading is: these labels help summarise recurrent activation roles.


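Get the mental model right first, then read the spectrum table

The safest starting point for reading an Intelliton spectrum table is these three statements:

names such as I_0, I_1, and I_2 denote recurring modes, not literal particles,
the physics vocabulary is a compressed description of behaviour, not a claim that quantum matter is hidden in the model,
what matters most is not the name itself but the role each mode plays in the computation.

In the report discussed here, the leading modes look like large-scale structures spanning the whole sequence rather than one-off ripples tied to a single token. Their masses are all on the light side, which means they propagate across many layers; their helicity proxy is also relatively stable, which means the directional signature is not fully scrambled between layers.

That is already worth noting: these modes are not passing sparks but reusable internal carriers.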

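First, translate the four columns into plain language

If a spectrum table looks abstract, first compress it into four questions.

Momentum

Momentum asks: is the mode smooth along token positions, or rapidly oscillating?

low momentum means a more global, smoother sequence pattern,
high momentum means faster variation between adjacent tokens.

In this report the important species all sit closer to the low-momentum end, so the better mental image is not “a small gadget on one token” but “a large background shape covering the whole sequence”.

Spin-like score

This is not literal spin; it is better read as internal complexity.

a low score means one internal direction stands out,
a high score means several directions are mixed together and the structure is more complex.

Mass

Mass says whether a mode decays quickly as depth increases.

light modes survive many layers,
heavy modes disappear quickly.

So when the report says these species are all on the light side, it really means they are not shallow noise but internal modes that can propagate all the way into deeper layers.

Helicity proxy

Helicity here is a simplified indicator that combines propagation direction with internal orientation.

stable means the mode keeps a recognisable directional signature,
unstable means that signature has been mixed away.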

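`I_0`: the strongest default continuation base

I_0 is the easiest species to explain because it is both the strongest and the simplest.

In this report it has the largest amplitude and the lowest spin-like complexity among the leading species. The plainest reading is:

I_0 looks like a strong background mode that keeps the model pushing the sentence toward a plausible answer.

It is less like a specific piece of knowledge and more like a general continuation scaffold. When the prompt has a form such as “If all dogs are animals…” or “What is 7 + 8?”, I_0 looks like the force that opens up the “answer slot” and holds it steady.

Compressed into one sentence, I_0 is the model’s base layer for “keep the computation from drifting off course”.

`I_1`: quiet but important structural support

I_1 is better read as a support mode than as the most visible decision button.

In intervention-style reading, changing I_1 usually does not swap out a word immediately the way changing a strong causal mode does. That does not make it useless. The more common explanation is that it is too basic and too infrastructural, so the surface output does not necessarily change sharply right away.

A better plain-language reading is:

I_1 looks like a structural mode that maintains the shape of the representation and the stability of the system, so that other, more task-specific modes can work on top of it.

It is more like scaffolding than the lead actor at centre stage.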

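`I_2`: a reference-resolution mode with a person-reference bias

I_2 is easiest to understand in pronoun prompts like this:

“Alice gave Bob a book. He thanked her for …”

In this report, amplifying I_2 pushes the output toward a more person-centred, masculine-pronoun interpretation. That makes it look less like a generic language mode and more like a reference-selection channel.

Put more plainly:

I_2 appears to help the model decide which “person” the sentence is currently tracking.

That does not make it a literal “male pronoun particle”. The safer reading is that, in this probe setup, I_2 reliably takes part whenever the model has to compress a mixed discourse context into one definite antecedent.

`I_3`: a higher-complexity mixing and coordination mode

I_3 is less like a single-purpose button and more like a coordination mode that mixes several constraints.

Its spin-like complexity is higher, which means it does not work along one clean axis but is built from several equally important internal directions. This tends to show up in prompts where the model must maintain several constraints at once.

A better plain-language reading is:

I_3 looks like a mode that kneads several half-formed constraints into one usable internal state.

So rather than picturing it as the button that directly settles a token, picture it as the middle-layer “mixer” that keeps complex reasoning or structural interpretation from falling apart.

`I_4`: another reference mode, complementary to `I_2`

I_4 resembles I_2 but carries a different directional preference in pronoun-style settings.

In this report, amplifying I_4 pushes the output toward forms like “her” rather than neutral or other continuations. The plainer reading is:

I_4 is also a reference-sensitive mode, but on the question of “who is being talked about right now” it and I_2 represent different discourse landing points.

This matters because it shows that “pronoun tracking” is not one monolithic skill. The model can split the work into several internal modes that are close to each other but not identical.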

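The whole spectrum table in one paragraph

Taken together, I_0 to I_4 tell a coherent story:

I_0 is a strong, general background base,
I_1 keeps the structure stable,
I_2 and I_4 take part more visibly in reference selection,
I_3 looks more like a mixing mode for complex constraints.

That is the value of the Intelliton view. It compresses a large hidden state that is hard to inspect directly into a set of recurring “roles”.

What not to over-interpret

There are two important caveats.

Species indices are bookkeeping labels. I_2 in one run is not guaranteed to be exactly equivalent to I_2 in another run.
Momentum, spin, and helicity are all proxies. They organise evidence; they do not prove that the network literally contains particles.

The safest reading is: these labels help us summarise recurring activation roles.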


What Are Intellitons? A Friendly Guide to the Lattice-Field View

2026-04-01T00:00:00+00:00

Start with the least mysterious version

The Intelliton project is not claiming that a language model secretly contains real physical particles.

The core idea is simpler and more useful than that: take the transformer residual stream, write it in a coordinate system that physicists already know how to reason about, and ask whether stable, recurrent modes appear.

At one layer, the residual stream is just a matrix:

T rows for token positions
D columns for hidden channels

You can think of it as a long row of sensors. Each token position has thousands of readings. The question is not whether any single neuron matters, but whether the whole pattern can be compressed into a small set of reusable modes.

The sensor analogy

Imagine a sentence with 20 token positions. At each position, instead of one reading, you have a vector with thousands of numbers. That is what one layer of the residual stream looks like.

Now ask four very ordinary questions:

Along the token axis, is the pattern smooth or rapidly oscillating?
Inside hidden channels, does it point mostly in one direction or is it a messy mixture?
As layers get deeper, does the pattern survive or die out quickly?
Does the pattern’s internal structure stay tied to a preferred propagation direction?

Those four questions become the project’s four main diagnostics:

Momentum answers question 1.
Spin-like complexity answers question 2.
Mass answers question 3.
Helicity proxy answers question 4.

This is why the physics language is useful. It gives a compact way to talk about four different facets of the same hidden pattern.

A new coordinate system, not a new ontology

The project rewrites the residual stream in a very specific way:

the token axis is treated like a one-dimensional lattice in space,
the layer axis is treated like discrete time,
the hidden channels are treated like internal degrees of freedom.

That mapping is the whole point. It does not say that text is literally matter. It says that a familiar toolkit from lattice field theory can be borrowed to organise activation patterns.
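
As a minimal sketch of that mapping, the per-layer residual-stream matrices can be stacked into one array whose axes play the roles of time, space, and internal degrees of freedom. The axis order, the function name `as_lattice_field`, and the toy sizes are assumptions made for this example; the source does not show how src/lattice_field.py lays out its data.

```python
import numpy as np

# Axis order used in this sketch: (layer, token, channel) = (time, space, internal).
TIME_AXIS, SPACE_AXIS, INTERNAL_AXIS = 0, 1, 2


def as_lattice_field(per_layer_states) -> np.ndarray:
    """Stack per-layer (T, D) residual-stream matrices into an (L, T, D) array."""
    field = np.stack(
        [np.asarray(h, dtype=float) for h in per_layer_states], axis=TIME_AXIS
    )
    if field.ndim != 3:
        raise ValueError("each layer must be a 2-D (T, D) matrix")
    return field


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, T, D = 6, 20, 32  # toy sizes, not a real model
    states = [rng.normal(size=(T, D)) for _ in range(L)]
    field = as_lattice_field(states)
    print("time steps (layers):", field.shape[TIME_AXIS])
    print("lattice sites (tokens):", field.shape[SPACE_AXIS])
    print("internal degrees of freedom (channels):", field.shape[INTERNAL_AXIS])
```

With the data in this shape, a Fourier transform along the token axis, an SVD of one layer, and a decay fit along the layer axis each act on a single named axis.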

In code, the main definitions live in src/lattice_field.py, and the overall orchestration sits in src/intelliton_analyzer.py.

What momentum means here

Momentum in this project is just a Fourier description of how a mode varies across token positions.

If the dominant momentum is near k = 0, the pattern is broad and smooth across the sequence.
If the dominant momentum is large (toward k = π, the largest value on a discrete lattice), the pattern flips more sharply from one token to the next.

An everyday analogy is an audio equalizer:

low frequency means slow, smooth variation,
high frequency means fast, jagged variation.

So when a report says a mode is low-momentum, it is usually saying: this is a sequence-scale pattern, not a tiny local blip tied to one token.
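
As a rough illustration, the dominant momentum of a single layer can be read off a discrete Fourier transform along the token axis. The sketch below assumes the layer is available as a real-valued NumPy array of shape (T, D) and that power is summed over hidden channels. The function name and that aggregation choice are assumptions, not the definitions used in src/lattice_field.py.

```python
import numpy as np


def dominant_momentum(layer: np.ndarray) -> float:
    """Return the lattice momentum k in [0, pi] that carries the most power.

    layer: real array of shape (T, D), T token positions by D hidden channels.
    """
    T = layer.shape[0]
    spectrum = np.fft.fft(layer, axis=0)         # DFT along the token axis
    power = (np.abs(spectrum) ** 2).sum(axis=1)  # total power per momentum index
    half = power[: T // 2 + 1]                   # for real input, -k mirrors +k
    n = int(np.argmax(half))
    return 2.0 * np.pi * n / T                   # k_n = 2*pi*n / T


if __name__ == "__main__":
    T, D = 10, 4
    t = np.arange(T)[:, None] * np.ones((1, D))  # token index, repeated per channel
    smooth = np.ones((T, D))                     # same value at every token
    alternating = np.cos(np.pi * t)              # sign flips from token to token
    middle = np.cos(2 * np.pi * 3 * t / T)       # three cycles over ten tokens
    for name, x in [("smooth", smooth), ("alternating", alternating), ("middle", middle)]:
        print(name, round(dominant_momentum(x), 3))
```

On this toy length-10 lattice the three inputs give k = 0, k = π, and k = 0.6π ≈ 1.885. The lattice length is made up for the example; the source does not state the sequence length used in the reports.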

What spin-like complexity means here

This is the term most likely to confuse readers, because it is not literal particle spin.

The project uses SVD to split a layer into dominant modes. In plain language, SVD asks:

Can this complicated activation matrix be explained mostly by one or two big patterns, or do we need many equally important patterns?

If one mode dominates, the internal structure is simple and concentrated. If energy is spread across many directions, the structure is more mixed and complex. The blog and code call that a spin-like quantity, but the safer mental model is simply internal complexity.
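
One simple way to turn that question into a number is the participation ratio of the singular values, which counts roughly how many modes carry comparable energy. This is a stand-in chosen for illustration: the source does not give the project's exact spin-like formula, and the function name `effective_mode_count` is hypothetical.

```python
import numpy as np


def effective_mode_count(layer: np.ndarray) -> float:
    """Participation ratio of singular-value energies for a (T, D) layer.

    Close to 1 when a single mode dominates; close to r when r modes
    carry equal energy. The layer must not be all zeros.
    """
    s = np.linalg.svd(layer, compute_uv=False)  # singular values only
    energy = s ** 2
    p = energy / energy.sum()                   # fraction of energy per mode
    return float(1.0 / np.sum(p ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rank_one = np.outer(rng.normal(size=20), rng.normal(size=64))  # one pattern
    mixed = rng.normal(size=(20, 64))                              # many patterns
    print("rank-one:", round(effective_mode_count(rank_one), 2))
    print("mixed:   ", round(effective_mode_count(mixed), 2))
```

A rank-one layer scores about 1, while a layer made of unstructured noise scores much higher, which matches the "simple and concentrated" versus "mixed and complex" reading in the text.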

What mass means here

Mass is the most intuitive part of the analogy.

The layer axis is treated like discrete time, and the analysis tracks whether a mode’s strength fades quickly or persists through many layers.

a light mode survives for a long depth range,
a heavy mode dies out quickly.

So mass in this framework is really a measure of how easily a pattern propagates through the network, not how much it weighs in any everyday sense.
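
If a mode's amplitude is assumed to fall off roughly exponentially with depth, A(l) ≈ A(0)·exp(−m·l), the mass can be estimated as the negative slope of log-amplitude against layer index. The sketch below does this with a least-squares line fit. The exponential form, the function name `estimate_mass`, and the toy amplitudes are assumptions for illustration; the project's propagator-based definition may differ in detail.

```python
import numpy as np


def estimate_mass(amplitudes) -> float:
    """Estimate a decay rate m from per-layer amplitudes, assuming A(l) ~ exp(-m * l).

    amplitudes: positive mode strengths, one per layer, in layer order.
    """
    amps = np.asarray(amplitudes, dtype=float)
    layers = np.arange(len(amps))
    slope, _intercept = np.polyfit(layers, np.log(amps), 1)  # straight-line fit
    return float(-slope)                                     # larger m = heavier mode


if __name__ == "__main__":
    depth = np.arange(24)
    light = 5.0 * np.exp(-0.02 * depth)  # barely fades over 24 layers
    heavy = 5.0 * np.exp(-0.80 * depth)  # gone within a few layers
    print("light:", round(estimate_mass(light), 3))
    print("heavy:", round(estimate_mass(heavy), 3))
```

The fit recovers the decay rates used to build the toy curves (0.02 and 0.8), so the light mode keeps most of its amplitude across the stack while the heavy one is effectively gone after a handful of layers.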

What helicity means here

Helicity is also a proxy, not a literal high-energy-physics observable.

The simplified question is: if a mode has a preferred direction on the token lattice, does its internal structure stay aligned with that direction across layers?

If yes, the mode has a more stable directional signature. If not, the mode is being scrambled.

This is useful because two modes can have similar amplitude but very different directional stability.

Why this framing helps

Once the residual stream is written this way, the project can ask practical questions that are hard to state cleanly in raw neuron space:

Which patterns are global versus local across the sequence?
Which patterns are internally simple versus heavily mixed?
Which patterns are shallow noise versus deep, persistent carriers?
Which patterns stay stable across prompts, tasks, and generation steps?

That is the value of Intellitons. They are a compact language for recurring activation patterns. They are useful if they organise observations better than a giant pile of raw activations.

The shortest correct summary

If you want the plainest possible version, it is this:

Intellitons are recurring residual-stream modes described in a physics-inspired coordinate system. DFT tells you how they vary across tokens, SVD tells you how internally concentrated they are, propagator decay tells you how far they travel across layers, and helicity tells you whether their internal structure keeps a stable directional signature.

The next article makes that concrete by showing how to read a spectrum report and what I_0 to I_4 sound like in ordinary language.


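Start with the least mysterious version

The Intelliton project is not saying that a language model really contains physical particles.

A more accurate and more useful statement is this: move the transformer residual stream into a coordinate system physicists already know well, then look for modes that are stable, recurring, and trackable across layers.

At a given layer, the residual stream is just a matrix:

T rows for token positions
D columns for hidden channels

You can picture it as a row of sensors. Every token position carries thousands of readings. What the project cares about is not whether any single neuron matters, but whether the whole signal can be summarised by a few reusable principal modes.

The plainest analogy: a row of sensors

Imagine a sentence with 20 token positions. At each position there is not one number but a whole reading vector with thousands of dimensions. That is roughly what one layer of the residual stream looks like.

Now ask four very plain questions:

Along the token axis, does the pattern vary smoothly or oscillate rapidly?
Inside the hidden channels, does it look like a single direction or a complex mixture?
As layers get deeper, does it persist for long or vanish quickly?
Does its internal structure stay tied to a particular propagation direction?

These four questions map exactly onto the project’s four main diagnostics:

Momentum answers question 1.
Spin-like complexity answers question 2.
Mass answers question 3.
Helicity proxy answers question 4.

This is why the physics language is useful here. It ties four different facets of the same hidden modes together with one compact vocabulary.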

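A new coordinate system, not a new ontology

The project rewrites the residual stream like this:

the token axis is treated as space on a one-dimensional lattice,
the layer axis is treated as discrete time,
the hidden channels are treated as internal degrees of freedom.

This step is the whole point. It does not say that text has literally become matter. It says that familiar tools from lattice field theory can be borrowed to organise the model’s internal activation patterns.

In code, the main definitions are concentrated in src/lattice_field.py, and the overall pipeline is tied together by src/intelliton_analyzer.py.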

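What “momentum” actually means here

In this project, momentum is just a Fourier coordinate describing how a mode varies along token positions.

If the dominant momentum is near k = 0, the mode is relatively smooth and global across the whole sequence.
If the dominant momentum is large, the mode switches faster and oscillates more strongly between adjacent tokens.

The easiest analogy is an audio equalizer:

low frequency means slow, smooth variation,
high frequency means sharp, fast swings.

So when a report says a mode is low-momentum, it usually does not mean “slow”. It means this is a large-scale pattern covering the whole sequence, not small noise tied to one token.

Why “spin” here is really a measure of complexity

This term is the easiest to misread, because it is not spin in the strict particle-physics sense.

The project uses SVD to split a layer into several principal modes. The plain-language question is:

This layer looks complicated, but is it mostly governed by one or two big modes, or does explaining it require many modes of similar importance?

If one mode stands out, the internal structure is more concentrated and simpler. If energy is spread across many directions, the internal structure is more mixed and more complex. The blog and code borrow physics language and call this quantity spin-like, but the safer reading is simply internal complexity.

Why “mass” here is just how long a mode survives across layers

Mass is the most intuitive step in the whole analogy.

The project treats the layer axis as discrete time and checks whether a mode’s strength decays quickly in deeper layers.

light modes persist through many layers,
heavy modes vanish quickly.

So mass here really measures a mode’s ability to penetrate the depth of the network, not “how heavy” it is in the everyday sense.

Why “helicity” here is only a directional proxy

Helicity here is also just a proxy, not a strict observable of the kind used in high-energy physics.

The simpler question is: if a mode has a preferred propagation direction on the token lattice, does its internal structure stay tied to that direction as it propagates across layers?

If it does, the mode’s directional signature is more stable. If it does not, the mode is being scrambled between layers.

This is useful because two modes with similar amplitude can differ completely in directional stability.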

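Why this framing helps

Once the residual stream is written in this form, the project can pose questions that are hard to state directly in raw neuron space:

Which modes are global and which are more local?
Which modes are internally concentrated and which are heavily mixed?
Which modes are just shallow noise and which travel all the way to deep layers?
Which modes stay stable across prompts, tasks, and generation steps?

This is where the value of Intellitons lies. They provide a compressed language for describing recurring activation patterns. As long as that language organises observations better than a pile of raw activations, it is useful.

The whole thing in one sentence

If only the plainest and most accurate sentence is kept, it is this:

Intellitons are recurring residual-stream modes described in a physics-inspired coordinate system. DFT shows how they vary along tokens, SVD shows whether they are internally concentrated, propagator decay shows how deep they travel, and helicity shows whether their internal structure keeps a stable directional signature.

The next article grounds this in a concrete spectrum table and shows how to read I_0 to I_4.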