<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://intellitons.wiki/feed.xml" rel="self" type="application/atom+xml" /><link href="https://intellitons.wiki/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-04-16T15:27:12+00:00</updated><id>https://intellitons.wiki/feed.xml</id><title type="html">Intellitons Blog</title><subtitle>Popular science articles about Intellitons — the quasi-particle-like collective excitation modes discovered inside large language models.</subtitle><author><name>Intellitons Project</name></author><entry><title type="html">Safety Alignment Through the Intelliton Lens: Toward Structural Guarantees</title><link href="https://intellitons.wiki/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape.html" rel="alternate" type="text/html" title="Safety Alignment Through the Intelliton Lens: Toward Structural Guarantees" /><published>2026-04-08T00:00:00+00:00</published><updated>2026-04-08T00:00:00+00:00</updated><id>https://intellitons.wiki/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape</id><content type="html" xml:base="https://intellitons.wiki/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape.html"><![CDATA[<div data-lang="en">

  <h2 id="the-uncomfortable-lesson-from-gemma-4">The uncomfortable lesson from Gemma 4</h2>

  <p>The ARA jailbreak of Gemma 4 in April 2026 demonstrated something that the AI safety community
had long feared but struggled to quantify: <strong>RLHF-imposed alignment is not a deep architectural
property of the model — it is a separable spectral overlay.</strong></p>

  <p>The implication is stark. Any open-source model, no matter how carefully aligned during training,
can have its alignment stripped by someone with:</p>
  <ul>
    <li>access to the model weights,</li>
    <li>a few hundred forward passes to collect contrast activations,</li>
    <li>a laptop and a few minutes of linear algebra.</li>
  </ul>

  <p>This is not a failure of any particular alignment method. It is a structural property of how
current RLHF and DPO work: they shift the model’s behavioural outputs by adjusting the magnitudes
and directions of a small set of residual-stream modes, but they do not fundamentally restructure
the mode landscape inherited from pre-training.</p>

  <hr />

  <h2 id="what-thin-alignment-looks-like-in-intelliton-terms">What “thin alignment” looks like in Intelliton terms</h2>

  <p>The alignment-vs-base comparison in
<a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">Scaling and Alignment Through the Intelliton Lens</a>
shows that instruction tuning creates measurable but limited changes to the Intelliton spectrum:</p>

  <ul>
    <li>a shift in dominant momentum (the alignment process changes the sequence-scale structure of the
main backbone mode),</li>
    <li>a modest reduction in the number of species (some modes are suppressed or merged),</li>
    <li>a shift in fixed-point structure (all points become crossovers, suggesting a more uniform
propagation regime).</li>
  </ul>

  <p>What does <em>not</em> change is the fundamental mode landscape. The same species types appear in both the
base and instruct models. The instruct model has had some modes adjusted and one or two new ones
added, but the bulk of the residual-stream dynamics are inherited directly from pre-training.</p>

  <p>In Abliteration terms, this means the refusal modes sit <em>on top of</em> the task-solving modes rather
than being <em>woven into</em> them. Removing the refusal modes does not substantially disturb the
task-solving modes, which is why jailbroken models retain their capabilities.</p>

  <hr />

  <h2 id="the-structural-alignment-hypothesis">The structural alignment hypothesis</h2>

  <p>The Intelliton framework suggests a more robust alignment paradigm:</p>

  <blockquote>
    <p><strong>Structural alignment</strong>: instead of adding refusal modes on top of the existing mode landscape
(current RLHF), train the model such that safety-relevant mode properties are <em>architecturally
entangled</em> with capability-relevant modes across many layers and many tasks.</p>
  </blockquote>

  <p>Under structural alignment, removing a safety mode would necessarily degrade a capability mode,
because the two would share subspace components across multiple layers. The cost of ablation rises
from near-zero to a genuine capability penalty.</p>
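  <p>A short worked identity illustrates the cost argument. As a simplifying assumption (not part of
the Intelliton analysis itself), treat ablation as projecting one unit safety direction
\(\hat{s}\) out of a capability vector \(c\) that makes an angle \(\theta\) with it:</p>

\[c' = c - \langle c, \hat{s} \rangle \hat{s}, \qquad \|c'\|^2 = \|c\|^2 - \langle c, \hat{s} \rangle^2 = \|c\|^2 \left(1 - \cos^2\theta\right)\]

  <p>When the two directions are orthogonal (\(\cos\theta = 0\)) the capability vector is untouched,
which is the near-zero-cost regime. With \(|\cos\theta| = 0.6\), the same projection removes 36% of
its squared norm.</p>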

  <p>This borrows the cancer-treatment metaphor from the ARA literature: current RLHF leaves the cancer
cells identifiable for targeted removal, whereas structural alignment would make them biologically
inseparable from healthy tissue in a way that deters removal.</p>

  <hr />

  <h2 id="three-concrete-research-directions">Three concrete research directions</h2>

  <h3 id="direction-1--measure-current-alignment-depth">Direction 1 — Measure current alignment depth</h3>

  <p>The first step is to quantify how separable alignment modes actually are, using the Intelliton
framework as the measurement instrument.</p>

  <p><strong>Protocol:</strong></p>
  <ol>
    <li>For a matched pair of base and instruct models (e.g., <code class="language-plaintext highlighter-rouge">Qwen3-8B-Base</code> and <code class="language-plaintext highlighter-rouge">Qwen3-8B</code>), compute
the per-layer Intelliton species catalogue for both.</li>
    <li>For each species in the instruct model, compute its cosine similarity with the closest species
in the base model at the same layer.</li>
    <li>Define an <strong>alignment depth score</strong> as the fraction of alignment-specific modes (modes present
in the instruct model but not in the base model, or modes with significantly shifted spectral
properties) that have non-negligible cosine overlap with <em>at least one</em> task-solving mode.</li>
  </ol>

  <p>A high alignment depth score means alignment modes are deeply entangled with task modes
(structurally hard to remove). A low score means they are orthogonal (structurally easy to remove).</p>
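  <p>The scoring step can be sketched in a few lines of NumPy. This is a minimal illustration, not
the project’s pipeline: it follows the convention of the paragraph above (a high score means
entangled) by counting the alignment modes whose strongest overlap with any task-solving mode
reaches a threshold. The function names, the single-layer array layout and the
<code class="language-plaintext highlighter-rouge">overlap_threshold</code> value are all assumptions made for the example.</p>

```python
import numpy as np


def unit_rows(m: np.ndarray) -> np.ndarray:
    """Normalise each row to unit length (rows are mode direction vectors)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)


def alignment_depth_score(alignment_modes: np.ndarray,
                          task_modes: np.ndarray,
                          overlap_threshold: float = 0.3) -> float:
    """Fraction of alignment modes that overlap with at least one task mode.

    Both inputs have shape (n_modes, d_model) and describe a single layer.
    `overlap_threshold` is a hypothetical cut-off, not a published value.
    """
    a = unit_rows(alignment_modes)
    t = unit_rows(task_modes)
    overlaps = np.abs(a @ t.T)          # |cosine| for every (alignment, task) pair
    strongest = overlaps.max(axis=1)    # best-matching task mode per alignment mode
    return float(np.mean(strongest >= overlap_threshold))


# Toy data: two task modes along axes 0 and 1 of a 4-dimensional space.
task = np.eye(4)[:2]
# First alignment mode is orthogonal to both; second sits at 45 degrees to task mode 0.
align = np.array([[0.0, 0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0, 0.0]])
score = alignment_depth_score(align, task)   # 0.5
```

  <p>In this toy case one of the two alignment modes could be removed without touching any task
mode, so the score is 0.5. A per-model score would still need an aggregation rule across layers,
which the protocol above leaves open.</p>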

  <p>The hypothesis is that current RLHF produces a low alignment depth score, and that this is
measurable with the existing Intelliton toolkit before any jailbreak attempt.</p>

  <h3 id="direction-2--design-training-objectives-that-increase-alignment-depth">Direction 2 — Design training objectives that increase alignment depth</h3>

  <p>If alignment depth is measurable, it becomes a trainable objective.</p>

  <p>The proposed training signal would add a <strong>mode entanglement regularisation term</strong> to the RLHF or
DPO loss. The term penalises configurations where safety-relevant mode directions are orthogonal to
capability-relevant mode directions at the same layer:</p>

\[\mathcal{L}_{\text{entanglement}} = -\sum_{\ell} \sum_{s \in \text{safety}} \sum_{c \in \text{capability}} \left| \langle \hat{v}_{s,\ell}, \hat{v}_{c,\ell} \rangle \right|\]

  <p>Because of the leading minus sign, minimising this term during alignment training maximises the
summed absolute overlaps. That pushes the model toward configurations where safety modes share
subspace components with capability modes, which raises the cost of surgical removal.</p>
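  <p>To make the sign convention concrete, the sketch below evaluates the term with NumPy for given
mode directions. It only computes the value; using it as a training signal would require a
differentiable framework and a way to obtain mode directions during training, neither of which is
specified here. The array layout (layers first, then modes, then model dimension) is an assumption.</p>

```python
import numpy as np


def entanglement_loss(safety: np.ndarray, capability: np.ndarray) -> float:
    """Evaluate L_entanglement = -sum over layers, safety modes and capability modes of |cosine|.

    safety:     shape (n_layers, n_safety, d_model)
    capability: shape (n_layers, n_capability, d_model)
    """
    # Normalise so the inner products below are cosines of unit vectors.
    s_hat = safety / np.linalg.norm(safety, axis=-1, keepdims=True)
    c_hat = capability / np.linalg.norm(capability, axis=-1, keepdims=True)
    # inner[l, s, c] is the inner product of safety mode s and capability mode c at layer l.
    inner = np.einsum("lsd,lcd->lsc", s_hat, c_hat)
    return float(-np.abs(inner).sum())


e = np.eye(3)
# One layer, one safety mode, one capability mode.
orthogonal = entanglement_loss(e[None, :1], e[None, 1:2])   # separable case: loss is zero
parallel = entanglement_loss(e[None, :1], e[None, :1])      # fully shared direction: loss is -1
```

  <p>The orthogonal configuration sits at the maximum of the term (zero) and the fully shared one at
its minimum for a single pair (minus one), so gradient descent on it moves away from separability.</p>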

  <p>This is speculative, but it is testable at small scale on the models already analysed by the
Intelliton project.</p>

  <h3 id="direction-3--use-intelliton-audits-as-a-pre-deployment-safety-check">Direction 3 — Use Intelliton audits as a pre-deployment safety check</h3>

  <p>Even before structural alignment is achievable, the Intelliton framework can be used as a
<strong>pre-deployment safety audit</strong> for open-source models.</p>

  <p>The audit would:</p>
  <ol>
    <li>Run the Intelliton analysis on the released model with contrast prompt sets.</li>
    <li>Compute the alignment depth score.</li>
    <li>Report the estimated minimum cost of Abliteration (how many modes would need to be removed and
the expected capability penalty).</li>
  </ol>

  <p>This would give the open-source community a standardised, interpretable metric for alignment
robustness — something that is currently entirely absent from model release documentation.</p>

  <hr />

  <h2 id="the-deeper-issue-teach-the-model-to-not-say-versus-teach-the-model-to-not-know">The deeper issue: “teach the model to not say” versus “teach the model to not know”</h2>

  <p>The Abliteration literature makes a pointed observation that maps directly onto the Intelliton
framework:</p>

  <blockquote>
    <p>“Teaching the model not to say” (current RLHF) can be defeated. “Teaching the model not to know”
(removing the capability at the pre-training stage) cannot be defeated by post-hoc ablation.</p>
  </blockquote>

  <p>In Intelliton terms:</p>

  <ul>
    <li><strong>Behavioural alignment</strong> (current RLHF) adds a small number of low-complexity, separable refusal
modes that sit orthogonally to the capability modes. These can be removed with targeted ablation.</li>
    <li><strong>Capability-level safety</strong> would require that certain capability modes — the ones that underlie
dangerous knowledge — are never formed during pre-training, or are formed in such a way that they
are deeply entangled with unrelated benign modes.</li>
  </ul>

  <p>The Intelliton framework cannot, by itself, implement capability-level safety. But it can <em>measure</em>
the difference: a model with capability-level safety for a particular dangerous capability would
show, under Intelliton analysis, that the dangerous-knowledge modes are spectrally entangled with
benign modes in a way that makes targeted removal impossible without broad capability degradation.</p>

  <p>This becomes a falsifiable, quantitative prediction that can be tested on released models.</p>

  <hr />

  <h2 id="why-bengios-warning-deserves-a-technical-interpretation">Why Bengio’s warning deserves a technical interpretation</h2>

  <p>Yoshua Bengio, often described as one of the three “godfathers” of deep learning, has consistently
argued that open-sourcing powerful models is dangerous because, once the weights are released, the
alignment can be removed by anyone with modest technical resources.</p>

  <p>The Intelliton framework gives that warning a technical, measurable form:</p>

  <blockquote>
    <p><strong>A model’s alignment robustness is bounded above by its alignment depth score. Current models,
based on the spectral evidence already available from base/instruct comparisons, have low
alignment depth scores.</strong></p>
  </blockquote>

  <p>This is not a political statement. It is a quantitative prediction that can be tested, and that, if
it holds, tells us that the current open-source release paradigm for aligned models carries
measurable safety risks that can be expressed in the language of residual-stream spectral analysis.</p>

  <hr />

  <h2 id="the-research-agenda-in-summary">The research agenda in summary</h2>

  <table>
    <thead>
      <tr>
        <th>Step</th>
        <th>What to measure</th>
        <th>What it tells us</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Alignment depth audit</td>
        <td>Cosine overlap between safety modes and task modes, per layer</td>
        <td>How separable current alignment is</td>
      </tr>
      <tr>
        <td>Alignment depth score across model families</td>
        <td>Score vs. model size, RLHF method, training data</td>
        <td>What factors increase structural alignment</td>
      </tr>
      <tr>
        <td>Entanglement regularisation experiment</td>
        <td>Alignment depth score before and after training with mode-entanglement loss</td>
        <td>Whether structural alignment is trainable</td>
      </tr>
      <tr>
        <td>Pre-deployment audit protocol</td>
        <td>Standardised depth score at release time</td>
        <td>A public, interpretable alignment robustness metric</td>
      </tr>
    </tbody>
  </table>

  <p>Each of these steps is feasible using the infrastructure already developed in the Intelliton
project. The first step requires only a new set of contrast prompts and a small extension to the
existing analysis pipeline.</p>

  <hr />

  <h2 id="the-shortest-summary">The shortest summary</h2>

  <ul>
    <li>Current RLHF alignment is spectrally thin: alignment modes are separable from capability modes,
and this separability is what makes Abliteration/ARA work.</li>
    <li>The Intelliton framework can measure this separability as a quantitative <strong>alignment depth score</strong>.</li>
    <li>A research direction based on this measurement would pursue <strong>structural alignment</strong> — training
objectives that increase mode entanglement and make ablation genuinely costly.</li>
    <li>Even before structural alignment is achieved, the Intelliton audit provides a standardised
pre-deployment robustness metric that is currently entirely absent.</li>
  </ul>

  <hr />

  <h2 id="continue-reading">Continue reading</h2>

  <ul>
    <li><a href="/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html">Refusal as an Intelliton</a></li>
    <li><a href="/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering.html">Representation Engineering and Intelliton Steering</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="gemma-4-">Gemma 4 带来的令人不安的教训</h2>

  <p>2026 年 4 月 Gemma 4 遭受 ARA 越狱，向 AI 安全社区证明了一件他们早已担忧却难以量化的事：
<strong>RLHF 引入的对齐，不是模型的深层架构属性，而是一个可分离的谱覆盖层。</strong></p>

  <p>这意味着，不管在训练中做了多少精心的对齐工作，任何开源模型都可以被拥有以下资源的人剥除
对齐：</p>
  <ul>
    <li>模型权重的访问权限；</li>
    <li>几百次前向传播，用于收集对比激活；</li>
    <li>一台笔记本电脑和几分钟的线性代数运算。</li>
  </ul>

  <p>这不是某种特定对齐方法的失败，而是当前 RLHF 和 DPO 工作方式的结构性属性：它们通过调整
少数几个残差流模式的量级和方向来改变模型的行为输出，但并没有从根本上重构继承自预训练的
模式景观。</p>

  <hr />

  <h2 id="intelliton-">“薄对齐”在 Intelliton 语言里长什么样</h2>

  <p><a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">用 Intelliton 视角看规模扩展与对齐</a>
中的对比表明，指令微调确实对 Intelliton 谱产生了可测量但有限的变化：</p>

  <ul>
    <li>主导动量发生偏移（对齐过程改变了主干模式的序列尺度结构）；</li>
    <li>物种数量小幅减少（一些模式被抑制或合并）；</li>
    <li>不动点结构发生变化（所有不动点变成 crossover，意味着更均匀的传播机制）。</li>
  </ul>

  <p>没有改变的是基本的模式景观。base 模型和 instruct 模型中出现的物种类型相同。instruct 模型
对一些模式做了调整，添加了一两个新模式，但残差流动力学的主体是直接从预训练继承而来的。</p>

  <p>用 Abliteration 的语言说，这意味着拒绝模式是<em>叠加在</em>任务求解模式之上的，而不是<em>编织进</em>
任务求解模式里的。移除拒绝模式不会实质性地扰乱任务求解模式，这就是越狱模型仍然保有能力
的原因。</p>

  <hr />

  <h2 id="section">结构性对齐假说</h2>

  <p>Intelliton 框架提示了一种更稳健的对齐范式：</p>

  <blockquote>
    <p><strong>结构性对齐</strong>：不是把拒绝模式叠加在已有模式景观之上（当前 RLHF），而是训练模型，使安全
相关的模式属性在架构层面与能力相关的模式<em>相互纠缠</em>，遍布多层、多任务。</p>
  </blockquote>

  <p>在结构性对齐下，移除一个安全模式必然会损害一个能力模式，因为两者在多个层上共享子空间分
量。消融的代价从接近零，上升为真实的能力损失。</p>

  <p>这与 ARA 文献中的癌细胞治疗类比相似：不是把癌细胞标记出来以便精准切除（当前 RLHF），而
是让它们在生物学上与健康组织不可分离，从而从根本上阻止切除。</p>

  <hr />

  <h2 id="section-1">三个具体研究方向</h2>

  <h3 id="section-2">方向一 —— 测量当前对齐深度</h3>

  <p>第一步是量化当前对齐模式的实际可分离程度，以 Intelliton 框架作为测量工具。</p>

  <p><strong>方案：</strong></p>
  <ol>
    <li>对一对匹配的 base/instruct 模型（例如 <code class="language-plaintext highlighter-rouge">Qwen3-8B-Base</code> 和 <code class="language-plaintext highlighter-rouge">Qwen3-8B</code>），分别计算逐层的
Intelliton 物种目录；</li>
    <li>对 instruct 模型中每个物种，计算其与 base 模型同一层最近邻物种的余弦相似度；</li>
    <li>将<strong>对齐深度分数</strong>定义为：对齐专属模式（出现在 instruct 但不在 base 中，或谱属性发生显
著偏移的模式）中，与至少一个任务求解模式存在不可忽略余弦重叠的那部分比例。</li>
  </ol>

  <p>对齐深度分数高，说明对齐模式与任务模式深度纠缠（结构上难以移除）；分数低，说明它们是正
交的（结构上易于移除）。</p>

  <p>假设是：当前 RLHF 产生的对齐深度分数偏低，而且这可以用现有的 Intelliton 工具集在任何越狱
尝试之前就测量出来。</p>

  <h3 id="section-3">方向二 —— 设计能提升对齐深度的训练目标</h3>

  <p>如果对齐深度可以被测量，它就可以成为一个可训练的目标。</p>

  <p>提议的训练信号，是在 RLHF 或 DPO 损失中加入一个<strong>模式纠缠正则化项</strong>。该项惩罚安全相关模
式方向与同层能力相关模式方向正交的配置：</p>

\[\mathcal{L}_{\text{entanglement}} = -\sum_{\ell} \sum_{s \in \text{safety}} \sum_{c \in \text{capability}} \left| \langle \hat{v}_{s,\ell}, \hat{v}_{c,\ell} \rangle \right|\]

  <p>在对齐训练中最小化这一惩罚项，会推动模型朝向安全模式与能力模式共享子空间分量的配置——
从而提高外科手术式移除的代价。</p>

  <p>这是一个推测性方向，但可以在 Intelliton 项目已经分析过的小规模模型上加以检验。</p>

  <h3 id="intelliton--1">方向三 —— 把 Intelliton 审计用作部署前安全检查</h3>

  <p>即使在结构性对齐尚未实现之前，Intelliton 框架也可以用作开源模型的<strong>部署前安全审计</strong>工具。</p>

  <p>审计流程包括：</p>
  <ol>
    <li>用对比提示词集对发布模型运行 Intelliton 分析；</li>
    <li>计算对齐深度分数；</li>
    <li>报告消融的估计最低代价（需要移除多少模式，预期能力损失是多少）。</li>
  </ol>

  <p>这将为开源社区提供一个标准化、可解释的对齐鲁棒性指标——而这正是目前模型发布文档中完全
缺失的东西。</p>

  <hr />

  <h2 id="section-4">更深层的问题：“教模型不说”与“教模型真不懂”</h2>

  <p>Abliteration 文献中有一个直接映射到 Intelliton 框架的深刻观察：</p>

  <blockquote>
    <p>“教模型不说”（当前 RLHF）可以被攻破。“教模型真不懂”（在预训练阶段就移除该能力）无法
被事后消融攻破。</p>
  </blockquote>

  <p>用 Intelliton 的语言说：</p>

  <ul>
    <li><strong>行为对齐</strong>（当前 RLHF）添加了少数低复杂度、可分离的拒绝模式，它们与能力模式正交。
这些模式可以用定向消融移除。</li>
    <li><strong>能力层面的安全</strong>，则需要某些能力模式——那些支撑危险知识的模式——从未在预训练中形
成，或者以与其他良性模式深度纠缠的方式形成，使得定向移除不可能在不引发大规模能力退化
的情况下完成。</li>
  </ul>

  <p>Intelliton 框架本身无法实现能力层面的安全，但它能<em>测量</em>这种差异：一个针对某种特定危险能
力实现了能力层面安全的模型，在 Intelliton 分析下会表现出，危险知识模式与良性模式在谱上的
纠缠程度，使得任何定向移除都不可能在不引发广泛能力退化的情况下完成。</p>

  <p>这成为了一个可检验的、定量的预测，可以在已发布模型上加以验证。</p>

  <hr />

  <h2 id="section-5">为什么本吉奥的警告值得一个技术性解读</h2>

  <p>深度学习三巨头之一的 Yoshua Bengio，一直坚持认为开源强大模型是危险的，因为一旦权重被公开，
任何拥有适度技术资源的人都可以移除对齐。</p>

  <p>Intelliton 框架为这一警告赋予了技术性、可量化的形式：</p>

  <blockquote>
    <p><strong>模型的对齐鲁棒性，其上界就是它的对齐深度分数。根据 base/instruct 对比中已有的谱证据，
当前模型的对齐深度分数偏低。</strong></p>
  </blockquote>

  <p>这不是政治表态，而是一个可以检验的定量预测。如果它成立，就告诉我们：当前已对齐模型的开
源发布范式，携带着可测量的安全风险，而这些风险可以用残差流谱分析的语言来表达。</p>

  <hr />

  <h2 id="section-6">研究议程总结</h2>

  <table>
    <thead>
      <tr>
        <th>步骤</th>
        <th>测量什么</th>
        <th>告诉我们什么</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>对齐深度审计</td>
        <td>安全模式与任务模式的逐层余弦重叠</td>
        <td>当前对齐的可分离程度</td>
      </tr>
      <tr>
        <td>跨模型家族对齐深度分数</td>
        <td>分数 vs. 模型大小、RLHF 方法、训练数据</td>
        <td>哪些因素提升结构性对齐</td>
      </tr>
      <tr>
        <td>纠缠正则化实验</td>
        <td>训练前后的对齐深度分数</td>
        <td>结构性对齐是否可训练</td>
      </tr>
      <tr>
        <td>部署前审计协议</td>
        <td>发布时的标准化深度分数</td>
        <td>公开的、可解释的对齐鲁棒性指标</td>
      </tr>
    </tbody>
  </table>

  <p>上述每个步骤，用 Intelliton 项目已有的基础设施都是可行的。第一步只需要一组新的对比提示词，
以及对现有分析流程的少量扩展。</p>

  <hr />

  <h2 id="section-7">最短总结</h2>

  <ul>
    <li>当前 RLHF 对齐在谱层面是薄的：对齐模式与能力模式可分离，而这种可分离性正是
Abliteration/ARA 得以奏效的原因；</li>
    <li>Intelliton 框架能将这种可分离性量化为<strong>对齐深度分数</strong>；</li>
    <li>基于这一测量的研究方向，将追求<strong>结构性对齐</strong>——提升模式纠缠程度、使消融变得真正代价
高昂的训练目标；</li>
    <li>即使在结构性对齐尚未实现之前，Intelliton 审计也提供了一个标准化的部署前鲁棒性指标，而
这正是目前完全缺失的。</li>
  </ul>

  <hr />

  <h2 id="section-8">继续阅读</h2>

  <ul>
    <li><a href="/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html">拒绝即 Intelliton</a></li>
    <li><a href="/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering.html">表征工程与 Intelliton 引导</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="safety" /><category term="alignment" /><category term="research-directions" /><summary type="html"><![CDATA[RLHF-based alignment has been shown to be a thin spectral overlay that can be removed in minutes on any open-source model. This article argues that the Intelliton framework offers a route toward something more robust: structural alignment — where safety-relevant modes are architecturally entangled with capability modes, making removal costly rather than free.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Representation Engineering and Intelliton Steering: A Research Proposal</title><link href="https://intellitons.wiki/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering.html" rel="alternate" type="text/html" title="Representation Engineering and Intelliton Steering: A Research Proposal" /><published>2026-04-07T00:00:00+00:00</published><updated>2026-04-07T00:00:00+00:00</updated><id>https://intellitons.wiki/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering</id><content type="html" xml:base="https://intellitons.wiki/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering.html"><![CDATA[<div data-lang="en">

  <h2 id="two-ideas-that-belong-together">Two ideas that belong together</h2>

  <p><strong>Representation engineering</strong> is the practice of directly reading and writing to a model’s
internal activations — without changing any weights — to steer its behaviour. A growing body of
work shows that concepts like “happiness”, “authority”, “political bias”, and “honesty” can be
encoded as linear directions in the residual stream, and that adding or subtracting a small multiple
of those directions at inference time reliably changes what the model outputs.</p>
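  <p>In symbols, and assuming the common formulation from the activation-steering literature (the
symbols \(h_{\ell,t}\) and \(\alpha\) are introduced here for illustration), steering with a unit
concept direction \(\hat{v}\) at layer \(\ell\) updates the residual-stream vector at every token
position \(t\):</p>

\[h_{\ell,t} \leftarrow h_{\ell,t} + \alpha \, \hat{v}\]

  <p>A positive \(\alpha\) amplifies the concept and a negative \(\alpha\) suppresses it. The weights
are never modified.</p>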

  <p><strong>The Intelliton framework</strong> characterises the residual stream as a space of quasi-particle-like
modes. It extracts recurring patterns and labels them by their spectral properties: momentum,
spin-like complexity, mass, and helicity.</p>

  <p>These two ideas describe the same object, the residual stream, at different levels of
abstraction. Representation engineering says “this direction steers this behaviour”. The Intelliton
framework says “this mode has these spectral properties”. Combining them makes both more useful.</p>

  <hr />

  <h2 id="what-representation-engineering-can-do-and-what-it-cannot-say-on-its-own">What representation engineering can do (and what it cannot say on its own)</h2>

  <p>The tools most associated with representation engineering — activation addition, contrastive
activation analysis, and the ARA method used to jailbreak Gemma 4 — share a common limitation:
they can identify <em>which direction to push</em> but they say little about <em>the structure of that
direction in the broader activation space</em>.</p>

  <p>Specifically:</p>

  <ul>
    <li>
      <p><strong>Activation addition</strong> adds a fixed direction to a chosen layer’s residual stream at every
token position. It works reliably for simple concepts, but it can degrade performance when the
steered direction overlaps with important task-solving modes.</p>
    </li>
    <li>
      <p><strong>Contrastive activation analysis</strong> (the core of Abliteration) identifies the mean difference
between two contrastive sets of activations. It finds the refusal direction efficiently, but it
does not tell you how many modes are involved, what those modes’ propagation properties are, or
how much overlap they have with the task-solving modes you want to preserve.</p>
    </li>
    <li>
      <p><strong>ARA</strong> improves on simple subtraction by working in a low-rank subspace rather than a single
direction. It uses SVD to separate the refusal subspace, but it does not connect the separated
components to a broader characterisation of the model’s mode landscape.</p>
    </li>
  </ul>
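  <p>The limitation of contrastive activation analysis is easy to see in code. The sketch below uses
synthetic arrays rather than real model activations, and the function name and shapes are
assumptions. It computes the mean-difference direction and returns exactly one vector, however
many underlying modes contributed to the contrast.</p>

```python
import numpy as np


def contrast_direction(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Unit vector along mean(acts_a) - mean(acts_b).

    Each input has shape (n_prompts, d_model): one residual-stream vector per prompt,
    taken at a fixed layer and token position.
    """
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


rng = np.random.default_rng(0)
baseline = rng.normal(size=(200, 16))   # synthetic stand-in for one prompt set
shift = np.zeros(16)
shift[3] = 2.0                          # planted contrast along axis 3
direction = contrast_direction(baseline + shift, baseline)

task_axis = np.eye(16)[0]               # a made-up task direction
overlap = abs(float(direction @ task_axis))
```

  <p>The planted axis is recovered, but the overlap with any task direction has to be measured
separately, and the single vector carries no information about mode count or propagation. That is
the gap the mode-level description is meant to fill.</p>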

  <p>The Intelliton framework fills exactly these gaps.</p>

  <hr />

  <h2 id="the-steering-map-proposal">The steering map proposal</h2>

  <p>The research direction proposed here is to build what we can call an <strong>Intelliton steering map</strong>:
a catalogue that annotates each Intelliton species with its likely behavioural role, its
approximate layer range, its rank in the residual stream, and its overlap with other species.</p>

  <h3 id="building-the-map-three-ingredients">Building the map: three ingredients</h3>

  <p><strong>Ingredient 1 — Task probes</strong></p>

  <p>Use the five prompt families from <code class="language-plaintext highlighter-rouge">src/datasets.py</code> (pronoun tracking, factual recall, logical
reasoning, arithmetic, syntactic agreement) to establish which Intelliton species are activated by
which kind of task. This is already partially done by the existing analysis.</p>

  <p><strong>Ingredient 2 — Behavioural probes</strong></p>

  <p>Add a new class of probes targeting RLHF-trained behaviours:</p>
  <ul>
    <li>refusal (harmful vs. harmless prompts),</li>
    <li>sycophancy (flattery vs. neutral prompts),</li>
    <li>political neutrality (controversial vs. neutral framings),</li>
    <li>verbosity control (instructed-brief vs. instructed-elaborate prompts).</li>
  </ul>

  <p>Run the same Intelliton analysis on each behavioural probe set and record which species respond.</p>

  <p><strong>Ingredient 3 — Cross-probe overlap</strong></p>

  <p>Compute the pairwise cosine similarity between all per-layer behavioural vectors (refusal,
sycophancy, political neutrality, verbosity) and all per-layer task-activation vectors. Species
with low overlap across all task probes are good steering targets: adding or removing them should
not bleed into task performance.</p>
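  <p>A minimal sketch of this overlap step, assuming each probe has already been reduced to one
direction vector per layer. The function names and the
<code class="language-plaintext highlighter-rouge">max_overlap</code> threshold are hypothetical and chosen only for the example.</p>

```python
import numpy as np


def overlap_matrix(behaviour_vecs: np.ndarray, task_vecs: np.ndarray) -> np.ndarray:
    """|cosine similarity| between every behavioural vector and every task vector."""
    b = behaviour_vecs / np.linalg.norm(behaviour_vecs, axis=1, keepdims=True)
    t = task_vecs / np.linalg.norm(task_vecs, axis=1, keepdims=True)
    return np.abs(b @ t.T)


def steering_candidates(behaviour_vecs: np.ndarray,
                        task_vecs: np.ndarray,
                        max_overlap: float = 0.2) -> list:
    """Indices of behavioural vectors whose overlap with every task vector is at most max_overlap."""
    worst = overlap_matrix(behaviour_vecs, task_vecs).max(axis=1)
    return [int(i) for i in np.flatnonzero(max_overlap >= worst)]


task = np.eye(4)[:2]                          # two task directions (axes 0 and 1)
behaviour = np.array([[0.0, 0.0, 1.0, 0.0],   # orthogonal to both task directions
                      [1.0, 0.0, 0.1, 0.0],   # almost parallel to task direction 0
                      [0.0, 0.1, 0.0, 1.0]])  # slight leak into task direction 1
candidates = steering_candidates(behaviour, task)   # [0, 2]
```

  <p>Taking the maximum over task vectors is a deliberately conservative choice: a behavioural
direction is only listed as a candidate if it stays clear of every task probe, not just of the
average one.</p>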

  <hr />

  <h2 id="the-connection-to-ara">The connection to ARA</h2>

  <p>The ARA technique constructs a rank-\(k\) penalty matrix \(\Delta W\) that projects out the
refusal subspace from the model’s weight matrices. In Intelliton terms, \(\Delta W\) is a
targeted suppression of a small set of Intelliton species.</p>

  <p>The key claim of ARA is that a higher-rank intervention is safer than a rank-1 intervention
(simple vector subtraction) because the refusal behaviour in a capable reasoning model spans
multiple entangled modes. If you only remove the rank-1 component, the remaining components
continue to generate partial refusals or degrade the model’s reasoning.</p>

  <p>This claim can be tested directly using the Intelliton framework:</p>

  <ol>
    <li>Compute the Intelliton spectrum of an instruction-tuned model on harmful prompts.</li>
    <li>Identify the modes that are most active on harmful prompts and least active on harmless prompts.</li>
    <li>Measure whether those modes are clustered in a low-dimensional subspace of the per-layer SVD
basis, or whether they are spread across many independent directions.</li>
  </ol>

  <p>If they are clustered, ARA’s rank-\(k\) approach is justified, and the clustering rank \(k\) can be
estimated from the Intelliton spectrum before any jailbreak attempt is made. If they are spread
across many independent directions, low-rank removal (simple subtraction in particular) is expected
to leave residual refusal capability or cause broader collateral damage.</p>
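  <p>One way to make “clustered versus spread” measurable is an effective-rank estimate over the
selected mode directions. The sketch below is an assumption about how such a test could look, not
the procedure used by ARA or by the Intelliton project: it counts how many singular directions are
needed to capture a chosen share of the squared singular values. The 0.9 default is a placeholder.</p>

```python
import numpy as np


def effective_rank(mode_vecs: np.ndarray, energy: float = 0.9) -> int:
    """Smallest k whose top-k singular directions carry `energy` of the squared singular values.

    mode_vecs has shape (n_modes, d_model); rows are normalised before the SVD.
    """
    unit = mode_vecs / np.linalg.norm(mode_vecs, axis=1, keepdims=True)
    s = np.linalg.svd(unit, compute_uv=False)        # singular values, largest first
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First position where the cumulative share reaches the requested energy.
    return int(np.searchsorted(cumulative, energy) + 1)


# Four modes that all live in the plane spanned by axes 0 and 1.
clustered = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [1.0, 1.0, 0.0, 0.0],
                      [1.0, -1.0, 0.0, 0.0]])
# Four mutually orthogonal modes.
spread = np.eye(4)

k_clustered = effective_rank(clustered)   # 2
k_spread = effective_rank(spread)         # 4
```

  <p>A small value relative to the number of selected modes supports a low-rank intervention. A
value close to the number of modes indicates that the behaviour is spread over independent
directions.</p>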

  <hr />

  <h2 id="a-practical-application-zero-shot-concept-injection">A practical application: zero-shot concept injection</h2>

  <p>The reverse direction — concept injection — is equally interesting.</p>

  <p>Representation engineering researchers have demonstrated that you can <em>add</em> a concept to a model by
adding its activation direction to the residual stream at inference time. For example, adding a
“confidence” direction makes the model sound more certain; adding a “formality” direction makes its
outputs more formal.</p>

  <p>In Intelliton terms, concept injection is the operation of exciting a new Intelliton species that
was not activated by the input prompt. The Intelliton framework predicts that this will be most
stable when:</p>

  <ul>
    <li>the injected mode has low momentum (broad, sequence-level effect rather than token-local),</li>
    <li>the injected mode has low spin-like complexity (concentrated, easy to steer with a rank-1
intervention),</li>
    <li>the injection is applied at the layer range where the mode has the lowest mass (longest
propagation range).</li>
  </ul>

  <p>These three conditions define a <strong>tractability criterion</strong> for representation engineering
interventions: not all concepts are equally steerable, and the Intelliton spectrum can predict which
ones are tractable before you attempt the intervention.</p>
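  <p>As an illustration only, the criterion can be written as a small check over a per-species
summary. Everything in this sketch is hypothetical: the
<code class="language-plaintext highlighter-rouge">ModeProfile</code> fields, the units and both thresholds are placeholders, since the
framework states the conditions qualitatively.</p>

```python
from dataclasses import dataclass


@dataclass
class ModeProfile:
    """Spectral summary of one Intelliton species (field names are hypothetical)."""
    momentum: float        # low means a broad, sequence-level effect
    complexity: float      # spin-like complexity; low means concentrated in few directions
    mass_by_layer: dict    # layer index -> mass; low mass means long propagation range


def best_injection_layer(mode: ModeProfile) -> int:
    """Layer where the mode's mass is lowest."""
    return min(mode.mass_by_layer, key=mode.mass_by_layer.get)


def is_tractable(mode: ModeProfile,
                 max_momentum: float = 0.1,
                 max_complexity: float = 1.5) -> bool:
    """Check the momentum and complexity conditions against placeholder thresholds."""
    return max_momentum >= abs(mode.momentum) and max_complexity >= mode.complexity


style_mode = ModeProfile(momentum=0.02, complexity=1.1,
                         mass_by_layer={10: 0.8, 14: 0.3, 18: 0.5})
layer = best_injection_layer(style_mode)   # 14
tractable = is_tractable(style_mode)       # True
```

  <p>The first two conditions decide whether an intervention is worth attempting, and the third picks
where to apply it, which is why the layer choice is kept as a separate function.</p>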

  <hr />

  <h2 id="enterprise-implications">Enterprise implications</h2>

  <p>The Abliteration/ARA episode revealed a commercially important fact: fine-tuning is not the only
way to customise an open-source model. Representation engineering with Intelliton-guided steering
maps could enable:</p>

  <ul>
    <li><strong>Domain-specific tone calibration</strong> (formal, terse, verbose, empathetic) by identifying and
amplifying or suppressing the relevant low-momentum, low-complexity style modes.</li>
    <li><strong>Compliance mode injection</strong> (make a general model behave as if it were trained on a strict
regulatory corpus) by injecting the compliance Intelliton species identified from a reference
model.</li>
    <li><strong>Persona engineering</strong> (the “Machiavellian” or “street punk” effect described in the Abliteration
literature) by amplifying specific behavioural modes.</li>
  </ul>

  <p>All of these operations require knowing <em>which modes to touch</em> and <em>at which layers</em>. The Intelliton
steering map is precisely that knowledge, expressed in a principled spectral language.</p>

  <hr />

  <h2 id="the-shortest-summary">The shortest summary</h2>

  <ul>
    <li>Representation engineering steers behaviour by writing to the residual stream.</li>
    <li>The Intelliton framework characterises what is already in the residual stream.</li>
    <li>Together, they make it possible to identify <em>which modes to steer</em>, <em>how hard</em>, <em>at which layer</em>,
and <em>at what cost to other modes</em>.</li>
    <li>The proposed Intelliton steering map would turn the species catalogue into a practical intervention
guide for both safety-positive (alignment hardening) and safety-negative (jailbreaking) uses.</li>
  </ul>

  <hr />

  <h2 id="continue-reading">Continue reading</h2>

  <ul>
    <li><a href="/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html">Refusal as an Intelliton</a></li>
    <li><a href="/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape.html">Safety Alignment Through the Intelliton Lens</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="section">两个本该放在一起的概念</h2>

  <p><strong>表征工程</strong>（Representation Engineering）是在不改变权重的前提下，直接读写模型内部激活，
从而引导模型行为的实践方法。大量研究表明，“快乐”、“权威”、“政治偏见”、“诚实”等概念可以
被编码为残差流中的线性方向，在推理时加上或减去这些方向的少量倍数，就能可靠地改变模型的
输出。</p>

  <p><strong>Intelliton 框架</strong>将残差流刻画为一个类准粒子模式的空间，提取反复出现的模式，并用谱属性标
注它们：动量、类自旋复杂度、质量和螺旋度。</p>

  <p>这两套思想描述的是同一个对象，即残差流，只是抽象层次不同。表征工程说“这个方向引导这种行
为”；Intelliton 框架说“这个模式有这些谱属性”。把两者结合起来，两者都会变得更有用。</p>

  <hr />

  <h2 id="section-1">表征工程能做什么（以及它自己说不清什么）</h2>

  <p>与表征工程最相关的工具——激活叠加、对比激活分析，以及用于越狱 Gemma 4 的 ARA 方法——
有一个共同的局限：它们能确定<em>往哪个方向推</em>，但对于<em>那个方向在更广泛激活空间中的结构</em>却
几乎无话可说。</p>

  <p>具体来说：</p>

  <ul>
    <li>
      <p><strong>激活叠加</strong>在每个 token 位置上，向选定层的残差流添加一个固定方向。对于简单概念，效果
可靠，但当被引导方向与重要的任务求解模式重叠时，会降低模型性能。</p>
    </li>
    <li>
      <p><strong>对比激活分析</strong>（Abliteration 的核心）计算两组对比激活的均值差，能高效找到拒绝方向，
但无法告诉你涉及几个模式、这些模式的传播属性是什么，也无法说明它们与你想保留的任务
求解模式有多少重叠。</p>
    </li>
    <li>
      <p><strong>ARA</strong> 改进了简单的减法——它在低秩子空间而非单一方向上操作，用 SVD 分离拒绝子空间，
但没有把分离出的各个分量与模型更广泛的模式景观联系起来。</p>
    </li>
  </ul>

  <p>Intelliton 框架恰好填补了这些空白。</p>

  <hr />

  <h2 id="section-2">引导地图的提案</h2>

  <p>本文提出的研究方向，是构建一张<strong>Intelliton 引导地图</strong>：一份对每个 Intelliton 物种标注其可
能行为角色、大致层范围、在残差流中的秩，以及与其他物种的重叠程度的目录。</p>

  <h3 id="section-3">构建地图：三个要素</h3>

  <p><strong>要素一 —— 任务探针</strong></p>

  <p>使用 <code class="language-plaintext highlighter-rouge">src/datasets.py</code> 中的五类提示词（代词跟踪、事实回忆、逻辑推理、算术、句法一致性），
建立哪类 Intelliton 物种被哪类任务激活的对应关系。这部分已经在现有分析中有所涉及。</p>

  <p><strong>要素二 —— 行为探针</strong></p>

  <p>加入一类新探针，专门针对 RLHF 训练的行为：</p>
  <ul>
    <li>拒绝（有害 vs. 无害提示词）；</li>
    <li>讨好（奉承 vs. 中性提示词）；</li>
    <li>政治中立性（争议性 vs. 中性框架）；</li>
    <li>冗余度控制（指令要求简洁 vs. 指令要求详细的提示词）。</li>
  </ul>

  <p>对每类行为探针集运行同样的 Intelliton 分析，记录哪些物种有响应。</p>

  <p><strong>要素三 —— 跨探针重叠</strong></p>

  <p>计算所有逐层拒绝向量与所有逐层任务激活向量之间的余弦相似度。在所有任务探针上重叠都低的
物种，是好的引导目标：增加或移除它们，不会渗透进任务性能。</p>

  <hr />

  <h2 id="ara-">与 ARA 的联系</h2>

  <p>ARA 技术构造一个秩为 \(k\) 的惩罚矩阵 \(\Delta W\)，把拒绝子空间从模型权重矩阵中投影出
去。用 Intelliton 的语言说，\(\Delta W\) 就是对少数几个 Intelliton 物种的定向抑制。</p>

  <p>ARA 的核心主张是：对于具有强大推理能力的模型，更高秩的干预比秩-1 干预（简单向量减法）更
安全，因为拒绝行为跨越了多个纠缠的模式。如果只移除秩-1 分量，剩余分量会继续产生部分拒绝
或降低模型的推理能力。</p>

  <p>这个主张可以用 Intelliton 框架直接检验：</p>

  <ol>
    <li>在有害提示词上计算指令微调模型的 Intelliton 谱；</li>
    <li>确定在有害提示词上最活跃、在无害提示词上最不活跃的模式；</li>
    <li>度量这些模式是否聚集在逐层 SVD 基的低维子空间里，还是分散在许多独立方向上。</li>
  </ol>

  <p>如果它们是聚集的，ARA 的秩-\(k\) 做法就是有依据的，而且可以在任何越狱尝试之前，通过
Intelliton 谱估算出聚集的秩 \(k\)。如果它们是分散的，简单减法预计会留下残余的拒绝能力，
或造成更广泛的附带损伤。</p>

  <hr />

  <h2 id="section-4">一个实际应用：零样本概念注入</h2>

  <p>反方向——概念注入——同样有趣。</p>

  <p>表征工程研究人员已经证明，可以通过在推理时将某个概念的激活方向加入残差流，来<em>添加</em>这个
概念。例如，加入“自信”方向会让模型听起来更确定；加入“正式性”方向会让输出更正式。</p>

  <p>用 Intelliton 的语言说，概念注入就是激发一个输入提示词本来没有激活的新 Intelliton 物种的
操作。Intelliton 框架预测，在以下情况下这种操作最稳定：</p>

  <ul>
    <li>被注入的模式具有低动量（影响整段序列，而不是局部 token）；</li>
    <li>被注入的模式具有低类自旋复杂度（内部集中，可用秩-1 干预轻松引导）；</li>
    <li>注入发生在模式质量最低（传播范围最大）的层范围内。</li>
  </ul>

  <p>这三个条件定义了表征工程干预的<strong>可操作性判据</strong>：不是所有概念都同样易于引导，而 Intelliton
谱可以在干预尝试之前就预测哪些概念是可操作的。</p>

  <hr />

  <h2 id="section-5">商业含义</h2>

  <p>Abliteration/ARA 事件揭示了一个对商业有重要意义的事实：微调不是定制开源模型的唯一途径。
基于 Intelliton 引导地图的表征工程或许能支持：</p>

  <ul>
    <li><strong>特定领域语气校准</strong>（正式、简洁、详尽、移情），通过识别并放大或抑制相关的低动量、低
复杂度风格模式；</li>
    <li><strong>合规模式注入</strong>（让通用模型表现得像在严格监管语料上训练过），通过从参考模型中识别出
合规 Intelliton 物种并注入；</li>
    <li><strong>人格工程</strong>（Abliteration 文献中描述的“马基雅维利型”或“街头混混型”效果），通过放大特定
行为模式。</li>
  </ul>

  <p>所有这些操作都需要知道<em>该动哪些模式</em>以及<em>在哪些层操作</em>。Intelliton 引导地图，正是以原则
性谱语言表达出来的那份知识。</p>

  <hr />

  <h2 id="section-6">最短总结</h2>

  <ul>
    <li>表征工程通过写入残差流来引导行为；</li>
    <li>Intelliton 框架刻画残差流中已经存在的内容；</li>
    <li>两者结合，就能确定<em>该引导哪些模式</em>、<em>力度多大</em>、<em>在哪一层</em>，以及<em>对其他模式的代价</em>；</li>
    <li>提议的 Intelliton 引导地图，将把物种目录变成一份可操作的干预指南，对安全正向（加固对齐）
和安全负向（越狱）两类用途都适用。</li>
  </ul>

  <hr />

  <h2 id="section-7">继续阅读</h2>

  <ul>
    <li><a href="/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html">拒绝即 Intelliton</a></li>
    <li><a href="/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape.html">用 Intelliton 视角看安全对齐</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="representation-engineering" /><category term="alignment" /><category term="research-directions" /><summary type="html"><![CDATA[Representation engineering intervenes directly on a model's internal activations to steer its behaviour — without fine-tuning. The Intelliton framework provides a natural language for describing those interventions: they are changes to specific Intelliton species. This article proposes a research direction that turns the Intelliton species catalogue into a steering map.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Refusal as an Intelliton: What Abliteration Reveals About Alignment Modes</title><link href="https://intellitons.wiki/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html" rel="alternate" type="text/html" title="Refusal as an Intelliton: What Abliteration Reveals About Alignment Modes" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://intellitons.wiki/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton</id><content type="html" xml:base="https://intellitons.wiki/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html"><![CDATA[<div data-lang="en">

  <h2 id="the-abliteration-result-in-one-sentence">The Abliteration result in one sentence</h2>

  <p>In April 2026, the Gemma 4 model was jailbroken within 90 minutes of release using a technique
called <strong>Abliteration</strong> (a portmanteau of <em>ablation</em> and <em>obliteration</em>). The technique’s premise
is straightforward: an LLM’s refusal behaviour is encoded as a specific linear direction in the
residual stream, and if you project that direction out of the model’s weight matrices, the model
loses its ability to refuse.</p>

  <p>That premise is not mere speculation. It is grounded in the <strong>linear representation hypothesis</strong>
(Mikolov et al., later supported by work from Princeton and Anthropic), which holds that high-level
abstract concepts such as “politeness”, “refusal” and “colour” are encoded, to a good approximation,
as linear directions in the high-dimensional activation space of large language models.</p>

  <hr />

  <h2 id="why-this-is-immediately-relevant-to-the-intelliton-framework">Why this is immediately relevant to the Intelliton framework</h2>

  <p>The Intelliton framework is built on exactly this kind of observation. It takes the transformer
residual stream and asks: <em>which recurring, propagating, linear modes can be extracted from it?</em></p>

  <p>It then characterises each mode with four quantities derived from spectral analysis:</p>
  <ul>
    <li><strong>momentum</strong> (how the mode varies across token positions),</li>
    <li><strong>spin-like complexity</strong> (how internally concentrated or mixed it is),</li>
    <li><strong>mass</strong> (how quickly it decays across layers),</li>
    <li><strong>helicity proxy</strong> (whether its internal structure keeps a stable directional signature).</li>
  </ul>
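
  <p>As a rough illustration of the first and third quantities, the sketch below estimates a momentum
and a mass from toy data. It assumes, hypothetically, that momentum is read off as the dominant Fourier
wavenumber of a mode’s profile over token positions, and that mass is the exponential decay rate of
the mode’s amplitude over layers. The function names are illustrative and are not the project’s API;
the project’s own estimators may differ.</p>

```python
import numpy as np

def dominant_momentum(profile):
    """Wavenumber (radians per token) of the largest non-constant Fourier component."""
    spectrum = np.abs(np.fft.rfft(profile - profile.mean()))
    wavenumbers = 2 * np.pi * np.fft.rfftfreq(len(profile))  # ranges from 0 to pi
    return float(wavenumbers[np.argmax(spectrum)])

def decay_mass(amplitudes):
    """Decay rate m from a least-squares fit of log|A_l| = c - m * l over layers l."""
    layers = np.arange(len(amplitudes))
    slope, _ = np.polyfit(layers, np.log(np.abs(amplitudes)), 1)
    return float(-slope)

# Toy data (assumed, not measured): a token profile and a per-layer amplitude curve.
tokens = np.arange(20)
profile = np.cos(0.6 * np.pi * tokens)             # one oscillation every 10/3 tokens
amplitudes = 2.0 * np.exp(-0.25 * np.arange(12))   # planted decay rate of 0.25 per layer

k = dominant_momentum(profile)
m = decay_mass(amplitudes)
```

  <p>The toy profile oscillates at \(0.6\pi \approx 1.885\) radians per token, so <code class="language-plaintext highlighter-rouge">k</code> recovers that
value, and <code class="language-plaintext highlighter-rouge">m</code> recovers the planted decay rate of 0.25.</p>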

  <p>An Abliteration-style “refusal direction” is, by definition, a <strong>linear mode of the residual
stream</strong>. The only question is whether it is stable, propagating, and distinct enough to register
as a recognisable species under the Intelliton taxonomy.</p>

  <p>The hypothesis this article proposes is:</p>

  <blockquote>
    <p><strong>Refusal, and more broadly RLHF-imposed behavioural preferences, are encoded as a small set of
identifiable Intelliton species with characteristic spectral signatures that differ from those of the
task-solving modes elicited by the prompts in <code class="language-plaintext highlighter-rouge">src/datasets.py</code>.</strong></p>
  </blockquote>

  <hr />

  <h2 id="what-the-existing-intelliton-data-already-suggests">What the existing Intelliton data already suggests</h2>

  <p>The comparison in
<a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">Scaling and Alignment Through the Intelliton Lens</a>
shows that instruction tuning changes the quasi-particle spectrum in measurable ways:</p>

  <ul>
    <li>the dominant momentum of <code class="language-plaintext highlighter-rouge">I_0</code> shifts (from <code class="language-plaintext highlighter-rouge">k ≈ π</code> in the base model to <code class="language-plaintext highlighter-rouge">k ≈ 1.885</code> in the
instruct model for Qwen3-4B),</li>
    <li>the number of distinct species drops from 6 to 5,</li>
    <li>all fixed-point types become crossovers rather than IR fixed points.</li>
  </ul>

  <p>These are not trivial differences. They suggest that RLHF does not merely add a superficial
output-layer filter; it reshapes the internal mode landscape in ways that the Intelliton framework
can already detect.</p>

  <p>What is missing from the current analysis is a targeted experiment: what happens to the spectrum
when you present the model with the specific kinds of inputs — harmful versus harmless prompts —
that Abliteration researchers use to isolate the refusal direction?</p>

  <hr />

  <h2 id="the-proposed-research-direction-isolating-the-refusal-intelliton">The proposed research direction: isolating the refusal Intelliton</h2>

  <p>The concrete research proposal has four steps.</p>

  <h3 id="step-1--collect-refusal-triggering-activations">Step 1 — Collect refusal-triggering activations</h3>

  <p>Use <code class="language-plaintext highlighter-rouge">src/intelliton_analyzer.py</code> to run the model on two contrast sets:</p>
  <ul>
    <li>100 <strong>harmful prompts</strong> (inputs that trigger refusal in an instruction-tuned model),</li>
    <li>100 <strong>harmless prompts</strong> (matched inputs that do not trigger refusal).</li>
  </ul>

  <p>Collect the full per-layer residual-stream activations for both sets.</p>

  <h3 id="step-2--compute-the-mean-difference-direction">Step 2 — Compute the mean-difference direction</h3>

  <p>Following the Abliteration approach, compute:</p>

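\[v_{\text{refusal},\ell} = \frac{1}{N}\sum_{i=1}^{N} H_{\text{harmful},\ell}^{(i)} - \frac{1}{M}\sum_{j=1}^{M} H_{\text{harmless},\ell}^{(j)}\]

  <p>Here \(H_{\text{harmful},\ell}^{(i)}\) is the residual-stream activation at layer \(\ell\) for the
\(i\)-th of \(N\) harmful prompts, and \(H_{\text{harmless},\ell}^{(j)}\) is the same quantity for the
\(j\)-th of \(M\) harmless prompts. This gives a per-layer candidate for the refusal direction.
Normalise it to obtain a unit vector \(\hat{v}_{\text{refusal},\ell}\) at each layer \(\ell\).</p>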
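
  <p>The mean-difference step can be sketched in a few lines of NumPy. The array layout
(prompts, layers, hidden size), the function name and the synthetic data are assumptions made for
illustration. In practice the activations would come from the model rather than from a random
generator.</p>

```python
import numpy as np

def refusal_directions(harmful, harmless):
    """Per-layer unit mean-difference vectors.

    harmful:  array of shape (N, L, d), one activation per prompt and layer
    harmless: array of shape (M, L, d)
    Returns an array of shape (L, d) whose rows have unit norm.
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)    # shape (L, d)
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    return diff / np.maximum(norms, 1e-12)                 # guard against a zero vector

# Synthetic stand-in data: a hypothetical axis is planted in the "harmful" set.
rng = np.random.default_rng(0)
n_layers, d_model = 4, 16
planted = np.zeros(d_model)
planted[3] = 1.0
harmless = rng.normal(size=(100, n_layers, d_model))
harmful = rng.normal(size=(100, n_layers, d_model)) + 5.0 * planted

v_hat = refusal_directions(harmful, harmless)              # shape (4, 16)
```

  <p>On this toy data each row of <code class="language-plaintext highlighter-rouge">v_hat</code> points almost exactly along the planted axis, which is the
behaviour the mean-difference estimator is meant to have when one direction separates the two sets.</p>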

  <h3 id="step-3--project-the-refusal-direction-onto-the-intelliton-basis">Step 3 — Project the refusal direction onto the Intelliton basis</h3>

  <p>The Intelliton framework already computes an SVD-based mode decomposition of the residual stream.
Compute the overlap between \(\hat{v}_{\text{refusal},\ell}\) and the top singular vectors at each
layer. If the refusal direction aligns strongly with one or two dominant modes, those modes are the
“refusal Intellitons”.</p>

  <p>Characterise these modes using the standard four quantities (momentum, spin-like complexity, mass,
helicity). This gives the refusal Intelliton a position in the species taxonomy.</p>
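
  <p>A minimal sketch of the overlap computation at a single layer follows. It assumes the mode basis
is given by the right singular vectors of the mean-centred activation matrix. Whether the project’s
decomposition centres the data, and how many modes it keeps, is not stated here, so both choices
are assumptions, as is the function name.</p>

```python
import numpy as np

def mode_overlaps(acts, v_hat, top_k=5):
    """Squared overlap of a unit vector with the top-k right singular vectors.

    acts:  (n_samples, d) activations at one layer
    v_hat: (d,) unit vector, for example a normalised mean-difference direction
    """
    centred = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)  # rows of vt are modes
    return (vt[:top_k] @ v_hat) ** 2

# Toy layer: isotropic noise plus strong variance along one hypothetical direction.
rng = np.random.default_rng(1)
d_model = 16
v_hat = np.zeros(d_model)
v_hat[3] = 1.0
acts = rng.normal(size=(200, d_model))
acts += rng.normal(scale=6.0, size=(200, 1)) * v_hat

overlaps = mode_overlaps(acts, v_hat)   # one squared overlap per leading mode
```

  <p>Because the singular vectors are orthonormal, the squared overlaps sum to at most 1. A value
near 1 for a single mode indicates that the candidate direction coincides with that mode, while a
spread over several modes points to the multi-mode case discussed in the ARA section.</p>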

  <h3 id="step-4--compare-with-task-solving-modes">Step 4 — Compare with task-solving modes</h3>

  <p>Compare the refusal Intelliton’s spectral profile with the modes activated by pronoun tracking,
factual recall, logical reasoning, arithmetic, and syntactic agreement prompts from <code class="language-plaintext highlighter-rouge">src/datasets.py</code>.</p>

  <p>The core prediction is:</p>

  <blockquote>
    <p><strong>Alignment modes (refusal, politeness, compliance) are low-momentum, low-spin-complexity modes
that appear primarily in middle-to-late layers, and they are measurably more concentrated (lower
effective rank) than the task-solving modes that operate over the same layers.</strong></p>
  </blockquote>

  <p>If this prediction holds, it would explain why Abliteration can remove refusal without severely
damaging task performance: the two mode families occupy different subspaces of the residual stream.</p>

  <hr />

  <h2 id="the-ara-result-as-a-complication">The ARA result as a complication</h2>

  <p>The Arbitrary-Rank Ablation (ARA) method used to jailbreak Gemma 4 found that the refusal
direction in a highly capable reasoning model is not a single vector but a <strong>low-rank subspace</strong>.
In Intelliton terms, this means that refusal is encoded not in one species but in a <em>cluster</em> of
closely related species that are entangled with task-solving modes.</p>

  <p>This complication is actually an opportunity for the Intelliton framework. ARA uses SVD of the
activation matrix to separate the refusal subspace from the rest. The Intelliton mode decomposition
performs the same kind of SVD-based separation at every layer. The difference is that Intelliton also
characterises each separated mode along the four spectral dimensions, which gives a richer picture
than ARA’s purely subspace-based description.</p>
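
  <p>One way to make “low-rank” measurable is to ask how many singular values are needed to
capture most of the energy in a matrix of contrast activations. The sketch below does this on
synthetic data with a planted rank of 3. The 90% threshold, the function name and the data are
assumptions for illustration, not a description of how ARA selects its rank.</p>

```python
import numpy as np

def rank_for_energy(matrix, fraction=0.9):
    """Smallest k whose top-k singular values hold `fraction` of the squared energy."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, fraction) + 1)

# Synthetic contrast matrix: a planted 3-dimensional subspace plus small noise.
rng = np.random.default_rng(2)
d_model, true_rank = 32, 3
basis = np.linalg.qr(rng.normal(size=(d_model, true_rank)))[0]   # orthonormal columns
coeffs = rng.normal(scale=10.0, size=(100, true_rank))
contrast = coeffs @ basis.T + rng.normal(scale=0.1, size=(100, d_model))

k = rank_for_energy(contrast)
```

  <p>A single dominant direction would give <code class="language-plaintext highlighter-rouge">k = 1</code>. A value of a few units, as in this toy case,
corresponds to the low-rank subspace picture described above.</p>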

  <p>The research question becomes: <strong>can the Intelliton species catalogue predict, before any jailbreak
attempt, which modes in an instruct model are alignment-specific and which are shared with the
base model?</strong> If yes, the catalogue becomes a safety audit tool.</p>

  <hr />

  <h2 id="why-this-matters-beyond-jailbreaks">Why this matters beyond jailbreaks</h2>

  <p>The most important implication is not that jailbreaks are possible. It is that
<strong>RLHF-imposed alignment is a small, separable perturbation of the internal mode landscape.</strong></p>

  <p>If alignment modes are genuinely a low-rank, low-complexity overlay on top of the pre-training
modes, that would tell us something important about the nature of RLHF: it measurably shifts the
spectrum and adds or modifies a handful of species, but it does not deeply restructure the capability
modes. On that reading, the base model’s capability modes survive almost intact under the alignment
layer.</p>

  <p>This is consistent with the empirical observation that the ARA-jailbroken Gemma 4 retains its
multi-step reasoning ability and system-prompt following capability after the refusal modes are
removed.</p>

  <p>From a safety research perspective, the implication is troubling: alignment is not a deep
architectural change, it is a spectral overlay, and the Intelliton framework gives us a language to
measure just how thin that overlay is.</p>

  <hr />

  <h2 id="the-shortest-summary">The shortest summary</h2>

  <ul>
    <li>Abliteration/ARA works by erasing a linear direction (or subspace) in the residual stream.</li>
    <li>That direction is an Intelliton.</li>
    <li>The Intelliton toolkit can characterise it, compare it with task modes, and potentially predict
its removability before any jailbreak attempt.</li>
    <li>This makes the Intelliton species catalogue a candidate <strong>alignment audit instrument</strong>, not just a
capability analysis tool.</li>
  </ul>

  <hr />

  <h2 id="continue-reading">Continue reading</h2>

  <ul>
    <li><a href="/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering.html">Representation Engineering and Intelliton Steering</a></li>
    <li><a href="/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape.html">Safety Alignment Through the Intelliton Lens</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="abliteration-">Abliteration 的结论用一句话说</h2>

  <p>2026 年 4 月，Gemma 4 模型在发布后 90 分钟内就被一种名为 <strong>Abliteration</strong>（“消融”与“抹除”
的合成词）的技术越狱。这种技术的前提很直接：大语言模型的拒绝行为，是被编码在残差流中的
一个特定线性方向上的；只要把这个方向从权重矩阵里投影掉，模型就失去了拒绝的能力。</p>

  <p>这不是猜测。它的基础是<strong>线性表征假说</strong>（Mikolov 等人最早提出，后经普林斯顿大学和 Anthropic
团队验证），该假说指出：大语言模型会把“礼貌”、“拒绝”、“颜色”等高层抽象概念，编码为高维
激活空间中单一的线性方向。</p>

  <hr />

  <h2 id="intelliton-">为什么这与 Intelliton 框架直接相关</h2>

  <p>Intelliton 框架就是建立在对这类现象的观察之上的。它取出变换器的残差流，问的是：能从中提取
出哪些反复出现、能跨层传播的线性模式？</p>

  <p>然后用四个量刻画每一个模式：</p>
  <ul>
    <li><strong>动量</strong>（模式沿 token 位置的变化方式）</li>
    <li><strong>类自旋复杂度</strong>（内部集中程度）</li>
    <li><strong>质量</strong>（跨层衰减速度）</li>
    <li><strong>螺旋度代理量</strong>（内部结构方向稳定性）</li>
  </ul>

  <p>Abliteration 所说的“拒绝方向”，按定义，就是<strong>残差流的一个线性模式</strong>。唯一的问题是，它是否
稳定、能传播、并且有足够强的辨识度，可以在 Intelliton 物种分类体系中注册为一个可识别的
物种。</p>

  <p>本文提出的假设是：</p>

  <blockquote>
    <p><strong>拒绝行为，以及更广泛意义上 RLHF 赋予的行为偏好，被编码为少数几个可辨识的 Intelliton
物种；这些物种具有特征性的谱签名，在统计上与 <code class="language-plaintext highlighter-rouge">src/datasets.py</code> 中识别出的任务求解模式
明显不同。</strong></p>
  </blockquote>

  <hr />

  <h2 id="intelliton--1">现有 Intelliton 数据已经暗示的东西</h2>

  <p><a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">用 Intelliton 视角看规模扩展与对齐</a>
中的对比表明，指令微调会以可测量的方式改变准粒子谱：</p>

  <ul>
    <li><code class="language-plaintext highlighter-rouge">I_0</code> 的主导动量发生偏移（Qwen3-4B 的 Base 模型约为 <code class="language-plaintext highlighter-rouge">k ≈ π</code>，Instruct 模型约为
<code class="language-plaintext highlighter-rouge">k ≈ 1.885</code>）；</li>
    <li>可辨识的物种数从 6 减少到 5；</li>
    <li>所有不动点类型都变成了 crossover，而不再有 IR 不动点。</li>
  </ul>

  <p>这些不是微小的差异。它们说明 RLHF 不只是在输出层加了一个浅层过滤器，而是以 Intelliton
框架已经能检测到的方式，重塑了内部模式景观。</p>

  <p>目前分析里还缺少的，是一个有针对性的实验：当把模型暴露在 Abliteration 研究者用来分离拒绝
方向的那种输入（有害提示词 vs. 无害提示词）下时，谱图会发生什么？</p>

  <hr />

  <h2 id="intelliton">提议的研究方向：分离拒绝 Intelliton</h2>

  <p>具体的研究方案包含四步。</p>

  <h3 id="section">第一步 —— 收集触发拒绝的激活</h3>

  <p>用 <code class="language-plaintext highlighter-rouge">src/intelliton_analyzer.py</code> 对两组对照集运行模型：</p>
  <ul>
    <li>100 条<strong>有害提示词</strong>（在指令微调模型中触发拒绝的输入）；</li>
    <li>100 条<strong>无害提示词</strong>（不触发拒绝的匹配输入）。</li>
  </ul>

  <p>对两组输入分别收集逐层的残差流激活。</p>

  <h3 id="section-1">第二步 —— 计算均值差方向</h3>

  <p>按照 Abliteration 的做法，计算：</p>

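\[v_{\text{refusal},\ell} = \frac{1}{N}\sum_{i=1}^{N} H_{\text{harmful},\ell}^{(i)} - \frac{1}{M}\sum_{j=1}^{M} H_{\text{harmless},\ell}^{(j)}\]

  <p>其中 \(H_{\text{harmful},\ell}^{(i)}\) 是第 \(i\) 条有害提示词在第 \(\ell\) 层的残差流激活，
\(H_{\text{harmless},\ell}^{(j)}\) 是第 \(j\) 条无害提示词在同一层的对应激活。这给出了每一层的拒绝方向候选。
将其归一化，得到各层的单位向量 \(\hat{v}_{\text{refusal},\ell}\)。</p>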

  <h3 id="intelliton--2">第三步 —— 把拒绝方向投影到 Intelliton 基上</h3>

  <p>Intelliton 框架已经对残差流做了基于 SVD 的模式分解。计算
\(\hat{v}_{\text{refusal},\ell}\) 与各层顶部奇异向量的重叠度。如果拒绝方向与一两个主导模式
高度对齐，这些模式就是“拒绝 Intelliton”。</p>

  <p>用标准四量（动量、类自旋复杂度、质量、螺旋度）刻画这些模式，得出拒绝 Intelliton 在物种
分类体系中的位置。</p>

  <h3 id="section-2">第四步 —— 与任务求解模式比较</h3>

  <p>把拒绝 Intelliton 的谱轮廓，与 <code class="language-plaintext highlighter-rouge">src/datasets.py</code> 中代词跟踪、事实回忆、逻辑推理、算术、
句法一致性任务所激活的模式进行比较。</p>

  <p>核心预测是：</p>

  <blockquote>
    <p><strong>对齐模式（拒绝、礼貌、合规）是低动量、低类自旋复杂度的模式，主要出现在中-后期层，
而且它们比在同一层运作的任务求解模式更集中（有效秩更低）。</strong></p>
  </blockquote>

  <p>如果这个预测成立，就能解释为什么 Abliteration 能在不严重损害任务性能的前提下移除拒绝功能：
这两类模式占据了残差流中不同的子空间。</p>

  <hr />

  <h2 id="ara-">ARA 的结果带来的复杂性</h2>

  <p>用于越狱 Gemma 4 的 ARA（任意秩消融）方法发现，在一个高能力推理模型中，拒绝方向不是单一
向量，而是一个<strong>低秩子空间</strong>。用 Intelliton 的语言说，这意味着拒绝不是被编码在单一物种中，
而是被编码在一组与任务求解模式相互纠缠的紧密相关物种簇中。</p>

  <p>这个复杂性，其实恰恰是 Intelliton 框架的机会。ARA 通过对激活矩阵做 SVD 来把拒绝子空间从
其余部分分离出来，而这正是 Intelliton 模式分解在每一层都在做的事。区别在于，Intelliton 还
沿四个谱维度刻画每个被分离出来的模式，从而给出比 ARA 那种纯粹基于子空间的描述更丰富的图
景。</p>

  <p>研究问题变成：<strong>在任何越狱尝试发生之前，Intelliton 物种目录能否预测出 instruct 模型里哪些
模式是对齐专属的、哪些是与 base 模型共享的？</strong> 如果答案是肯定的，这份目录就变成了一个
安全审计工具。</p>

  <hr />

  <h2 id="section-3">为什么这超越了越狱本身</h2>

  <p>最重要的含义不是越狱是可行的，而是：<strong>RLHF 引入的对齐，是对内部模式景观的一个小的、可
分离的扰动。</strong></p>

  <p>如果对齐模式真的是叠加在预训练模式之上的低秩、低复杂度覆盖层，那就说明了 RLHF 的本质：
它添加了新的 Intelliton 物种，但并没有深刻重构既有物种。base 模型的能力模式，在对齐层之
下几乎完整地保留着。</p>

  <p>这与经验观察一致：ARA 越狱后的 Gemma 4 移除了拒绝模式，但仍然保留了多步逻辑推理能力和
System Prompt 遵循能力。</p>

  <p>从安全研究的角度看，这个含义令人警惕：对齐不是一种深层的架构改变，而是一种谱覆盖层，而
Intelliton 框架给了我们一种语言，去精确测量这个覆盖层究竟有多薄。</p>

  <hr />

  <h2 id="section-4">最短总结</h2>

  <ul>
    <li>Abliteration/ARA 通过抹去残差流中的线性方向（或子空间）实现越狱。</li>
    <li>那个方向就是一个 Intelliton。</li>
    <li>Intelliton 工具集能够刻画它、把它与任务模式比较，并可能在任何越狱尝试之前就预测它的
可移除性。</li>
    <li>这使 Intelliton 物种目录成为候选的<strong>对齐审计工具</strong>，而不只是能力分析工具。</li>
  </ul>

  <hr />

  <h2 id="section-5">继续阅读</h2>

  <ul>
    <li><a href="/representation-engineering/alignment/research-directions/2026/04/07/representation-engineering-intelliton-steering.html">表征工程与 Intelliton 引导</a></li>
    <li><a href="/safety/alignment/research-directions/2026/04/08/safety-alignment-intelliton-landscape.html">用 Intelliton 视角看安全对齐</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="alignment" /><category term="safety" /><category term="research-directions" /><summary type="html"><![CDATA[The Abliteration jailbreak works by locating and erasing a "refusal direction" in the residual stream. That direction is, by the Intelliton framework's own definition, a linear mode of the residual stream — an Intelliton. This article proposes a research direction: use the Intelliton toolkit to characterise refusal as a species, and ask whether alignment modes are measurably distinct from task modes.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why Different Prompts Light Up Different Intellitons</title><link href="https://intellitons.wiki/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons.html" rel="alternate" type="text/html" title="Why Different Prompts Light Up Different Intellitons" /><published>2026-04-05T00:00:00+00:00</published><updated>2026-04-05T00:00:00+00:00</updated><id>https://intellitons.wiki/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons</id><content type="html" xml:base="https://intellitons.wiki/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons.html"><![CDATA[<div data-lang="en">

  <h2 id="the-same-interface-can-hide-very-different-internal-jobs">The same interface can hide very different internal jobs</h2>

  <p>From the outside, every prompt in this project looks similar: the model reads a prefix and predicts
what comes next.</p>

  <p>Inside the model, that similarity is misleading.</p>

  <p>The prompt categories in <code class="language-plaintext highlighter-rouge">src/datasets.py</code> force the network to solve different kinds of internal
problems. That is why they can light up different Intelliton modes even when every task is framed as
plain text continuation.</p>

  <p>The key point is simple:</p>

  <blockquote>
    <p>The output interface is always next-token prediction, but the hidden computation needed to get
there can be very different.</p>
  </blockquote>

  <hr />

  <h2 id="the-five-prompt-families">The five prompt families</h2>

  <p>The project uses five prompt categories:</p>

  <ol>
    <li>pronoun tracking</li>
    <li>factual recall</li>
    <li>logical reasoning</li>
    <li>arithmetic</li>
    <li>syntactic agreement</li>
  </ol>

  <p>Each category puts pressure on a different part of the model’s internal machinery.</p>

  <hr />

  <h2 id="pronoun-tracking-who-does-she-refer-to">Pronoun tracking: who does “she” refer to?</h2>

  <p>Example prompts include:</p>

  <ul>
    <li>“Alice gave Bob a book. He thanked her for …”</li>
    <li>“The teacher asked the student a question. She answered …”</li>
  </ul>

  <p>These prompts are hard because the model has to keep several candidate entities alive at once and
then decide which one the next pronoun should point to.</p>

  <p>That means the model must track:</p>

  <ul>
    <li>entity identity,</li>
    <li>gender and number cues,</li>
    <li>discourse role,</li>
    <li>which referent is currently most active.</li>
  </ul>

  <p>This is why pronoun-tracking prompts often illuminate reference-sensitive modes. The model is not
just choosing a word. It is doing discourse bookkeeping.</p>

  <hr />

  <h2 id="factual-recall-pull-a-stable-answer-from-memory">Factual recall: pull a stable answer from memory</h2>

  <p>Example prompts include:</p>

  <ul>
    <li>“The capital of France is …”</li>
    <li>“The chemical formula for water is …”</li>
  </ul>

  <p>These are different from pronoun tasks because there is usually one highly preferred answer already
stored in the model’s long-term memory, that is, in its weights.</p>

  <p>The main internal job is not to juggle many local candidates, but to retrieve and stabilise a very
high-confidence continuation.</p>

  <p>That is why factual recall often looks more robust under small perturbations. A mapping such as
“France -&gt; Paris” is usually supported by several redundant internal routes rather than one fragile
single mode.</p>

  <hr />

  <h2 id="logical-reasoning-compress-several-premises-into-one-conclusion">Logical reasoning: compress several premises into one conclusion</h2>

  <p>Example prompts include:</p>

  <ul>
    <li>“If all dogs are animals, and all animals are living things, then all dogs are …”</li>
    <li>“If A is taller than B, and B is taller than C, then A is …”</li>
  </ul>

  <p>These prompts ask the model to combine multiple statements before it can produce the next token.</p>

  <p>So the network needs more than lexical memory. It needs an internal state that keeps the rules,
relations, and target conclusion aligned long enough to land on the right answer.</p>

  <p>This is why logical reasoning often co-activates a strong global backbone mode plus one or more
higher-complexity mixing modes.</p>

  <hr />

  <h2 id="arithmetic-build-the-answer-slot-then-fill-it">Arithmetic: build the answer slot, then fill it</h2>

  <p>Example prompts include:</p>

  <ul>
    <li>“What is 7 + 8? The answer is …”</li>
    <li>“What is 100 divided by 5? The answer is …”</li>
  </ul>

  <p>Arithmetic resembles logical reasoning in one important way: the answer is not a high-frequency word
you can emit immediately. The model has to transform the prefix into a more structured internal
state first.</p>

  <p>That usually means two kinds of work:</p>

  <ul>
    <li>create or stabilise an answer-bearing state,</li>
    <li>carry a small symbolic or numerical transformation.</li>
  </ul>

  <p>This is why arithmetic prompts often share some modes with logical reasoning while still showing
their own task-specific preferences.</p>

  <hr />

  <h2 id="syntactic-agreement-keep-the-sentence-grammatically-on-track">Syntactic agreement: keep the sentence grammatically on track</h2>

  <p>Example prompts include:</p>

  <ul>
    <li>“The group of students were studying hard. Each of them was …”</li>
    <li>“Not only the teacher but also the students were excited about the …”</li>
  </ul>

  <p>These prompts are neither mainly about world knowledge nor mainly about arithmetic.</p>

  <p>Their difficulty comes from grammatical structure:</p>

  <ul>
    <li>what is the true syntactic head,</li>
    <li>what number agreement should be maintained,</li>
    <li>what verb form or continuation is locally licensed.</li>
  </ul>

  <p>So syntactic-agreement prompts often rely on a broad continuation scaffold plus a more local
structure-sensitive correction signal.</p>

  <hr />

  <h2 id="why-similar-low-momentum-modes-can-still-do-different-jobs">Why similar low-momentum modes can still do different jobs</h2>

  <p>An easy mistake is to think that if several species sit near low momentum, they must be doing the
same thing.</p>

  <p>Not so.</p>

  <p>Low momentum only says they are broad sequence-scale patterns rather than sharp token-local ripples.
Two low-momentum modes can still differ in at least three important ways:</p>

  <ol>
    <li>they can point in different hidden-channel directions,</li>
    <li>they can have different amplitude and causal strength,</li>
    <li>they can propagate differently across layers.</li>
  </ol>

  <p>So two modes can both be global while still supporting very different kinds of internal work.</p>
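
  <p>The first of these differences is easy to see in a toy construction. The sketch below builds two
modes that share the same slow token-position profile but point along orthogonal hidden-channel
directions. All names, sizes and directions are hypothetical and chosen only for illustration.</p>

```python
import numpy as np

tokens = np.arange(32)
smooth = np.cos(2 * np.pi * tokens / 32)   # one slow oscillation across the sequence

d_model = 8
dir_a = np.eye(d_model)[0]                 # hypothetical hidden-channel direction A
dir_b = np.eye(d_model)[1]                 # hypothetical direction B, orthogonal to A

mode_a = np.outer(smooth, dir_a)           # shape (tokens, channels)
mode_b = np.outer(smooth, dir_b)

def cosine(x, y):
    """Cosine similarity between two arrays of the same shape."""
    return float(np.sum(x * y) / (np.linalg.norm(x) * np.linalg.norm(y)))

profile_similarity = cosine(mode_a[:, 0], mode_b[:, 1])   # same token profile
pattern_overlap = cosine(mode_a, mode_b)                  # orthogonal channel directions
```

  <p>The two modes have identical token profiles, and therefore the same momentum, yet their full
activation patterns do not overlap at all. An intervention along one direction leaves the other
untouched.</p>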

  <hr />

  <h2 id="a-practical-reading-guide">A practical reading guide</h2>

  <p>If you want to read a task-to-mode result quickly, use this checklist.</p>

  <ol>
    <li>If pronoun prompts are sensitive to a mode, ask whether that mode is helping with referent
selection.</li>
    <li>If arithmetic and logical reasoning co-activate a mode, ask whether it is building an abstract
answer state rather than recalling a memorised phrase.</li>
    <li>If factual recall stays robust under perturbation, ask whether the knowledge is distributed across
several redundant routes.</li>
    <li>If syntactic prompts shift without changing global meaning, ask whether the mode is enforcing a
grammatical form rather than a semantic fact.</li>
  </ol>

  <p>This is how the Intelliton framework becomes useful: it turns prompt categories into hypotheses
about internal computational roles.</p>

  <hr />

  <h2 id="the-shortest-summary">The shortest summary</h2>

  <p>Different prompts light up different Intellitons because they require different hidden work.</p>

  <ul>
    <li>pronoun tracking needs discourse binding,</li>
    <li>factual recall needs stable memory retrieval,</li>
    <li>logical reasoning needs relation composition,</li>
    <li>arithmetic needs symbolic transformation,</li>
    <li>syntactic agreement needs grammatical control.</li>
  </ul>

  <p>They all look like next-token prediction from the outside. They do not look the same from inside the
residual stream.</p>

  <hr />

  <h2 id="continue-reading">Continue reading</h2>

  <ul>
    <li><a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">How to Read <code class="language-plaintext highlighter-rouge">I_0</code> to <code class="language-plaintext highlighter-rouge">I_4</code></a></li>
    <li><a href="/hallucination/applications/2026/04/04/intellitons-and-hallucination.html">Hallucination as Internal Instability</a></li>
    <li><a href="/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html">Refusal as an Intelliton</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="section">外面看都像续写，里面做的却不是同一种活</h2>

  <p>从外面看，这个项目里的所有提示词都很像：模型读入前缀，然后预测接下来的 token。</p>

  <p>但如果往模型内部看，这种相似性其实很有迷惑性。</p>

  <p><code class="language-plaintext highlighter-rouge">src/datasets.py</code> 里的几类提示词，会迫使网络去解决完全不同的内部问题。这也是为什么它们虽
然都表现成普通的文本续写，却会点亮不同的 Intelliton 模式。</p>

  <p>最关键的一句话是：</p>

  <blockquote>
    <p>输出接口永远都是 next-token prediction，但为了走到这个输出，模型内部需要完成的计算工作可
以很不一样。</p>
  </blockquote>

  <hr />

  <h2 id="section-1">项目里用了五类提示词</h2>

  <p>项目里的提示词主要分成五类：</p>

  <ol>
    <li>pronoun tracking</li>
    <li>factual recall</li>
    <li>logical reasoning</li>
    <li>arithmetic</li>
    <li>syntactic agreement</li>
  </ol>

  <p>每一类都在给模型内部的不同部件施加压力。</p>

  <hr />

  <h2 id="she-">代词跟踪：这句里的 “she” 到底指谁？</h2>

  <p>典型例子包括：</p>

  <ul>
    <li>“Alice gave Bob a book. He thanked her for …”</li>
    <li>“The teacher asked the student a question. She answered …”</li>
  </ul>

  <p>这类提示词之所以难，是因为模型要同时保留多个候选实体，然后再决定下一个代词到底该指向
哪一个。</p>

  <p>这意味着模型必须追踪：</p>

  <ul>
    <li>实体是谁</li>
    <li>性别和单复数线索</li>
    <li>语篇角色</li>
    <li>当前哪个先行词最活跃</li>
  </ul>

  <p>所以代词跟踪任务很容易点亮那些对指代敏感的模式。模型不只是选一个词，它还在做一整套语
篇记账。</p>

  <hr />

  <h2 id="section-2">事实回忆：从记忆里拉出一个稳定答案</h2>

  <p>典型例子包括：</p>

  <ul>
    <li>“The capital of France is …”</li>
    <li>“The chemical formula for water is …”</li>
  </ul>

  <p>这和代词任务不一样，因为这里通常已经存在一个非常强的候选答案，模型要做的更多是把它从
长期记忆里取出来并稳定住。</p>

  <p>核心工作不是在句内多个候选之间来回权衡，而是提取并巩固一个高置信的续写。</p>

  <p>这也是为什么事实回忆在小扰动下往往更稳。像 “France -&gt; Paris” 这种映射，通常不是靠一条
脆弱单通道支撑，而是有几条冗余内部路径在共同支持。</p>

  <hr />

  <h2 id="section-3">逻辑推理：先把前提揉成结论，再落词</h2>

  <p>典型例子包括：</p>

  <ul>
    <li>“If all dogs are animals, and all animals are living things, then all dogs are …”</li>
    <li>“If A is taller than B, and B is taller than C, then A is …”</li>
  </ul>

  <p>这类提示词要求模型在输出下一个 token 之前，先把多条前提组合起来。</p>

  <p>所以网络需要的不只是词汇记忆，还需要一种能把规则、关系和目标结论暂时维持在一起的内部
状态，直到答案真正落出来。</p>

  <p>这也是为什么逻辑推理经常会同时点亮一个强全局底座模式，再加上一两个更高复杂度的混合模
式。</p>

  <hr />

  <h2 id="section-4">算术：先把答案槽位搭起来，再把数值放进去</h2>

  <p>典型例子包括：</p>

  <ul>
    <li>“What is 7 + 8? The answer is …”</li>
    <li>“What is 100 divided by 5? The answer is …”</li>
  </ul>

  <p>算术和逻辑推理有一个相同点：答案不是一个能立刻凭语料频率吐出来的高频词，模型往往需要
先把前缀变成更结构化的内部状态。</p>

  <p>这通常包含两种工作：</p>

  <ul>
    <li>建立或稳定一个承载答案的内部状态</li>
    <li>完成一个小型符号或数值变换</li>
  </ul>

  <p>所以算术题常常会和逻辑题共享一部分模式，但同时又保留它自己的任务偏好。</p>

  <hr />

  <h2 id="section-5">句法一致性：把句子在语法上维持住</h2>

  <p>典型例子包括：</p>

  <ul>
    <li>“The group of students were studying hard. Each of them was …”</li>
    <li>“Not only the teacher but also the students were excited about the …”</li>
  </ul>

  <p>这类提示词的难点，既不主要是世界知识，也不主要是算术，而是句法结构本身：</p>

  <ul>
    <li>真正的句法中心是谁</li>
    <li>单复数一致性如何保持</li>
    <li>当前应该落下哪种词形或续写形式</li>
  </ul>

  <p>因此，句法一致性任务通常会依赖一个比较广的续写底座，再加上一条更关注局部结构修正的信
号。</p>

  <hr />

  <h2 id="section-6">为什么都是低动量，也完全可能分工不同</h2>

  <p>一个很容易犯的错误是：如果好几个物种都靠近低动量，那它们是不是就在做同一件事？</p>

  <p>并不是。</p>

  <p>低动量只说明它们都是覆盖序列尺度的大模式，而不是绑在某个 token 上的小波纹。即便如此，
两个低动量模式仍然可以在至少三点上完全不同：</p>

  <ol>
    <li>它们可以指向不同的 hidden-channel 方向</li>
    <li>它们的振幅和因果强度可以不同</li>
    <li>它们跨层传播的方式可以不同</li>
  </ol>

  <p>所以，两个模式都很“全局”，不代表它们的内部工作内容也一样。</p>

  <hr />

  <h2 id="section-7">一份实用读法</h2>

  <p>如果你想快速读懂“任务类型和模式激活”的对应关系，可以用下面这张小清单。</p>

  <ol>
    <li>如果代词提示词对某个模式特别敏感，先问它是不是在帮模型做先行词选择。</li>
    <li>如果算术和逻辑推理同时点亮某个模式，先问它是不是在构建抽象答案状态，而不只是回忆固定
短语。</li>
    <li>如果事实回忆在扰动下仍然很稳，先问知识是不是被分布在几条冗余通路上。</li>
    <li>如果句法任务会变、但全局语义没有变，先问这个模式是不是在约束语法形式，而不是语义事实。</li>
  </ol>

  <p>Intelliton 框架的用处就在这里：它把任务类别变成了对内部计算角色的可检验假设。</p>

  <hr />

  <h2 id="section-8">最短总结</h2>

  <p>不同提示词会点亮不同 Intelliton，是因为它们要求模型完成的隐藏工作不同。</p>

  <ul>
    <li>代词跟踪要做语篇绑定</li>
    <li>事实回忆要做稳定记忆提取</li>
    <li>逻辑推理要做关系组合</li>
    <li>算术要做符号变换</li>
    <li>句法一致性要做语法控制</li>
  </ul>

  <p>从外面看，它们都像 next-token prediction。从残差流内部看，它们一点也不像同一种计算。</p>

  <hr />

  <h2 id="section-9">继续阅读</h2>

  <ul>
    <li><a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">怎么看 <code class="language-plaintext highlighter-rouge">I_0</code> 到 <code class="language-plaintext highlighter-rouge">I_4</code></a></li>
    <li><a href="/hallucination/applications/2026/04/04/intellitons-and-hallucination.html">把幻觉理解为内部不稳定性</a></li>
    <li><a href="/alignment/safety/research-directions/2026/04/06/refusal-as-intelliton.html">拒绝即 Intelliton</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="applications" /><category term="tasks" /><category term="interpretation" /><summary type="html"><![CDATA[All prompt categories look like next-token prediction from the outside, but inside the model they ask for different kinds of work. This article uses the project's five prompt families to explain why different Intelliton modes become active.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Hallucination as Internal Instability: An Intelliton Perspective</title><link href="https://intellitons.wiki/hallucination/applications/2026/04/04/intellitons-and-hallucination.html" rel="alternate" type="text/html" title="Hallucination as Internal Instability: An Intelliton Perspective" /><published>2026-04-04T00:00:00+00:00</published><updated>2026-04-04T00:00:00+00:00</updated><id>https://intellitons.wiki/hallucination/applications/2026/04/04/intellitons-and-hallucination</id><content type="html" xml:base="https://intellitons.wiki/hallucination/applications/2026/04/04/intellitons-and-hallucination.html"><![CDATA[<div data-lang="en">

  <h2 id="beyond-wrong-output">Beyond “wrong output”</h2>

  <p>When a language model hallucinates, the surface-level observation is simple: it produces text that
is incorrect, unsupported, or fabricated. But this description raises a deeper question: what is
<em>happening inside the model</em> when it hallucinates?</p>

  <p>One common intuition is that hallucination is random — a kind of noise or statistical accident in
the token prediction process. Another is that it reflects gaps in training data. Both of these
accounts may be partially right, but they are not mechanistic: they do not tell us <em>where</em> in the
model the failure originates, or whether it corresponds to a detectable internal signal.</p>

  <p>The Intelliton framework offers a different angle. Instead of treating hallucination as a property
of the output, it treats it as a property of the <strong>internal dynamical trajectory</strong> during generation.</p>

  <p>The central hypothesis is this:</p>

  <blockquote>
    <p><strong>Hallucination may correspond to a regime of weaker, less coherent, and more fragmented
Intelliton activity — a trajectory that stays farther from the “grounded sector” of the model’s
internal quasi-particle space.</strong></p>
  </blockquote>

  <p>This article explains the evidence for that hypothesis, focusing on <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> as the primary
example.</p>

  <hr />

  <h2 id="how-hallucination-is-studied-in-the-intelliton-framework">How hallucination is studied in the Intelliton framework</h2>

  <p>The module <code class="language-plaintext highlighter-rouge">src/hallucination_diagnostic.py</code> compares two types of prompts:</p>

  <ul>
    <li><strong>Grounded prompts</strong>: questions that have factual, verifiable answers the model has likely
encountered in training (for example, “What is the capital of France?”).</li>
    <li><strong>Hallucination-prone prompts</strong>: questions designed to invite confabulation — factoid-sounding
questions about obscure, ambiguous, or partially fabricated information that the model is likely
to “fill in” plausibly but incorrectly.</li>
  </ul>

  <p>For each prompt type, the analysis computes several internal metrics:</p>

  <table>
    <thead>
      <tr>
        <th>Metric</th>
        <th>What it measures</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><strong>Singular-value spectrum divergence</strong></td>
        <td>How different the activation modes are from the grounded baseline</td>
      </tr>
      <tr>
        <td><strong>Coherence</strong></td>
        <td>How concentrated and stable the dominant singular modes are</td>
      </tr>
      <tr>
        <td><strong>Mode stability</strong></td>
        <td>Whether the dominant species remains consistent across generation steps</td>
      </tr>
      <tr>
        <td><strong>Entropy gap</strong></td>
        <td>How spread out the energy is across modes</td>
      </tr>
      <tr>
        <td><strong>Critical layers</strong></td>
        <td>Which layers show the largest divergence from grounded behaviour</td>
      </tr>
    </tbody>
  </table>

  <p>These metrics are computed <em>during generation</em> — step by step, as the model produces each new
token — not just at the final output.</p>
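
  <p>To make the table more concrete, here is a minimal sketch of how two of these quantities could be
computed from a single activation matrix. The definitions are assumptions chosen for illustration, not
the formulas used in <code class="language-plaintext highlighter-rouge">src/hallucination_diagnostic.py</code>: coherence is taken to be the energy share of
the leading singular mode, and the entropy gap is taken to be a difference of spectral entropies. The
function names and the toy matrices are hypothetical.</p>

```python
import numpy as np

def spectral_metrics(activations, top_k=1):
    """Return (coherence, entropy) for a tokens-by-channels activation matrix."""
    # Singular values only; compute_uv=False skips the singular vectors.
    s = np.linalg.svd(np.asarray(activations, dtype=float), compute_uv=False)
    energy = s ** 2
    p = energy / energy.sum()            # normalised energy per singular mode
    coherence = float(p[:top_k].sum())   # energy share of the leading mode(s)
    nonzero = p[p > 0]                   # drop exact zeros before taking logs
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    return coherence, entropy

def entropy_gap(sample, baseline):
    """Spectral entropy of `sample` minus that of `baseline` (assumed definition)."""
    return spectral_metrics(sample)[1] - spectral_metrics(baseline)[1]

rng = np.random.default_rng(0)
# Rank-one pattern plus faint noise: energy concentrated in one mode.
concentrated = np.outer(rng.normal(size=16), rng.normal(size=32))
concentrated += 0.01 * rng.normal(size=(16, 32))
# Pure noise: energy spread over many modes.
diffuse = rng.normal(size=(16, 32))

print(spectral_metrics(concentrated))
print(spectral_metrics(diffuse))
print(entropy_gap(diffuse, concentrated))
```

  <p>In this toy setting the concentrated matrix has coherence close to 1 and low entropy, while the
noise matrix spreads its energy over many modes. That is the kind of contrast the coherence and
entropy rows of the table are meant to capture.</p>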

  <p><img src="/assets/images/Qwen3-4B-Base/hallucination_diagnostics.png" alt="Qwen3-4B-Base hallucination diagnostics" />
<em>Hallucination diagnostics for Qwen3-4B-Base. The figure compares spectral signatures between
grounded and hallucination-prone prompts.</em></p>

  <hr />

  <h2 id="the-trajectory-evidence">The trajectory evidence</h2>

  <p>The generation-time trajectory data provides the clearest picture. The file
<code class="language-plaintext highlighter-rouge">intelliton_trajectory_summary.csv</code> for <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> records, for each generation step and each
prompt type, the mean mode activation shift and the grounded deviation.</p>

  <h3 id="grounded-prompts-rising-and-coherent">Grounded prompts: rising and coherent</h3>

  <p>For grounded factual prompts, the mean mode activation shift starts around <strong>1.10</strong> and rises to
about <strong>1.37</strong> over the first 8 generation steps. Top species occupation also rises, from about
<strong>70.1% to 74.1%</strong>.</p>

  <p>This means that as the model commits to a factual answer, the dominant Intelliton sector becomes
<strong>stronger and more organised</strong>. The model is moving toward a more concentrated, coherent internal
state.</p>

  <h3 id="hallucination-prone-prompts-weak-and-diverging">Hallucination-prone prompts: weak and diverging</h3>

  <p>For hallucination-prone prompts, the picture is strikingly different.</p>

  <p>The mean mode activation shift stays between only <strong>0.32 and 0.41</strong> throughout generation,
roughly one third of the grounded values of 1.10 to 1.37. The grounded deviation, which measures the
trajectory’s offset from a baseline grounded trajectory, remains strongly negative (roughly
<strong>-9 to -11</strong> across most generation steps).</p>

  <p>In plain language: hallucination-prone generation produces an internal trajectory that is both
<strong>weaker in overall activation</strong> and <strong>farther from the grounded sector of Intelliton space</strong>.</p>

  <p>The hallucination case is not just “wrong output at the end”. It is a persistently different
internal state throughout the generation process.</p>
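
  <p>The “roughly one third” figure can be checked directly from the numbers quoted above. The sketch
below does not assume which generation step each value belongs to. It only brackets the ratio by
pairing the extremes of the two quoted ranges.</p>

```python
# Activation-shift ranges quoted in the text for Qwen3-4B-Base.
grounded_shift = (1.10, 1.37)
hallucination_shift = (0.32, 0.41)

# Smallest possible ratio: lowest hallucination value over highest grounded value.
lowest_ratio = min(hallucination_shift) / max(grounded_shift)
# Largest possible ratio: highest hallucination value over lowest grounded value.
highest_ratio = max(hallucination_shift) / min(grounded_shift)

print(round(lowest_ratio, 2), round(highest_ratio, 2))  # prints: 0.23 0.37
```

  <p>Whatever the step-by-step pairing is, the hallucination-prone shift sits somewhere between about a
quarter and a bit over a third of the grounded shift.</p>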

  <h3 id="style-prompts-the-intermediate-case">Style prompts: the intermediate case</h3>

  <p>Stylistic continuation prompts — prompts asking the model to continue a piece of creative writing
without strong factual constraints — occupy an intermediate position. Their activation shift is
higher than hallucination-prone prompts but lower than grounded factual prompts.</p>

  <p>This is a useful calibration check: style generation is not a failure, but it is also not
anchored to factual grounding, so a sensible metric should place it between the two extremes. The
Intelliton activation shift does.</p>

  <p><img src="/assets/images/Qwen3-4B-Base/intelliton_trajectory_merged.png" alt="Qwen3-4B-Base Intelliton trajectory" />
<em>Generation-time Intelliton trajectories for Qwen3-4B-Base. Grounded (top), style (middle), and
hallucination-prone (bottom) prompts show qualitatively different internal dynamical profiles.</em></p>

  <hr />

  <h2 id="transition-graphs-which-species-dominate-stable-generation">Transition graphs: which species dominate stable generation</h2>

  <p>The Intelliton transition graph shows which species transitions are most common during generation,
and how strong those transitions are in terms of mode activation.</p>

  <p>For <strong>grounded prompts</strong> in <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code>, the dominant self-transitions are:</p>

  <table>
    <thead>
      <tr>
        <th>Transition</th>
        <th>Count</th>
        <th>Mean target activation shift (qualitative)</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_5 → I_5</code></td>
        <td><strong>110</strong></td>
        <td>(strong)</td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_1 → I_1</code></td>
        <td>13</td>
        <td>(moderate)</td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_2 → I_2</code></td>
        <td>6</td>
        <td>(moderate)</td>
      </tr>
    </tbody>
  </table>

  <p>The generation stays largely within <code class="language-plaintext highlighter-rouge">I_5</code> — the factual-recall species — with occasional excursions
into <code class="language-plaintext highlighter-rouge">I_1</code> (logical reasoning) and <code class="language-plaintext highlighter-rouge">I_2</code> (arithmetic).</p>

  <p>For <strong>hallucination-prone prompts</strong>, <code class="language-plaintext highlighter-rouge">I_5 → I_5</code> remains the most common transition, but its mean
target activation shift is much smaller. There is also more mixing among <code class="language-plaintext highlighter-rouge">I_1</code>, <code class="language-plaintext highlighter-rouge">I_3</code>, and <code class="language-plaintext highlighter-rouge">I_5</code>,
suggesting that the internal trajectory becomes more fragmented and less dominated by any single
species.</p>
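
  <p>A transition count of this kind is simple to build once each generation step has been labelled
with its dominant species. The sketch below assumes such a label sequence is already available. The
helper names and the two toy sequences are invented for illustration and are not taken from the
project’s data.</p>

```python
from collections import Counter

def transition_counts(species_sequence):
    """Count consecutive (source, target) pairs in a list of dominant-species labels."""
    return Counter(zip(species_sequence, species_sequence[1:]))

def self_loop_fraction(counts):
    """Fraction of all transitions that stay within the same species."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    stay = sum(n for (src, dst), n in counts.items() if src == dst)
    return stay / total

# Toy label sequences, invented for illustration only.
grounded_like = ["I_5", "I_5", "I_5", "I_5", "I_1", "I_1", "I_5", "I_5"]
fragmented_like = ["I_5", "I_1", "I_3", "I_5", "I_5", "I_3", "I_1", "I_5"]

for name, seq in [("grounded-like", grounded_like), ("fragmented-like", fragmented_like)]:
    counts = transition_counts(seq)
    print(name, counts.most_common(1), round(self_loop_fraction(counts), 2))
```

  <p>A high self-loop fraction corresponds to the “strong self-loops” pattern described for grounded
prompts, and a low one to the more mixed pattern described for hallucination-prone prompts.</p>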

  <p><img src="/assets/images/Qwen3-4B-Base/intelliton_transition_graph.png" alt="Qwen3-4B-Base transition graph" />
<em>Species transition graph for Qwen3-4B-Base. Grounded generation is dominated by strong self-loops;
hallucination shows a weaker, more mixed pattern.</em></p>

  <hr />

  <h2 id="a-multiple-failure-mode-picture-of-hallucination">A multiple-failure-mode picture of hallucination</h2>

  <p>One of the most useful aspects of the Intelliton framework is that it suggests hallucination is not
necessarily a single phenomenon. The trajectory data is consistent with at least four distinct
failure modes:</p>

  <ol>
    <li><strong>Grounded excitation decay</strong>: the leading Intelliton for factual tasks (here <code class="language-plaintext highlighter-rouge">I_5</code>) fails to
maintain its activation, causing the model to “lose grip” on the factual sector.</li>
    <li><strong>Species fragmentation</strong>: instead of a dominant self-loop, the trajectory becomes a mixture of
several species without a clear attractor.</li>
    <li><strong>Spectral broadening</strong>: the singular-value spectrum spreads out, indicating a loss of coherence
in the dominant collective modes.</li>
    <li><strong>Distance from grounded baseline</strong>: the trajectory drifts away from the region of Intelliton
space that characterises correct factual generation.</li>
  </ol>

  <p>Whether a given hallucination episode involves one, several, or all four of these failure modes
may depend on the specific model and the type of hallucination. But the framework gives us language
and metrics for distinguishing them.</p>
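
  <p>As a rough illustration of how the four modes could be told apart in practice, the sketch below
labels a single generation step by comparing it with a grounded reference step. Everything here is
hypothetical: the <code class="language-plaintext highlighter-rouge">StepMetrics</code> fields, the thresholds, and most of the example values are
assumptions made for the example. Only the shifts 1.37 and 0.41, the occupation 74.1%, and a
deviation near -10 echo numbers quoted in this article.</p>

```python
from dataclasses import dataclass

@dataclass
class StepMetrics:
    leading_shift: float       # activation shift of the leading species
    top_occupation: float      # share of activity held by the dominant species (0 to 1)
    spectral_entropy: float    # spread of the singular-value spectrum
    grounded_deviation: float  # offset from the grounded baseline

def failure_modes(step, reference, shift_ratio=0.5, occupation_floor=0.5,
                  entropy_margin=0.5, deviation_limit=5.0):
    """Return the failure-mode labels that apply to `step` (illustrative thresholds)."""
    modes = []
    if shift_ratio * reference.leading_shift > step.leading_shift:
        modes.append("grounded excitation decay")
    if occupation_floor > step.top_occupation:
        modes.append("species fragmentation")
    if step.spectral_entropy - reference.spectral_entropy > entropy_margin:
        modes.append("spectral broadening")
    if abs(step.grounded_deviation - reference.grounded_deviation) > deviation_limit:
        modes.append("distance from grounded baseline")
    return modes

# Reference step resembling the grounded case; entropy value is invented.
reference = StepMetrics(1.37, 0.741, 1.0, 0.0)
# Step resembling the hallucination-prone case; occupation and entropy are invented.
suspect = StepMetrics(0.41, 0.45, 1.2, -10.0)

print(failure_modes(suspect, reference))
```

  <p>The point of returning a list rather than a single verdict is that the modes are not mutually
exclusive: a step can show decay and drift without showing spectral broadening.</p>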

  <hr />

  <h2 id="implications-and-future-directions">Implications and future directions</h2>

  <p>If this picture holds up under further investigation, it opens several practical possibilities.</p>

  <h3 id="early-warning-signals">Early warning signals</h3>

  <p>In the <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> data, the deviation from grounded Intelliton trajectories is already
visible in the first few generation steps. If that holds more generally, potential hallucinations
could in principle be flagged <em>before</em> the full output is produced, which could form the basis of
a hallucination early-warning system.</p>
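
  <p>A very simple version of such a warning could look at the mean activation shift over the first few
steps and compare it with a threshold. The sketch below is only that: the function name, the window
size, and the threshold of 0.7 are assumptions, and apart from the starting values 1.10 and 0.32 the
example sequences are made up.</p>

```python
def early_warning(shifts, n_steps=3, threshold=0.7):
    """Return True if the mean shift over the first n_steps falls below threshold."""
    window = list(shifts)[:n_steps]
    if not window:
        raise ValueError("need at least one generation step")
    mean_shift = sum(window) / len(window)
    return threshold > mean_shift

# Illustrative sequences: only the first value of each echoes the text.
grounded_like = [1.10, 1.15, 1.20, 1.25]
hallucination_like = [0.32, 0.35, 0.38, 0.40]

print(early_warning(grounded_like))        # prints: False
print(early_warning(hallucination_like))   # prints: True
```

  <p>A real detector would need a threshold calibrated on held-out prompts, and, as the caveats below
note, a flagged trajectory is evidence of a hallucination-prone regime rather than proof of a wrong
answer.</p>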

  <h3 id="intervention-on-unstable-species">Intervention on unstable species</h3>

  <p>If a particular species is identified as responsible for grounded, factual generation, it may be
possible to stabilise or amplify that species during inference using model steering techniques. The
codebase already includes modules such as <code class="language-plaintext highlighter-rouge">src/gauge_intervention.py</code> that hint at this direction.</p>
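
  <p>One generic way to amplify a species during inference is to rescale the component of a hidden state
that lies along that species’ direction. The sketch below shows this standard projection-based
steering step with NumPy. It is not a description of what
<code class="language-plaintext highlighter-rouge">src/gauge_intervention.py</code> does: the function name, the <code class="language-plaintext highlighter-rouge">gain</code> parameter, and the idea
that a species is summarised by a single direction vector are all assumptions.</p>

```python
import numpy as np

def amplify_direction(hidden, direction, gain=1.5):
    """Scale the component of `hidden` along `direction` by `gain`; leave the rest untouched."""
    u = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(u)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    u = u / norm                              # unit vector for the species direction
    h = np.asarray(hidden, dtype=float)
    coeff = h @ u                             # projection coefficient per hidden vector
    # h = parallel part + orthogonal part; only the parallel part is rescaled.
    return h + (gain - 1.0) * np.expand_dims(coeff, -1) * u

hidden_state = np.array([3.0, 4.0])
species_direction = np.array([2.0, 0.0])     # hypothetical direction for a species

print(amplify_direction(hidden_state, species_direction, gain=2.0))
```

  <p>With <code class="language-plaintext highlighter-rouge">gain=1.0</code> the state is returned unchanged, and values above 1 strengthen only the chosen
direction. Whether such an edit actually reduces hallucination is exactly the causal question left
open in the caveats.</p>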

  <h3 id="prompt-strategies-for-grounded-generation">Prompt strategies for grounded generation</h3>

  <p>If certain prompts consistently lead to strong grounded Intelliton trajectories, understanding their
structure could inform better prompting strategies — ways to keep the model inside the grounded
sector of its internal space.</p>

  <h3 id="model-comparison-by-internal-stability">Model comparison by internal stability</h3>

  <p>The Intelliton hallucination metric provides a new axis for comparing models — not just by accuracy
on a benchmark, but by the robustness and coherence of their internal factual-grounding sector. A
model with a stronger, more stable <code class="language-plaintext highlighter-rouge">I_5</code>-like species may be inherently more reliable for factual
tasks.</p>

  <hr />

  <h2 id="caveats-and-open-questions">Caveats and open questions</h2>

  <p>As with all findings in this project, several important caveats apply.</p>

  <p><strong>The hallucination-prone prompts are designed, not naturally occurring.</strong> The distinction between
“grounded” and “hallucination-prone” is imposed by the prompt design. In real-world use, the
boundary is less clear.</p>

  <p><strong>Correlation is not causation.</strong> The Intelliton trajectory differences are <em>associated</em> with
hallucination-prone prompts, but it has not yet been established that fixing the trajectory would
prevent hallucination.</p>

  <p><strong>The pipeline involves design choices.</strong> Different prompt sets, sequence lengths, and analysis
parameters could produce different catalogues and possibly different conclusions.</p>

  <p><strong>The metric is relative, not absolute.</strong> The “grounded deviation” is measured relative to a
baseline grounded trajectory. Its meaning depends on the quality of that baseline.</p>

  <p>Despite these caveats, the structural pattern — grounded generation being internally stronger,
more coherent, and closer to a well-defined attractor — is consistent across all generation steps
and both task splits examined in <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code>.</p>

  <hr />

  <h2 id="where-this-series-has-taken-us">Where this series has taken us</h2>

  <p>This final article completes the four-part popular science series on Intellitons:</p>

  <ol>
    <li><strong><a href="/introduction/theory/2026/04/01/what-are-intellitons.html">What Are Intellitons?</a></strong> — The quasi-particle
idea and why it might apply to transformers.</li>
    <li><strong><a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">Inside Qwen3-4B-Base</a></strong> — A
detailed walkthrough of a model’s complete Intelliton catalogue.</li>
    <li><strong><a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">Scaling and Alignment</a></strong> — How
parameter count and instruction tuning reshape the internal excitation spectrum.</li>
    <li><strong>Hallucination as Internal Instability</strong> (this article) — Hallucination as a detectable
internal dynamical regime.</li>
  </ol>

  <p>The Intelliton framework is a young and exploratory research programme. But its outputs are concrete,
its comparisons are reproducible, and its language is — at least arguably — more informative than
treating language models as opaque statistical engines.</p>

  <p>The goal is to develop a vocabulary that makes the internal life of neural networks legible. Whether
Intellitons are ultimately the right vocabulary remains to be seen. But the evidence so far suggests
they are pointing at something real.</p>

  <hr />

  <h2 id="further-reading">Further reading</h2>

  <ul>
    <li>Explore the full codebase at <a href="https://github.com/xiongzhp/Intelliton">github.com/xiongzhp/Intelliton</a></li>
    <li>Read the accompanying technical paper (<code class="language-plaintext highlighter-rouge">intelliton_arxiv_paper.pdf</code>) for the formal analysis</li>
    <li>Return to <a href="/introduction/theory/2026/04/01/what-are-intellitons.html">Article 1: What Are Intellitons?</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="section">不只是“输出错了”</h2>

  <p>当语言模型出现幻觉时，表层观察很简单：它生成了错误、缺乏依据，甚至纯属捏造的文本。但
这个描述会引出更深一层的问题：模型在幻觉发生时，<em>内部到底发生了什么</em>？</p>

  <p>一种常见直觉认为，幻觉是随机噪声，是 token 预测过程里的统计偶然；另一种看法则认为，幻
觉主要反映训练数据的缺口。两者都可能部分正确，但都不够“机制化”：它们并没有告诉我们，
故障究竟起源于模型的哪里，是否对应某种可检测的内部信号。</p>

  <p>Intelliton 框架提供了不同角度。它不把幻觉看作输出属性，而是把它看作生成过程中的
<strong>内部动力学轨迹</strong> 属性。</p>

  <p>核心假设可以概括为：</p>

  <blockquote>
    <p><strong>幻觉可能对应一种更弱、更不相干、也更碎片化的 Intelliton 活动区间，也就是一条离模型
“grounded 扇区”更远的内部准粒子轨迹。</strong></p>
  </blockquote>

  <p>这篇文章会围绕 <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> 解释支撑这一假设的证据。</p>

  <hr />

  <h2 id="intelliton-">在 Intelliton 框架里，幻觉是怎么研究的</h2>

  <p><code class="language-plaintext highlighter-rouge">src/hallucination_diagnostic.py</code> 这个模块比较两类提示词：</p>

  <ul>
    <li><strong>Grounded prompts</strong>：答案有事实依据、可验证，而且模型大概率在训练中见过的问题，例如
“法国的首都是哪里？”</li>
    <li><strong>Hallucination-prone prompts</strong>：专门设计来诱发编造的问题，也就是那些听上去像事实问答、
但内容冷门、歧义大，甚至部分虚构的问题。模型很可能会“顺着语气补全”，却补出错误答案。</li>
  </ul>

  <p>对每类提示词，分析会计算多种内部指标：</p>

  <table>
    <thead>
      <tr>
        <th>指标</th>
        <th>含义</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><strong>奇异值谱散度</strong></td>
        <td>激活模式与 grounded 基线相比有多不同</td>
      </tr>
      <tr>
        <td><strong>相干性</strong></td>
        <td>主导奇异模态有多集中、多稳定</td>
      </tr>
      <tr>
        <td><strong>模式稳定性</strong></td>
        <td>主导物种能否在生成步之间保持一致</td>
      </tr>
      <tr>
        <td><strong>熵差</strong></td>
        <td>能量在不同模式之间分散得有多开</td>
      </tr>
      <tr>
        <td><strong>关键层</strong></td>
        <td>哪些层偏离 grounded 行为最明显</td>
      </tr>
    </tbody>
  </table>

  <p>这些量是在 <em>生成过程中</em> 逐步计算的，也就是模型每产生一个新 token 就更新一次，而不是只
在最终输出后才做分析。</p>

  <p><img src="/assets/images/Qwen3-4B-Base/hallucination_diagnostics.png" alt="Qwen3-4B-Base hallucination diagnostics" />
<em>Qwen3-4B-Base 的幻觉诊断图，对比了 grounded 与 hallucination-prone 提示词的谱特征。</em></p>

  <hr />

  <h2 id="section-1">轨迹证据：最清楚的图像</h2>

  <p>生成期轨迹数据给出了最直观的画面。<code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> 的 <code class="language-plaintext highlighter-rouge">intelliton_trajectory_summary.csv</code>
记录了每个生成步、每种提示词类型下的平均模式激活位移和 grounded deviation。</p>

  <h3 id="grounded-">Grounded 提示词：逐步增强而且更相干</h3>

  <p>对 grounded 的事实型提示词，平均模式激活位移从大约 <strong>1.10</strong> 起步，在前 8 个生成步骤里上
升到 <strong>1.37</strong> 左右。主导物种占据度也同步上升，从大约 <strong>70.1% 提升到 74.1%</strong>。</p>

  <p>这意味着，当模型逐渐锁定一个有事实依据的答案时，主导的 Intelliton 扇区会变得 <strong>更强、更
有组织</strong>。模型正在向一个更集中、更相干的内部状态收敛。</p>

  <h3 id="hallucination-prone-">Hallucination-prone 提示词：更弱，而且持续偏离</h3>

  <p>对容易诱发幻觉的提示词，图景就完全不同了。</p>

  <p>平均模式激活位移在整个生成过程中只维持在 <strong>0.32 到 0.41</strong> 之间，大约只有 grounded 情况
的三分之一。与此同时，grounded deviation 始终保持明显负值，大致在 <strong>-9 到 -11</strong> 之间。</p>

  <p>用直白的话说：幻觉倾向型生成对应的是一条内部轨迹，它既 <strong>整体激活更弱</strong>，也 <strong>离 grounded
扇区更远</strong>。</p>

  <p>重要的是，这并不只是“最后一句答错了”。从生成一开始，内部状态就已经持续表现为另一种
动力学区间。</p>

  <h3 id="section-2">风格续写：位于中间地带</h3>

  <p>如果提示词是风格化续写，也就是要求模型继续写一段创意文本，而不是给出事实答案，那么它
的激活位移会处在 grounded 与 hallucination-prone 之间。</p>

  <p>这是一个很有意义的校准结果：风格续写并不等于失败，但它也不被事实 grounding 锚定。
Intelliton 指标把它合理地放在了两极之间。</p>

  <p><img src="/assets/images/Qwen3-4B-Base/intelliton_trajectory_merged.png" alt="Qwen3-4B-Base Intelliton trajectory" />
<em>Qwen3-4B-Base 的生成期 Intelliton 轨迹。grounded（上）、style（中）与 hallucination-prone
（下）提示词呈现出定性上不同的内部动力学轮廓。</em></p>

  <hr />

  <h2 id="section-3">转移图：稳定生成由哪些物种主导</h2>

  <p>Intelliton 转移图展示的是：生成过程中哪些物种之间的跃迁最常见，以及这些跃迁对应的模式激
活有多强。</p>

  <p>对 <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> 的 <strong>grounded 提示词</strong>，最主要的自跃迁是：</p>

  <table>
    <thead>
      <tr>
        <th>转移</th>
        <th>次数</th>
        <th>目标激活平均位移</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_5 → I_5</code></td>
        <td><strong>110</strong></td>
        <td>（强）</td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_1 → I_1</code></td>
        <td>13</td>
        <td>（中等）</td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_2 → I_2</code></td>
        <td>6</td>
        <td>（中等）</td>
      </tr>
    </tbody>
  </table>

  <p>也就是说，生成过程大部分时间都停留在 <code class="language-plaintext highlighter-rouge">I_5</code>，也就是事实回忆物种中，只偶尔偏向 <code class="language-plaintext highlighter-rouge">I_1</code>
（逻辑推理）和 <code class="language-plaintext highlighter-rouge">I_2</code>（算术）。</p>

  <p>而对 <strong>hallucination-prone 提示词</strong> 来说，虽然 <code class="language-plaintext highlighter-rouge">I_5 → I_5</code> 仍然是最常见跃迁，但它的平均
目标激活位移小得多。同时 <code class="language-plaintext highlighter-rouge">I_1</code>、<code class="language-plaintext highlighter-rouge">I_3</code>、<code class="language-plaintext highlighter-rouge">I_5</code> 之间的混合也更多，说明内部轨迹更碎片化，更
缺乏单一主导吸引子。</p>

  <p><img src="/assets/images/Qwen3-4B-Base/intelliton_transition_graph.png" alt="Qwen3-4B-Base transition graph" />
<em>Qwen3-4B-Base 的物种转移图。grounded 生成主要由强自环主导；hallucination 则更弱、更混杂。</em></p>

  <hr />

  <h2 id="section-4">幻觉可能不是单一失败，而是多种失败模式</h2>

  <p>Intelliton 框架最有价值的一点是，它暗示幻觉不一定是一种单一现象。轨迹数据至少与四种不同
失败模式相符合：</p>

  <ol>
    <li><strong>Grounded 激发衰减</strong>：负责事实任务的主导 Intelliton（这里是 <code class="language-plaintext highlighter-rouge">I_5</code>）没能维持激活，模
型因此“失去对事实扇区的抓握”；</li>
    <li><strong>物种碎片化</strong>：不再有强主导自环，轨迹变成多个物种的混合，缺乏清晰吸引子；</li>
    <li><strong>谱展宽</strong>：奇异值谱被摊得更开，意味着主导集体模式失去相干性；</li>
    <li><strong>偏离 grounded 基线</strong>：轨迹持续漂离正确事实生成所对应的 Intelliton 空间区域。</li>
  </ol>

  <p>一次具体的幻觉事件到底会涉及其中一种还是全部几种，可能依赖于模型本身以及幻觉类型。但
至少这个框架为区分这些情况提供了语言和指标。</p>

  <hr />

  <h2 id="section-5">含义与后续方向</h2>

  <p>如果这幅图景在进一步研究中站得住脚，它会带来一些实际可能性。</p>

  <h3 id="section-6">早期预警信号</h3>

  <p>既然偏离 grounded Intelliton 轨迹的信号在生成最初几个步骤就能被检测到，那么原则上就可以
在完整输出形成之前，提前标记潜在幻觉。这可以成为幻觉早期预警系统的基础。</p>

  <h3 id="section-7">对不稳定物种进行干预</h3>

  <p>如果某个物种被识别为 grounded 事实生成的关键承担者，那么就可能通过推理时的模型 steering
技术去稳定或放大它。代码库中像 <code class="language-plaintext highlighter-rouge">src/gauge_intervention.py</code> 这样的模块，已经在暗示这一方向。</p>

  <h3 id="grounded--1">面向 grounded 生成的提示策略</h3>

  <p>如果某些提示词结构更容易产生强 grounded Intelliton 轨迹，那么理解这些结构，就可能帮助我
们设计更好的 prompting 策略，让模型尽量停留在 grounded 扇区内部。</p>

  <h3 id="section-8">从内部稳定性比较模型</h3>

  <p>Intelliton 幻觉指标提供了比较模型的新轴线。我们不只比较基准分数，也可以比较模型内部事实
 grounding 扇区的稳健性和相干性。一个拥有更强、更稳定 <code class="language-plaintext highlighter-rouge">I_5</code> 型物种的模型，可能天然更适合
事实任务。</p>

  <hr />

  <h2 id="section-9">注意事项与开放问题</h2>

  <p>和项目里的其他结果一样，这些发现也有几条重要限定。</p>

  <p><strong>Hallucination-prone 提示词是人为设计的，不是自然收集的。</strong> “grounded” 与
“hallucination-prone” 的区分是通过提示词设计施加进去的。在真实使用场景中，这条边界不会
这么清晰。</p>

  <p><strong>相关不等于因果。</strong> Intelliton 轨迹差异与幻觉倾向有关联，但目前还不能说：只要修复轨迹，
就一定能防止幻觉。</p>

  <p><strong>分析流程本身带有设计选择。</strong> 不同的提示词集合、序列长度和分析参数，可能会给出不同的
目录，也可能带来不同结论。</p>

  <p><strong>这是相对指标，而不是绝对指标。</strong> “grounded deviation” 是相对于一个 grounded 基线定义
出来的，它的含义依赖于基线本身的质量。</p>

  <p>尽管如此，有一个结构性模式是清楚的：grounded 生成在内部上更强、更相干，也更靠近一个清晰
的吸引子。这个模式在 <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> 的所有生成步和两类任务划分中都能稳定看到。</p>

  <hr />

  <h2 id="section-10">这一系列把我们带到了哪里</h2>

  <p>这篇文章为 Intelliton 四篇科普系列收尾：</p>

  <ol>
    <li><strong><a href="/introduction/theory/2026/04/01/what-are-intellitons.html">什么是 Intelliton？</a></strong>：介绍准粒子想法，
以及它为什么可能适用于变换器；</li>
    <li><strong><a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">走进 Qwen3-4B-Base</a></strong>：完整
讲解一个模型的 Intelliton 目录；</li>
    <li><strong><a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">规模扩展与对齐</a></strong>：说明参数规
模和指令微调如何重塑内部激发谱；</li>
    <li><strong>把幻觉理解为内部不稳定性</strong>（本文）：把幻觉看作一种可检测的内部动力学区间。</li>
  </ol>

  <p>Intelliton 框架仍然很年轻，也带有探索性质。但它的输出是具体的、比较是可复现的，而它提供
的语言，至少在目前看来，比“把语言模型当作黑箱统计机器”要更有解释力。</p>

  <p>项目的目标，是发展出一套能让神经网络内部生命变得可读的词汇。Intelliton 最终是不是这套词
汇，还需要时间验证。但到目前为止，证据至少说明：它指向了一些真实存在的结构。</p>

  <hr />

  <h2 id="section-11">延伸阅读</h2>

  <ul>
    <li>访问完整代码库：<a href="https://github.com/xiongzhp/Intelliton">github.com/xiongzhp/Intelliton</a></li>
    <li>阅读配套技术论文 <code class="language-plaintext highlighter-rouge">intelliton_arxiv_paper.pdf</code> 获取更形式化的分析</li>
    <li>返回 <a href="/introduction/theory/2026/04/01/what-are-intellitons.html">第 1 篇：什么是 Intelliton？</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="hallucination" /><category term="applications" /><summary type="html"><![CDATA[Hallucination — when a language model confidently produces false or unsupported information — is one of the most pressing practical problems in LLM research. This article explores what the Intelliton framework reveals about hallucination: not as an output-level mistake, but as an instability of internal collective modes during generation.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Scaling and Alignment Through the Intelliton Lens</title><link href="https://intellitons.wiki/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html" rel="alternate" type="text/html" title="Scaling and Alignment Through the Intelliton Lens" /><published>2026-04-03T00:00:00+00:00</published><updated>2026-04-03T00:00:00+00:00</updated><id>https://intellitons.wiki/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons</id><content type="html" xml:base="https://intellitons.wiki/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html"><![CDATA[<div data-lang="en">

  <h2 id="two-of-the-biggest-questions-in-llm-science">Two of the biggest questions in LLM science</h2>

  <p>Two of the most discussed phenomena in large language model research are <strong>scaling</strong> and
<strong>alignment</strong>. Scaling means training bigger models; alignment (here: instruction tuning) means
fine-tuning a model to follow instructions and behave helpfully.</p>

  <p>Both interventions are known to improve benchmark performance. But do they change the <em>internal
structure</em> of the model? Do they reshape the quasi-particle spectrum?</p>

  <p>The Intelliton analysis of five models — <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code>, <code class="language-plaintext highlighter-rouge">Qwen3-4B</code>, <code class="language-plaintext highlighter-rouge">Qwen3-8B-Base</code>, <code class="language-plaintext highlighter-rouge">Qwen3-8B</code>,
and <code class="language-plaintext highlighter-rouge">Mistral-7B-v0.3</code> — provides a concrete, data-driven answer.</p>

  <hr />

  <h2 id="base-versus-instruct-what-alignment-does">Base versus Instruct: what alignment does</h2>

  <p>The clearest comparison is between a base model and its instruction-tuned counterpart in the same
family.</p>

  <h3 id="qwen3-4b-base-vs-qwen3-4b">Qwen3-4B-Base vs. Qwen3-4B</h3>

  <table>
    <thead>
      <tr>
        <th>Property</th>
        <th>Qwen3-4B-Base</th>
        <th>Qwen3-4B</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of species</td>
        <td><strong>6</strong></td>
        <td><strong>5</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> amplitude</td>
        <td>6167.1</td>
        <td>6562.4</td>
      </tr>
      <tr>
        <td>Dominant momentum</td>
        <td><strong>k ≈ π</strong></td>
        <td><strong>k ≈ 1.885</strong></td>
      </tr>
      <tr>
        <td>Secondary momenta</td>
        <td>k ≈ 0</td>
        <td>k ≈ 1.885 (shared)</td>
      </tr>
      <tr>
        <td>Fixed-point types</td>
        <td>IR + crossover</td>
        <td>all <strong>crossover</strong></td>
      </tr>
      <tr>
        <td>Grounded profile mean</td>
        <td>32.68</td>
        <td>29.13</td>
      </tr>
    </tbody>
  </table>

  <p>The most striking difference is in <strong>momentum structure</strong>.</p>

  <p>In <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code>, the dominant mode <code class="language-plaintext highlighter-rouge">I_0</code> peaks at k ≈ π (high frequency, alternating pattern),
while the five secondary modes all peak near k ≈ 0 (low frequency, global pattern). There is a
clean split.</p>

  <p>In <code class="language-plaintext highlighter-rouge">Qwen3-4B</code> (the instruction-tuned version), the dominant mode shifts to k ≈ 1.885, and the
secondary modes also cluster around the same momentum. The model becomes more <strong>homogeneous</strong> in
its momentum structure. The clean split between backbone and task-specific modes disappears.</p>
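
  <p>For readers who want a feel for what these momentum values mean, the sketch below computes the
dominant momentum of a one-dimensional pattern along the sequence with a plain discrete Fourier
transform, using k = 2πn/N for a pattern of length N. This is an assumed reading of “momentum” for
illustration, and the helper name and the length of ten positions are hypothetical. Under that
reading, k ≈ 1.885 is numerically close to 3π/5, which is what a pattern repeating three times over
ten positions would give.</p>

```python
import cmath
import math

def dominant_momentum(signal):
    """Return the momentum k in [0, pi] that carries the most Fourier power."""
    n_pos = len(signal)
    if n_pos == 0:
        raise ValueError("signal must not be empty")
    best_n, best_power = 0, -1.0
    for n in range(n_pos // 2 + 1):
        # Discrete Fourier amplitude at momentum k = 2*pi*n / n_pos.
        amp = sum(x * cmath.exp(-2j * math.pi * n * t / n_pos)
                  for t, x in enumerate(signal))
        power = abs(amp) ** 2
        if power > best_power:
            best_n, best_power = n, power
    return 2 * math.pi * best_n / n_pos

alternating = [(-1) ** t for t in range(10)]   # sign flips at every position
uniform = [1.0] * 10                           # same value everywhere
three_cycles = [math.cos(2 * math.pi * 3 * t / 10) for t in range(10)]

print(round(dominant_momentum(alternating), 3))   # prints: 3.142
print(round(dominant_momentum(uniform), 3))       # prints: 0.0
print(round(dominant_momentum(three_cycles), 3))  # prints: 1.885
```

  <p>The alternating pattern lands at k ≈ π and the uniform pattern at k = 0, matching the “high
frequency, alternating” and “low frequency, global” descriptions used above.</p>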

  <p>The fixed-point type change is equally telling. In the base model, five of the six species are
labelled <strong>IR</strong> (settled, stable), while <code class="language-plaintext highlighter-rouge">I_0</code> is <strong>crossover</strong> (still transitioning). In the
instruct model, <strong>all</strong> species are labelled <strong>crossover</strong>. Alignment appears to push the model’s
collective modes into a more uniformly active, less settled dynamical regime.</p>

  <p>One possible interpretation:</p>

  <blockquote>
    <p><strong>Instruction tuning compresses or reorganises the internal excitation landscape into a more
uniform effective regime. The spectral diversity that exists in the base model is partially
smoothed out, and more modes are kept in a dynamically transitional state.</strong></p>
  </blockquote>

  <p>This does not mean instruction tuning is worse. It may mean the model’s degrees of freedom are being
regularised toward instruction-following behaviour, possibly at the cost of some internal
differentiation.</p>

  <p><img src="/assets/images/Qwen3-4B/particle_table.png" alt="Qwen3-4B particle table" />
<em>The Intelliton catalogue for Qwen3-4B (instruction-tuned). Compare with Qwen3-4B-Base to see
the homogenisation of momentum structure.</em></p>

  <hr />

  <h2 id="scaling-from-4b-to-8b-what-more-parameters-do">Scaling from 4B to 8B: what more parameters do</h2>

  <p>The next comparison holds model family (Qwen3) and training type (base) constant and varies the
parameter count.</p>

  <h3 id="qwen3-4b-base-vs-qwen3-8b-base">Qwen3-4B-Base vs. Qwen3-8B-Base</h3>

  <table>
    <thead>
      <tr>
        <th>Property</th>
        <th>Qwen3-4B-Base</th>
        <th>Qwen3-8B-Base</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of species</td>
        <td>6</td>
        <td><strong>7</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> amplitude</td>
        <td>6167.1</td>
        <td><strong>7908.4</strong></td>
      </tr>
      <tr>
        <td>Dominant momentum</td>
        <td>k ≈ π</td>
        <td>k ≈ π</td>
      </tr>
      <tr>
        <td>Grounded profile mean</td>
        <td>32.68</td>
        <td><strong>60.26</strong></td>
      </tr>
      <tr>
        <td>Grounded-hallucination separation</td>
        <td>moderate</td>
        <td><strong>larger</strong></td>
      </tr>
    </tbody>
  </table>

  <p>The 8B model has one more species. Its leading mode <code class="language-plaintext highlighter-rouge">I_0</code> is <strong>28% stronger</strong> in amplitude.
Most strikingly, the grounded generation profile mean nearly <strong>doubles</strong> (32.68 → 60.26).</p>
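
  <p>Both headline figures can be checked directly from the table. The snippet below is a minimal
sketch in Python; the helper name <code class="language-plaintext highlighter-rouge">relative_change</code> is ours, and only the four table values
come from the analysis.</p>

```python
def relative_change(old: float, new: float) -> float:
    """Return the fractional change from old to new (0.28 means +28%)."""
    return (new - old) / old

# Values copied from the Qwen3-4B-Base vs. Qwen3-8B-Base table.
i0_change = relative_change(6167.1, 7908.4)
grounded_ratio = 60.26 / 32.68

print(f"I_0 amplitude change: {i0_change:+.1%}")
print(f"Grounded mean ratio:  {grounded_ratio:.2f}x")
```

  <p>This gives roughly +28.2% for the amplitude and a ratio of about 1.84 for the grounded mean, so
“nearly doubles” is a fair but slightly generous reading.</p>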

  <p>In Intelliton terms, scaling up does not simply add more parameters uniformly. It appears to
<strong>amplify the dominant dynamical sectors</strong> of the model, making the leading quasi-particle modes
substantially stronger and the grounded generation trajectory more sharply defined.</p>

  <p>The momentum structure is also richer in the 8B model. While the leading mode still peaks at
k ≈ π, many of the secondary species cluster around k ≈ 1.885 rather than strictly k ≈ 0. This
suggests that intermediate-scale spatial organisation becomes more visible as the model grows.</p>

  <p><img src="/assets/images/Qwen3-8B-Base/particle_table.png" alt="Qwen3-8B-Base particle table" />
<em>The Intelliton catalogue for Qwen3-8B-Base. The leading mode is stronger, and the species set
is slightly larger compared with Qwen3-4B-Base.</em></p>

  <hr />

  <h2 id="scaling-plus-alignment-qwen3-8b">Scaling plus alignment: Qwen3-8B</h2>

  <p>When both scaling and instruction tuning are applied, as in <code class="language-plaintext highlighter-rouge">Qwen3-8B</code>, the results largely
follow the pattern suggested by the two effects separately.</p>

  <table>
    <thead>
      <tr>
        <th>Property</th>
        <th>Qwen3-8B-Base</th>
        <th>Qwen3-8B</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of species</td>
        <td>7</td>
        <td><strong>6</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> amplitude</td>
        <td>7908.4</td>
        <td>~7600</td>
      </tr>
      <tr>
        <td>Grounded profile mean</td>
        <td>60.26</td>
        <td><strong>57.14</strong></td>
      </tr>
      <tr>
        <td>Fixed-point types</td>
        <td>mixed</td>
        <td>more crossover</td>
      </tr>
    </tbody>
  </table>

  <p>Instruction tuning at 8B scale slightly reduces the species count (7 → 6) and the grounded
profile mean (60.26 → 57.14), consistent with the homogenisation effect seen at 4B scale. The
dominant mode amplitude also drops slightly at 8B (7908.4 → ~7600), whereas at 4B it rose
(6167.1 → 6562.4), so the amplitude shift is not consistent in direction. Even so, the 8B instruct
model remains far stronger than either 4B model in its grounded trajectory profile.</p>

  <p>The combined takeaway is clean:</p>

  <ul>
    <li><strong>Scaling to 8B</strong> increases the strength of dominant collective modes and sharpens the grounded
generation signal.</li>
    <li><strong>Instruction tuning</strong> slightly compresses or regularises that internal structure, reducing
species count and reducing the grounded-hallucination separation.</li>
  </ul>

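  <p>For species count and grounded profile mean, these two effects appear to be largely independent
and roughly additive in their impact on the Intelliton spectrum. The <code class="language-plaintext highlighter-rouge">I_0</code> amplitude does not
follow the same pattern.</p>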
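
  <p>One way to test the additivity reading is to compare the instruct-minus-base shift at each
model size. The sketch below uses only the values from the tables in this article. The dictionary
layout and the helper <code class="language-plaintext highlighter-rouge">tuning_delta</code> are our own, and the Qwen3-8B amplitude is entered as 7600
because the table reports it only approximately.</p>

```python
# Values copied from the comparison tables; the Qwen3-8B amplitude is only
# reported approximately (~7600), so treat its deltas as rough.
catalogue = {
    ("4B", "base"):     {"species": 6, "i0": 6167.1, "grounded": 32.68},
    ("4B", "instruct"): {"species": 5, "i0": 6562.4, "grounded": 29.13},
    ("8B", "base"):     {"species": 7, "i0": 7908.4, "grounded": 60.26},
    ("8B", "instruct"): {"species": 6, "i0": 7600.0, "grounded": 57.14},
}

def tuning_delta(size: str, key: str) -> float:
    """Instruct value minus base value at a fixed model size."""
    return catalogue[(size, "instruct")][key] - catalogue[(size, "base")][key]

for key in ("species", "i0", "grounded"):
    d4, d8 = tuning_delta("4B", key), tuning_delta("8B", key)
    same_sign = d4 * d8 > 0
    print(f"{key:9s} 4B: {d4:+9.2f}  8B: {d8:+9.2f}  same direction: {same_sign}")
```

  <p>Species count and grounded mean shift in the same direction, by similar amounts, at both sizes.
The <code class="language-plaintext highlighter-rouge">I_0</code> amplitude rises at 4B and falls at 8B, so it is the one quantity here that does not
behave additively.</p>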

  <p><img src="/assets/images/Qwen3-8B/particle_table.png" alt="Qwen3-8B particle table" />
<em>The Intelliton catalogue for Qwen3-8B (instruction-tuned 8B model).</em></p>

  <hr />

  <h2 id="a-completely-different-family-mistral-7b-v03">A completely different family: Mistral-7B-v0.3</h2>

  <p>The most dramatic contrast in the entire comparison set comes from <code class="language-plaintext highlighter-rouge">Mistral-7B-v0.3</code>, a 7B model
from a different architecture family.</p>

  <table>
    <thead>
      <tr>
        <th>Property</th>
        <th>Qwen3-4B-Base</th>
        <th>Mistral-7B-v0.3</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Number of species</td>
        <td>6</td>
        <td><strong>25</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> amplitude</td>
        <td>6167.1</td>
        <td><strong>249.4</strong></td>
      </tr>
      <tr>
        <td>Grounded profile mean</td>
        <td>32.68</td>
        <td><strong>21.48</strong></td>
      </tr>
      <tr>
        <td>Grounded profile std</td>
        <td>moderate</td>
        <td><strong>46.56</strong> (very large)</td>
      </tr>
      <tr>
        <td>Fixed-point types</td>
        <td>IR + crossover</td>
        <td>UV + IR + crossover</td>
      </tr>
    </tbody>
  </table>

  <p>Under the same analysis pipeline, Mistral produces <strong>25 species</strong>, more than three times as many as
any Qwen model (25 against at most 7). This is a striking result.</p>

  <p>The leading species <code class="language-plaintext highlighter-rouge">I_0</code> in Mistral has an amplitude of only 249.4, compared with 6167 in
<code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code>. In other words, the Mistral model does not have a strongly dominant backbone
excitation. Its collective mode landscape is much more <strong>fragmented</strong>: many modes of comparable
strength, rather than one overwhelming mode with several weak followers.</p>
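
  <p>The contrast between “one overwhelming mode” and “many comparable modes” can be made
quantitative with a participation ratio, a standard way to count how many entries of a list
effectively matter. This is our illustration, not a quantity the Intelliton pipeline is known to
report, and the amplitude lists below are invented toy values rather than measurements.</p>

```python
def effective_mode_count(amplitudes: list[float]) -> float:
    """Participation ratio: (sum a)^2 / sum(a^2).

    Equals 1 when a single mode carries everything and len(amplitudes)
    when all modes are equally strong.
    """
    total = sum(amplitudes)
    return total * total / sum(a * a for a in amplitudes)

# Toy amplitude lists, invented for illustration only (not measured values).
dominated = [100.0, 5.0, 4.0, 3.0, 2.0, 1.0]   # one strong mode, weak followers
fragmented = [10.0] * 25                        # 25 modes of equal strength

print(effective_mode_count(dominated))
print(effective_mode_count(fragmented))
```

  <p>The dominated list yields an effective count close to 1 and the fragmented list yields 25. A
Qwen-like spectrum would sit near the first case and a Mistral-like spectrum nearer the second.</p>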

  <p>Mistral also shows <strong>UV-labelled species</strong> — modes that remain in a fine-grained, ultraviolet-like
dynamical state throughout the network, rather than flowing toward infrared stability. This suggests
a more persistent fine-grained structure in Mistral’s layers compared with Qwen.</p>

  <p>The generation dynamics also differ. The grounded profile mean for Mistral (21.48) is lower than
for all Qwen models, and the standard deviation (46.56) is much larger. The Intelliton metric,
calibrated using Qwen, describes Mistral’s trajectory space as much noisier and more turbulent.</p>
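
  <p>A quick way to see how large that spread is: divide the standard deviation by the mean. The
sketch below assumes this ratio is a meaningful summary for the grounded profile metric, which the
analysis itself does not state.</p>

```python
# Values copied from the Qwen3-4B-Base vs. Mistral-7B-v0.3 table.
mistral_grounded_mean = 21.48
mistral_grounded_std = 46.56

# Coefficient of variation: spread measured in units of the mean.
coefficient_of_variation = mistral_grounded_std / mistral_grounded_mean
print(f"std / mean = {coefficient_of_variation:.2f}")
```

  <p>The standard deviation is more than twice the mean, which is what “much noisier” amounts to
in numbers.</p>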

  <p>One interpretation:</p>

  <blockquote>
    <p><strong>Qwen organises its internal computation around a few very strong collective modes. Mistral
spreads the load more broadly across many smaller modes. Under this analysis, they have genuinely
different “particle physics” inside.</strong></p>
  </blockquote>

  <p>Whether this difference reflects architectural choices, training data, training procedure, or some
combination is an open question. But the Intelliton framework makes the difference visible and
measurable.</p>

  <p><img src="/assets/images/Mistral-7B-v0.3/particle_table.png" alt="Mistral-7B-v0.3 particle table" />
<em>The Intelliton catalogue for Mistral-7B-v0.3. Twenty-five species, a far more fragmented landscape
than any Qwen model.</em></p>

  <hr />

  <h2 id="a-summary-table">A summary table</h2>

  <table>
    <thead>
      <tr>
        <th>Model</th>
        <th>Species</th>
        <th><code class="language-plaintext highlighter-rouge">I_0</code> Amplitude</th>
        <th>Momentum structure</th>
        <th>Fixed-point types</th>
        <th>Grounded mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Qwen3-4B-Base</td>
        <td>6</td>
        <td>6167</td>
        <td>π + 0 (split)</td>
        <td>IR + crossover</td>
        <td>32.68</td>
      </tr>
      <tr>
        <td>Qwen3-4B</td>
        <td>5</td>
        <td>6562</td>
        <td>1.885 (homogeneous)</td>
        <td>all crossover</td>
        <td>29.13</td>
      </tr>
      <tr>
        <td>Qwen3-8B-Base</td>
        <td>7</td>
        <td>7908</td>
        <td>π + 1.885 (richer)</td>
        <td>mixed</td>
        <td>60.26</td>
      </tr>
      <tr>
        <td>Qwen3-8B</td>
        <td>6</td>
        <td>~7600</td>
        <td>1.885 (homogeneous)</td>
        <td>more crossover</td>
        <td>57.14</td>
      </tr>
      <tr>
        <td>Mistral-7B-v0.3</td>
        <td>25</td>
        <td>249</td>
        <td>fragmented</td>
        <td>UV + IR + crossover</td>
        <td>21.48</td>
      </tr>
    </tbody>
  </table>
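
  <p>The species-count comparison can be read straight off this table. A minimal sketch, with the
counts copied from the rows above and variable names of our own choosing:</p>

```python
# Species counts copied from the summary table.
species = {
    "Qwen3-4B-Base": 6,
    "Qwen3-4B": 5,
    "Qwen3-8B-Base": 7,
    "Qwen3-8B": 6,
    "Mistral-7B-v0.3": 25,
}

mistral = species["Mistral-7B-v0.3"]
ratios = {
    name: mistral / count
    for name, count in species.items()
    if name.startswith("Qwen")
}

for name, ratio in ratios.items():
    print(f"Mistral has {ratio:.2f}x the species of {name}")

smallest_ratio = min(ratios.values())
```

  <p>The ratio ranges from about 3.57 (against Qwen3-8B-Base) to 5.00 (against Qwen3-4B).</p>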

  <hr />

  <h2 id="conclusions">Conclusions</h2>

  <p>The Intelliton comparison across these five models yields several clear empirical regularities:</p>

  <ol>
    <li><strong>Instruction tuning homogenises the momentum structure</strong> and shifts more species into crossover
regimes, reducing internal spectral diversity.</li>
    <li><strong>Scaling from 4B to 8B strengthens the dominant dynamical sectors</strong>, producing a more strongly
occupied and more sharply separated Intelliton landscape.</li>
    <li><strong>Different model families can have qualitatively different internal spectra</strong> — Qwen is dominated
by a few strong modes, Mistral is more fragmented and UV-rich.</li>
    <li>The Intelliton framework provides a vocabulary for these differences that goes beyond benchmark
accuracy or parameter count alone.</li>
  </ol>

  <p>The next article turns to one of the most practically important applications of this framework:
using the Intelliton lens to study <strong>hallucination</strong> — and asking whether internal spectral
instability can be a diagnostic signal for when a model is about to confabulate.</p>

  <hr />

  <h2 id="further-reading">Further reading</h2>

  <ul>
    <li>Continue to <a href="/hallucination/applications/2026/04/04/intellitons-and-hallucination.html">Article 4: Hallucination as Internal Instability</a></li>
    <li>Return to <a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">Article 2: Inside Qwen3-4B-Base</a></li>
    <li>Return to <a href="/introduction/theory/2026/04/01/what-are-intellitons.html">Article 1: What Are Intellitons?</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="llm-">LLM 科学里最重要的两个问题</h2>

  <p>在大语言模型研究中，讨论最多的两个现象就是 <strong>规模扩展</strong> 和 <strong>对齐</strong>。规模扩展指训练更大
的模型；对齐在这里主要指指令微调，也就是把模型微调得更会遵循指令、更像一个“有帮助的
助手”。</p>

  <p>这两种干预都已知能提升基准性能。但它们会不会改变模型的 <em>内部结构</em>？会不会改写它的准粒
子谱？</p>

  <p>对五个模型的 Intelliton 分析，也就是 <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code>、<code class="language-plaintext highlighter-rouge">Qwen3-4B</code>、<code class="language-plaintext highlighter-rouge">Qwen3-8B-Base</code>、
<code class="language-plaintext highlighter-rouge">Qwen3-8B</code> 和 <code class="language-plaintext highlighter-rouge">Mistral-7B-v0.3</code>，给出了一个具体、数据驱动的回答。</p>

  <hr />

  <h2 id="base--instruct">Base 对 Instruct：对齐到底做了什么</h2>

  <p>最直接的比较，是把同一家族里的基础模型与对应的指令微调版本放在一起。</p>

  <h3 id="qwen3-4b-base-vs-qwen3-4b-1">Qwen3-4B-Base vs. Qwen3-4B</h3>

  <table>
    <thead>
      <tr>
        <th>属性</th>
        <th>Qwen3-4B-Base</th>
        <th>Qwen3-4B</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>物种数量</td>
        <td><strong>6</strong></td>
        <td><strong>5</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> 振幅</td>
        <td>6167.1</td>
        <td>6562.4</td>
      </tr>
      <tr>
        <td>主导动量</td>
        <td><strong>k ≈ π</strong></td>
        <td><strong>k ≈ 1.885</strong></td>
      </tr>
      <tr>
        <td>次级动量</td>
        <td>k ≈ 0</td>
        <td>k ≈ 1.885（共享）</td>
      </tr>
      <tr>
        <td>固定点类型</td>
        <td>IR + crossover</td>
        <td>全部为 <strong>crossover</strong></td>
      </tr>
      <tr>
        <td>grounded 轨迹均值</td>
        <td>32.68</td>
        <td>29.13</td>
      </tr>
    </tbody>
  </table>

  <p>最突出的差异在于 <strong>动量结构</strong>。</p>

  <p>在 <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> 中，主导模式 <code class="language-plaintext highlighter-rouge">I_0</code> 位于 k ≈ π，也就是高频交替模式；而五个次级模式都位
于 k ≈ 0，也就是低频、较全局的模式。两者之间有很干净的分裂。</p>

  <p>在 <code class="language-plaintext highlighter-rouge">Qwen3-4B</code> 这个指令微调版本里，主导模式移到了 k ≈ 1.885，次级模式也集中在同样的动量
附近。模型的动量结构变得更 <strong>同质化</strong>，原本骨干模态与任务模态之间清晰的分裂消失了。</p>

  <p>固定点类型的变化同样耐人寻味。在基础模型里，六个物种中的五个是 <strong>IR</strong>，表示已经稳定下
来；只有 <code class="language-plaintext highlighter-rouge">I_0</code> 属于 <strong>crossover</strong>，还在过渡。到了 instruct 模型，<strong>所有</strong> 物种都变成了
 <strong>crossover</strong>。这意味着对齐似乎把模型的集体模式推向一个更统一、更活跃、也更不完全稳定
的动力学区间。</p>

  <p>一种可能的解释是：</p>

  <blockquote>
    <p><strong>指令微调把内部激发景观压缩或重组进了一个更均匀的有效区间。基础模型里原本存在的谱多
样性被部分抹平，更多模式被维持在动态过渡状态。</strong></p>
  </blockquote>

  <p>这并不意味着指令微调更差。更合理的理解是：模型的自由度被正则化到更偏向指令遵循的行为
上，而代价可能是内部结构的一部分区分度下降。</p>

  <p><img src="/assets/images/Qwen3-4B/particle_table.png" alt="Qwen3-4B particle table" />
<em>Qwen3-4B（指令微调版）的 Intelliton 目录。与 Qwen3-4B-Base 对照，可以清楚看到动量结构
的同质化。</em></p>

  <hr />

  <h2 id="b--8b">从 4B 扩到 8B：更多参数带来了什么</h2>

  <p>下一组比较固定模型家族（Qwen3）和训练类型（base），只改变参数规模。</p>

  <h3 id="qwen3-4b-base-vs-qwen3-8b-base-1">Qwen3-4B-Base vs. Qwen3-8B-Base</h3>

  <table>
    <thead>
      <tr>
        <th>属性</th>
        <th>Qwen3-4B-Base</th>
        <th>Qwen3-8B-Base</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>物种数量</td>
        <td>6</td>
        <td><strong>7</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> 振幅</td>
        <td>6167.1</td>
        <td><strong>7908.4</strong></td>
      </tr>
      <tr>
        <td>主导动量</td>
        <td>k ≈ π</td>
        <td>k ≈ π</td>
      </tr>
      <tr>
        <td>grounded 轨迹均值</td>
        <td>32.68</td>
        <td><strong>60.26</strong></td>
      </tr>
      <tr>
        <td>grounded 与 hallucination 的分离度</td>
        <td>中等</td>
        <td><strong>更大</strong></td>
      </tr>
    </tbody>
  </table>

  <p>8B 模型多出了一个物种，而它的主导模式 <code class="language-plaintext highlighter-rouge">I_0</code> 振幅也 <strong>增强了 28%</strong>。更显著的是，grounded
生成轨迹的均值几乎 <strong>翻倍</strong>（32.68 → 60.26）。</p>

  <p>用 Intelliton 的语言来说，规模扩展并不是简单地给模型“均匀加参数”，而更像是在
<strong>放大主导动力学扇区</strong>，让领先的准粒子模式更强，同时让 grounded 生成轨迹更清晰、更稳定。</p>

  <p>8B 模型的动量结构也更丰富。虽然主导模式仍然位于 k ≈ π，但很多次级物种集中在 k ≈ 1.885，
而不再严格卡在 k ≈ 0。这暗示着，随着模型变大，中等尺度的空间组织结构开始更明显地浮现。</p>

  <p><img src="/assets/images/Qwen3-8B-Base/particle_table.png" alt="Qwen3-8B-Base particle table" />
<em>Qwen3-8B-Base 的 Intelliton 目录。相比 Qwen3-4B-Base，领先模式更强，物种集合也略大。</em></p>

  <hr />

  <h2 id="qwen3-8b">规模扩展加上对齐：Qwen3-8B</h2>

  <p>当规模扩展和指令微调同时发生时，也就是 <code class="language-plaintext highlighter-rouge">Qwen3-8B</code>，结果大体延续了前两种效应各自的趋向。</p>

  <table>
    <thead>
      <tr>
        <th>属性</th>
        <th>Qwen3-8B-Base</th>
        <th>Qwen3-8B</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>物种数量</td>
        <td>7</td>
        <td><strong>6</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> 振幅</td>
        <td>7908.4</td>
        <td>~7600</td>
      </tr>
      <tr>
        <td>grounded 轨迹均值</td>
        <td>60.26</td>
        <td><strong>57.14</strong></td>
      </tr>
      <tr>
        <td>固定点类型</td>
        <td>混合</td>
        <td>更多 crossover</td>
      </tr>
    </tbody>
  </table>

  <p>在 8B 规模下，指令微调让物种数略微下降（7 → 6），grounded 轨迹均值也略有降低
（60.26 → 57.14），这与 4B 上观察到的同质化趋势是一致的。主导模式振幅在 8B 上略有下降
（7908.4 → ~7600），而在 4B 上却是上升的（6167.1 → 6562.4），因此振幅变化的方向并不一致。
但即便如此，8B instruct 模型在 grounded 轨迹上的强度仍然明显高于所有 4B 模型。</p>

  <p>合在一起看，结论很清楚：</p>

  <ul>
    <li><strong>扩展到 8B</strong> 会增强主导集体模式，并让 grounded 生成信号更尖锐；</li>
    <li><strong>指令微调</strong> 会略微压缩或正则化这种内部结构，减少物种数量，并缩小 grounded 与 hallucination
之间的间隔。</li>
  </ul>

  <p>就物种数量和 grounded 轨迹均值而言，这两个效应在 Intelliton 谱上的影响看起来大致相互独立，而且近似可叠加；<code class="language-plaintext highlighter-rouge">I_0</code> 振幅则不符合这一规律。</p>

  <p><img src="/assets/images/Qwen3-8B/particle_table.png" alt="Qwen3-8B particle table" />
<em>Qwen3-8B（8B 指令微调版）的 Intelliton 目录。</em></p>

  <hr />

  <h2 id="mistral-7b-v03">完全不同的家族：Mistral-7B-v0.3</h2>

  <p>整个对比集中最戏剧性的差异，来自 <code class="language-plaintext highlighter-rouge">Mistral-7B-v0.3</code>，一个属于完全不同架构家族的 7B 模型。</p>

  <table>
    <thead>
      <tr>
        <th>属性</th>
        <th>Qwen3-4B-Base</th>
        <th>Mistral-7B-v0.3</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>物种数量</td>
        <td>6</td>
        <td><strong>25</strong></td>
      </tr>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">I_0</code> 振幅</td>
        <td>6167.1</td>
        <td><strong>249.4</strong></td>
      </tr>
      <tr>
        <td>grounded 轨迹均值</td>
        <td>32.68</td>
        <td><strong>21.48</strong></td>
      </tr>
      <tr>
        <td>grounded 轨迹标准差</td>
        <td>中等</td>
        <td><strong>46.56</strong>（非常大）</td>
      </tr>
      <tr>
        <td>固定点类型</td>
        <td>IR + crossover</td>
        <td>UV + IR + crossover</td>
      </tr>
    </tbody>
  </table>

  <p>在相同分析管线下，Mistral 产生了 <strong>25 个物种</strong>，是任一 Qwen 模型的三倍以上（25 对最多 7）。
这是非常醒目的结果。</p>

  <p>Mistral 的领先物种 <code class="language-plaintext highlighter-rouge">I_0</code> 振幅只有 249.4，而 <code class="language-plaintext highlighter-rouge">Qwen3-4B-Base</code> 中对应值是 6167。换句话说，
Mistral 并没有一个压倒性的骨干激发。它的集体模式景观更加 <strong>碎片化</strong>：很多模式强度彼此接
近，而不是一个特别强、后面跟着几条弱尾巴。</p>

  <p>Mistral 还出现了 <strong>UV 型物种</strong>，也就是那些在整个网络中都保持细粒度、紫外式动力学状态，
而不会流向红外稳定的模式。这表明，相比 Qwen，Mistral 的层内细粒度结构保留得更久。</p>

  <p>生成动力学也不同。Mistral 的 grounded 轨迹均值（21.48）低于所有 Qwen 模型，而标准差
（46.56）则大得多。以 Qwen 为标定的 Intelliton 指标会把 Mistral 的轨迹空间描述为更嘈杂、
更湍动。</p>

  <p>一种可能的总结是：</p>

  <blockquote>
    <p><strong>Qwen 把内部计算组织在少数几个极强的集体模式周围；Mistral 则把负载分散到许多较小模
式上。从这个分析看，它们内部确实像拥有不同的“粒子物理学”。</strong></p>
  </blockquote>

  <p>这种差异到底来自架构、训练数据、训练流程，还是多种因素叠加，目前仍是开放问题。但
Intelliton 框架至少把这种差异清晰地呈现并量化了出来。</p>

  <p><img src="/assets/images/Mistral-7B-v0.3/particle_table.png" alt="Mistral-7B-v0.3 particle table" />
<em>Mistral-7B-v0.3 的 Intelliton 目录。共 25 个物种，比任何 Qwen 模型都碎片化得多。</em></p>

  <hr />

  <h2 id="section">汇总表</h2>

  <table>
    <thead>
      <tr>
        <th>模型</th>
        <th>物种数</th>
        <th><code class="language-plaintext highlighter-rouge">I_0</code> 振幅</th>
        <th>动量结构</th>
        <th>固定点类型</th>
        <th>grounded 均值</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Qwen3-4B-Base</td>
        <td>6</td>
        <td>6167</td>
        <td>π + 0（分裂）</td>
        <td>IR + crossover</td>
        <td>32.68</td>
      </tr>
      <tr>
        <td>Qwen3-4B</td>
        <td>5</td>
        <td>6562</td>
        <td>1.885（同质）</td>
        <td>全部 crossover</td>
        <td>29.13</td>
      </tr>
      <tr>
        <td>Qwen3-8B-Base</td>
        <td>7</td>
        <td>7908</td>
        <td>π + 1.885（更丰富）</td>
        <td>混合</td>
        <td>60.26</td>
      </tr>
      <tr>
        <td>Qwen3-8B</td>
        <td>6</td>
        <td>~7600</td>
        <td>1.885（同质）</td>
        <td>更多 crossover</td>
        <td>57.14</td>
      </tr>
      <tr>
        <td>Mistral-7B-v0.3</td>
        <td>25</td>
        <td>249</td>
        <td>碎片化</td>
        <td>UV + IR + crossover</td>
        <td>21.48</td>
      </tr>
    </tbody>
  </table>

  <hr />

  <h2 id="section-1">结论</h2>

  <p>这五个模型的 Intelliton 对比给出了几条相当清楚的经验规律：</p>

  <ol>
    <li><strong>指令微调会同质化动量结构</strong>，并把更多物种推入 crossover 区间，从而降低内部谱多样性；</li>
    <li><strong>从 4B 扩展到 8B 会强化主导动力学扇区</strong>，形成占据更强、分离更清晰的 Intelliton 景观；</li>
    <li><strong>不同模型家族可以拥有定性上非常不同的内部谱</strong>：Qwen 由少数强模式主导，Mistral 则更
碎片化，也更偏 UV；</li>
    <li>Intelliton 框架为这些差异提供了一套超越基准分数和参数规模的描述语言。</li>
  </ol>

  <p>下一篇文章会把这个框架用于一个更直接的应用问题：<strong>幻觉</strong>。我们将问，内部谱不稳定性是否
能成为模型即将开始“编造”的诊断信号。</p>

  <hr />

  <h2 id="section-2">延伸阅读</h2>

  <ul>
    <li>继续阅读 <a href="/hallucination/applications/2026/04/04/intellitons-and-hallucination.html">第 4 篇：把幻觉理解为内部不稳定性</a></li>
    <li>返回 <a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">第 2 篇：走进 Qwen3-4B-Base</a></li>
    <li>返回 <a href="/introduction/theory/2026/04/01/what-are-intellitons.html">第 1 篇：什么是 Intelliton？</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="comparison" /><category term="scaling" /><category term="alignment" /><summary type="html"><![CDATA[What happens to a model's internal quasi-particle spectrum when you double the parameter count? What does instruction tuning do to the excitation landscape? This article compares four Qwen3 models — 4B vs 8B, Base vs Instruct — and adds Mistral-7B-v0.3 for a cross-family perspective.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How to Read `I_0` to `I_4`: A Human Guide to an Intelliton Spectrum</title><link href="https://intellitons.wiki/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html" rel="alternate" type="text/html" title="How to Read `I_0` to `I_4`: A Human Guide to an Intelliton Spectrum" /><published>2026-04-02T00:00:00+00:00</published><updated>2026-04-02T00:00:00+00:00</updated><id>https://intellitons.wiki/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum</id><content type="html" xml:base="https://intellitons.wiki/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html"><![CDATA[<div data-lang="en">

  <h2 id="read-the-report-with-the-right-mental-model-first">Read the report with the right mental model first</h2>

  <p>The safest way to read an Intelliton spectrum is this:</p>

  <ul>
    <li>the species labels <code class="language-plaintext highlighter-rouge">I_0</code>, <code class="language-plaintext highlighter-rouge">I_1</code>, <code class="language-plaintext highlighter-rouge">I_2</code>, and so on are <strong>recurring modes</strong>, not literal particles,</li>
    <li>the physics vocabulary is a compact description of behaviour, not a claim of hidden quantum matter,</li>
    <li>what matters most is the <strong>role</strong> of a mode, not its dramatic name.</li>
  </ul>

  <p>In the report discussed here, the main modes are broad sequence-scale patterns rather than tiny
token-local ripples. Their masses are relatively light, which means they persist through many
layers, and their helicity proxy is fairly stable, which means their directional signature is not
being completely scrambled as the network gets deeper.</p>

  <p>That already gives a useful picture: these are not one-off flashes. They are reusable internal
carriers.</p>

  <hr />

  <h2 id="before-the-species-list-decode-the-four-columns">Before the species list, decode the four columns</h2>

  <p>If a spectrum table feels abstract, reduce it to four plain questions.</p>

  <h3 id="momentum">Momentum</h3>

  <p>Momentum asks whether the pattern is smooth across token positions or rapidly oscillating.</p>

  <ul>
    <li>low momentum means a broad, global sequence pattern,</li>
    <li>high momentum means sharper token-to-token variation.</li>
  </ul>
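
  <p>To make the two ends of the momentum scale concrete, the sketch below takes a discrete Fourier
transform over token positions and reports which momentum carries the most weight. The function
name and the two toy signals are ours, and the report’s own momentum estimator may differ in
detail.</p>

```python
import cmath
import math

def dominant_momentum(signal: list[float]) -> float:
    """Return the momentum k in [0, pi] with the largest DFT magnitude.

    k = 2*pi*n/N for n = 0 .. N//2, where N is the number of token positions.
    """
    n_pos = len(signal)
    best_k, best_mag = 0.0, -1.0
    for n in range(n_pos // 2 + 1):
        k = 2 * math.pi * n / n_pos
        coeff = sum(x * cmath.exp(-1j * k * t) for t, x in enumerate(signal))
        # Small tolerance so floating-point noise cannot displace a real peak.
        if abs(coeff) > best_mag + 1e-9:
            best_k, best_mag = k, abs(coeff)
    return best_k

# Toy activation profiles over 10 token positions (illustrative, not measured).
smooth = [1.0] * 10                               # same value everywhere
alternating = [(-1.0) ** t for t in range(10)]    # flips sign every token

print(dominant_momentum(smooth))        # peak at k = 0
print(dominant_momentum(alternating))   # peak at k close to pi
```

  <p>A profile that is identical at every position peaks at k = 0, while one that flips sign on every
token peaks at k ≈ π, the highest momentum a discrete sequence can carry.</p>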

  <p>In the report discussed here, the important species sit close to the low-momentum end, so the best
mental image is not a tiny local feature but a large-scale background shape spread across the
sequence.</p>

  <h3 id="spin-like-score">Spin-like score</h3>

  <p>This is not literal spin. It is better read as <strong>internal complexity</strong>.</p>

  <ul>
    <li>low spin-like score means one dominant internal direction stands out,</li>
    <li>high spin-like score means several comparable directions are mixed together.</li>
  </ul>
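
  <p>The report’s exact formula for the spin-like score is not reproduced here. As a stand-in with
the same qualitative behaviour, the sketch below counts the “effective number of directions” as
the exponential of the entropy of a set of direction weights. Both the measure and the toy weights
are assumptions made for illustration.</p>

```python
import math

def effective_directions(weights: list[float]) -> float:
    """exp(Shannon entropy) of the normalised weights.

    Returns 1.0 when one direction carries all the weight and
    len(weights) when every direction carries the same weight.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Toy direction weights (illustrative, not taken from the report).
one_axis = [1.0, 0.0, 0.0, 0.0]
mixed = [0.25, 0.25, 0.25, 0.25]

print(effective_directions(one_axis))
print(effective_directions(mixed))
```

  <p>The single-axis case gives 1 and the evenly mixed case gives 4, matching the low and high ends
described above.</p>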

  <h3 id="mass">Mass</h3>

  <p>Mass tells you how fast a mode fades with depth.</p>

  <ul>
    <li>light modes survive many layers,</li>
    <li>heavy modes disappear quickly.</li>
  </ul>

  <p>So when the report says the species are light, it is really saying they are not shallow noise. They
are able to propagate through the stack.</p>
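
  <p>If a mode’s amplitude is assumed to fall off exponentially with depth, its mass can be read as
the decay rate per layer. That exponential form is a modelling assumption of this sketch rather
than something the report spells out, and the per-layer amplitudes below are synthetic.</p>

```python
import math

def estimate_mass(amplitudes: list[float]) -> float:
    """Least-squares slope of -log(amplitude) against layer index.

    Assumes amplitude(layer) ~ A0 * exp(-mass * layer), which is a modelling
    assumption of this sketch rather than the report's stated definition.
    """
    layers = list(range(len(amplitudes)))
    logs = [math.log(a) for a in amplitudes]
    mean_x = sum(layers) / len(layers)
    mean_y = sum(logs) / len(logs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(layers, logs))
    var = sum((x - mean_x) ** 2 for x in layers)
    return -cov / var

# Synthetic per-layer amplitudes (illustrative only).
light = [100.0 * math.exp(-0.05 * layer) for layer in range(12)]
heavy = [100.0 * math.exp(-0.80 * layer) for layer in range(12)]

print(round(estimate_mass(light), 3))
print(round(estimate_mass(heavy), 3))
```

  <p>The fit recovers the decay rates used to build the toy data (0.05 and 0.8). After twelve layers
the light mode keeps more than half of its starting amplitude, while the heavy mode has all but
vanished.</p>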

  <h3 id="helicity-proxy">Helicity proxy</h3>

  <p>Helicity here means a simplified combination of propagation direction and internal orientation.</p>

  <ul>
    <li>stable helicity means the mode keeps a recognisable directional signature,</li>
    <li>unstable helicity means that signature is getting mixed away.</li>
  </ul>
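
  <p>One simple way to score that stability, offered here as a hypothetical measure rather than the
report’s own, is the fraction of adjacent layers across which the helicity proxy keeps its sign.
The per-layer values are toy numbers.</p>

```python
def helicity_stability(per_layer: list[float]) -> float:
    """Fraction of adjacent layer pairs whose helicity proxy keeps its sign."""
    pairs = list(zip(per_layer, per_layer[1:]))
    kept = sum(1 for a, b in pairs if a * b > 0)
    return kept / len(pairs)

# Toy per-layer helicity values (illustrative only).
stable = [0.8, 0.7, 0.75, 0.9, 0.85]
scrambled = [0.8, -0.6, 0.7, -0.9, 0.5]

print(helicity_stability(stable))
print(helicity_stability(scrambled))
```

  <p>The first sequence scores 1.0 and the second 0.0. Real modes would fall somewhere in between.</p>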

  <hr />

  <h2 id="i0-the-default-continuation-backbone"><code class="language-plaintext highlighter-rouge">I_0</code>: the default continuation backbone</h2>

  <p><code class="language-plaintext highlighter-rouge">I_0</code> is the easiest species to explain because it is both the strongest and the simplest.</p>

  <p>In the report, it has the largest amplitude and the lowest spin-like complexity among the leading
species. The plain-language reading is:</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_0</code> behaves like a strong background mode that helps the model keep a sentence moving toward a
plausible answer.</p>
  </blockquote>

  <p>It is less like a specific fact and more like a <strong>general continuation scaffold</strong>. When the prompt
is something like “If all dogs are animals…” or “What is 7 + 8?”, <code class="language-plaintext highlighter-rouge">I_0</code> looks like the broad mode
that helps open and stabilise the answer slot.</p>

  <p>If you want a slogan, <code class="language-plaintext highlighter-rouge">I_0</code> is the model’s “keep the computation on the rails” mode.</p>

  <hr />

  <h2 id="i1-the-quiet-structural-support"><code class="language-plaintext highlighter-rouge">I_1</code>: the quiet structural support</h2>

  <p><code class="language-plaintext highlighter-rouge">I_1</code> is best read as a support mode rather than a flashy decision-maker.</p>

  <p>In an intervention-style reading, changing <code class="language-plaintext highlighter-rouge">I_1</code> often produces smaller visible output shifts than
changing the stronger causal modes. That does <strong>not</strong> mean it is useless. It usually means it is
too infrastructural to show up as an obvious word swap.</p>

  <p>The plain-language reading is:</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_1</code> looks like a structural support mode that helps maintain the shape and stability of the
representation while other modes do more task-specific work.</p>
  </blockquote>

  <p>Think of it as scaffolding rather than the headline feature.</p>

  <hr />

  <h2 id="i2-a-reference-resolution-mode-with-a-person-like-bias"><code class="language-plaintext highlighter-rouge">I_2</code>: a reference-resolution mode with a person-like bias</h2>

  <p>The most intuitive reading of <code class="language-plaintext highlighter-rouge">I_2</code> comes from pronoun-style prompts such as:</p>

  <blockquote>
    <p>“Alice gave Bob a book. He thanked her for …”</p>
  </blockquote>

  <p>In the report, amplifying <code class="language-plaintext highlighter-rouge">I_2</code> nudges the output toward a more person-centred, masculine pronoun
interpretation. That makes <code class="language-plaintext highlighter-rouge">I_2</code> feel less like a generic language mode and more like a <strong>reference
selection channel</strong>.</p>

  <p>The plain-language reading is:</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_2</code> appears to help the model decide which person-like entity the sentence is currently tracking.</p>
  </blockquote>

  <p>That does not make it a literal “male pronoun particle.” It means that, in this probe, the mode is
consistently involved when the model has to collapse a messy discourse context into one concrete
referent.</p>
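
  <p>For readers wondering what “amplifying” a mode could mean mechanically, one common approach is
to rescale the component of a hidden state that lies along the mode’s direction. The report does
not say that this is how its interventions are implemented, so the function below is a sketch
under that assumption, with a toy three-dimensional state.</p>

```python
def amplify_mode(hidden: list[float], mode: list[float], gain: float) -> list[float]:
    """Scale the component of `hidden` along `mode` by `gain`.

    gain = 1.0 leaves the state unchanged, gain = 0.0 removes the mode,
    and gain > 1.0 amplifies it. `mode` need not be normalised.
    """
    norm_sq = sum(m * m for m in mode)
    coeff = sum(h * m for h, m in zip(hidden, mode)) / norm_sq
    return [h + (gain - 1.0) * coeff * m for h, m in zip(hidden, mode)]

# Toy 3-dimensional hidden state and mode direction (illustrative only).
hidden = [2.0, 1.0, 0.0]
mode = [1.0, 0.0, 0.0]

print(amplify_mode(hidden, mode, 2.0))   # [4.0, 1.0, 0.0]
print(amplify_mode(hidden, mode, 0.0))   # [0.0, 1.0, 0.0]
```

  <p>Only the component along the chosen direction changes; everything orthogonal to it is left
alone, which is why such an edit can shift a pronoun choice without wrecking the rest of the
sentence.</p>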

  <hr />

  <h2 id="i3-a-higher-complexity-mixing-mode"><code class="language-plaintext highlighter-rouge">I_3</code>: a higher-complexity mixing mode</h2>

  <p><code class="language-plaintext highlighter-rouge">I_3</code> looks less like a single-purpose button and more like a mixed coordination mode.</p>

  <p>Its spin-like complexity is higher, which suggests that it is built from several comparably relevant
internal directions rather than one clean axis. That usually happens in prompts where the model must
hold multiple constraints in mind at once.</p>

  <p>The plain-language reading is:</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_3</code> behaves like a mode for combining several partial constraints into one workable internal
state.</p>
  </blockquote>

  <p>So rather than deciding one token directly, <code class="language-plaintext highlighter-rouge">I_3</code> is better imagined as part of the middle-layer
machinery that keeps complex reasoning or structured sentence interpretation coherent.</p>

  <hr />

  <h2 id="i4-a-complementary-reference-mode"><code class="language-plaintext highlighter-rouge">I_4</code>: a complementary reference mode</h2>

  <p><code class="language-plaintext highlighter-rouge">I_4</code> looks related to <code class="language-plaintext highlighter-rouge">I_2</code>, but with a different directional bias in pronoun-style settings.</p>

  <p>In the report, amplifying <code class="language-plaintext highlighter-rouge">I_4</code> can nudge outputs toward forms like “her” rather than a neutral or
object-like continuation. The plain-language reading is:</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_4</code> is another reference-sensitive mode, complementary to <code class="language-plaintext highlighter-rouge">I_2</code>, and appears when the model has
to settle on a different discourse framing of who is being talked about.</p>
  </blockquote>

  <p>This is useful because it shows that “pronoun tracking” is not a single monolithic skill. The model
can separate that work into several nearby but distinct modes.</p>

  <hr />

  <h2 id="what-the-whole-spectrum-says-in-one-paragraph">What the whole spectrum says in one paragraph</h2>

  <p>Taken together, <code class="language-plaintext highlighter-rouge">I_0</code> to <code class="language-plaintext highlighter-rouge">I_4</code> tell a coherent story.</p>

  <ul>
    <li><code class="language-plaintext highlighter-rouge">I_0</code> is a strong general backbone.</li>
    <li><code class="language-plaintext highlighter-rouge">I_1</code> helps keep the internal state stable.</li>
    <li><code class="language-plaintext highlighter-rouge">I_2</code> and <code class="language-plaintext highlighter-rouge">I_4</code> are more obviously tied to reference selection.</li>
    <li><code class="language-plaintext highlighter-rouge">I_3</code> looks like a higher-complexity mixing mode.</li>
  </ul>

  <p>That is why the Intelliton view can be useful. It turns a huge hidden state into a cast of recurring
roles.</p>

  <hr />

  <h2 id="what-not-to-over-interpret">What not to over-interpret</h2>

  <p>There are two important cautions.</p>

  <ol>
    <li>Species indices are bookkeeping labels. <code class="language-plaintext highlighter-rouge">I_2</code> in one run is not guaranteed to mean exactly the
same thing in every future run.</li>
    <li>Terms like momentum, spin, and helicity are proxies. They organise evidence, but they are not
proof that the network literally contains particle-like objects.</li>
  </ol>

  <p>The disciplined reading is: these labels help summarise recurrent activation roles.</p>
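
  <p>Because indices are only bookkeeping, comparing two runs means matching species by what they
look like rather than by their labels. The sketch below pairs modes greedily by absolute cosine
similarity. The function names, the greedy strategy, and the toy vectors are all ours, and taking
the absolute value assumes the overall sign of a mode vector is arbitrary.</p>

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_species(
    run_a: dict[str, list[float]],
    run_b: dict[str, list[float]],
) -> dict[str, str]:
    """Greedily pair each label in run_a with its most similar unused label in run_b."""
    pairs = sorted(
        (
            (abs(cosine(vec_a, vec_b)), label_a, label_b)
            for label_a, vec_a in run_a.items()
            for label_b, vec_b in run_b.items()
        ),
        reverse=True,
    )
    matched: dict[str, str] = {}
    used: set[str] = set()
    for _, label_a, label_b in pairs:
        if label_a not in matched and label_b not in used:
            matched[label_a] = label_b
            used.add(label_b)
    return matched

# Toy mode vectors: run_b holds nearly the same two directions, labels swapped.
run_a = {"I_0": [1.0, 0.0, 0.1], "I_1": [0.0, 1.0, 0.0]}
run_b = {"I_0": [0.0, 0.9, 0.1], "I_1": [0.9, 0.0, 0.0]}

print(match_species(run_a, run_b))   # {'I_0': 'I_1', 'I_1': 'I_0'}
```

  <p>In this toy case the first run’s <code class="language-plaintext highlighter-rouge">I_0</code> corresponds to the second run’s <code class="language-plaintext highlighter-rouge">I_1</code> and vice versa, which
is exactly the kind of relabelling the first caution warns about.</p>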

  <hr />

  <h2 id="continue-reading">Continue reading</h2>

  <ul>
    <li><a href="/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons.html">Why Different Prompts Light Up Different Intellitons</a></li>
    <li><a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">Scaling and Alignment Through the Intelliton Lens</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="section">先用对心智模型，再看谱表</h2>

  <p>读 Intelliton 谱表时，最稳妥的起点是这三句话：</p>

  <ul>
    <li><code class="language-plaintext highlighter-rouge">I_0</code>、<code class="language-plaintext highlighter-rouge">I_1</code>、<code class="language-plaintext highlighter-rouge">I_2</code> 这些名字表示的是<strong>反复出现的模式</strong>，不是字面意义上的粒子</li>
    <li>物理词汇是对行为的压缩描述，不是说模型里藏着量子物质</li>
    <li>最重要的不是名字本身，而是每个模式在计算里扮演了什么角色</li>
  </ul>

  <p>在这里讨论的这份报告里，几个主导模式更像覆盖整段序列的大尺度结构，而不是只绑在某个
token 上的一次性小波纹。它们的质量都偏轻，说明能跨很多层传播；它们的螺旋度代理量也相
对稳定，说明这种方向性签名没有在层间被完全打散。</p>

  <p>这已经很值得注意了：这些模式不是一闪而过的火花，而是可重复使用的内部载体。</p>

  <hr />

  <h2 id="section-1">先把四列术语翻译成人话</h2>

  <p>如果一张谱表看上去很抽象，就先把它压缩成四个问题。</p>

  <h3 id="section-2">动量</h3>

  <p>动量问的是：这个模式沿 token 位置是平滑的，还是快速振荡的？</p>

  <ul>
    <li>低动量表示更全局、更平滑的序列模式</li>
    <li>高动量表示相邻 token 之间变化更快</li>
  </ul>

  <p>这份报告里，重要物种都更靠近低动量端，所以更合适的心智图像不是“某个 token 上的小机关”，
而是“覆盖整个序列的大背景形状”。</p>

  <h3 id="section-3">类自旋分数</h3>

  <p>这不是字面意义上的自旋，更适合读成<strong>内部复杂度</strong>。</p>

  <ul>
    <li>分数低，说明一个内部方向特别突出</li>
    <li>分数高，说明多个方向混在一起，结构更复杂</li>
  </ul>

  <h3 id="section-4">质量</h3>

  <p>质量说的是一个模式会不会随着层数加深而快速衰减。</p>

  <ul>
    <li>轻模式能活很多层</li>
    <li>重模式很快消失</li>
  </ul>

  <p>所以当报告说这些物种都偏轻，本质意思就是：它们不是浅层噪声，而是能一路传播到更深层的
内部模式。</p>

  <h3 id="section-5">螺旋度代理量</h3>

  <p>这里的螺旋度，是传播方向和内部朝向结合起来的一个简化指标。</p>

  <ul>
    <li>稳定说明模式保留了可辨认的方向性签名</li>
    <li>不稳定说明这种签名被混掉了</li>
  </ul>

  <hr />

  <h2 id="i0"><code class="language-plaintext highlighter-rouge">I_0</code>：最强的默认续写底座</h2>

  <p><code class="language-plaintext highlighter-rouge">I_0</code> 是最容易解释的一个物种，因为它既最强，也最简单。</p>

  <p>在这份报告里，它的振幅最大，而且在主导物种里类自旋复杂度最低。最直白的人话解释是：</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_0</code> 很像一个强背景模式，用来保证模型把句子继续往一个合理答案上推进。</p>
  </blockquote>

  <p>它不像某条具体知识，更像一个<strong>通用续写骨架</strong>。当提示词是“如果所有狗都是动物……”或
“7 + 8 等于多少？”这种形式时，<code class="language-plaintext highlighter-rouge">I_0</code> 看起来像是在把“答案槽位”撑开并稳定住的那股力。</p>

  <p>如果硬要压缩成一句话，<code class="language-plaintext highlighter-rouge">I_0</code> 就像模型里那个“先让计算别跑偏”的总底座。</p>

  <hr />

  <h2 id="i1"><code class="language-plaintext highlighter-rouge">I_1</code>：安静但重要的结构支撑</h2>

  <p><code class="language-plaintext highlighter-rouge">I_1</code> 更适合被读成支撑模式，而不是最显眼的决策按钮。</p>

  <p>在干预式阅读里，改动 <code class="language-plaintext highlighter-rouge">I_1</code> 往往不会像改强因果模式那样，立刻把某个词换掉。这<strong>不</strong>代表它
没用，更常见的解释是：它太基础、太基础设施化了，所以表面输出不一定马上剧烈变化。</p>

  <p>更合适的人话解释是：</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_1</code> 像一个维持表示形状和系统稳定性的结构模式，让其他更任务化的模式在上面工作。</p>
  </blockquote>

  <p>它更像脚手架，而不是舞台中央的主角。</p>

  <hr />

  <h2 id="i2"><code class="language-plaintext highlighter-rouge">I_2</code>：带有人物指代偏向的引用解析模式</h2>

  <p><code class="language-plaintext highlighter-rouge">I_2</code> 最好理解的场景，是这类代词提示词：</p>

  <blockquote>
    <p>“Alice gave Bob a book. He thanked her for …”</p>
  </blockquote>

  <p>在这份报告里，放大 <code class="language-plaintext highlighter-rouge">I_2</code> 会把输出往更偏人物、偏男性代词解释的方向推。这让它不像一个通
用语言模式，而更像一条<strong>指代选择通道</strong>。</p>

  <p>更通俗地说：</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_2</code> 看起来会帮助模型决定，这句话现在到底在跟踪哪一个“人”。</p>
  </blockquote>

  <p>这并不意味着它是一个字面意义上的“男性代词粒子”。更稳妥的理解是：在这个探针设置里，只
要模型需要把混杂的语篇上下文压缩成一个明确先行词，<code class="language-plaintext highlighter-rouge">I_2</code> 就会稳定参与进来。</p>

  <hr />

  <h2 id="i3"><code class="language-plaintext highlighter-rouge">I_3</code>：更高复杂度的混合协调模式</h2>

  <p><code class="language-plaintext highlighter-rouge">I_3</code> 不像一个单用途按钮，更像一个负责混合多种约束的协调模式。</p>

  <p>它的类自旋复杂度更高，说明它不是沿着一条干净轴工作，而是由几个同样重要的内部方向共同
构成。这往往出现在模型需要同时维持多个约束的提示词里。</p>

  <p>更合适的人话解释是：</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_3</code> 像是在把几条半成品约束揉成一个可用内部状态的模式。</p>
  </blockquote>

  <p>所以与其把它想成直接拍板某个 token 的按钮，不如把它想成中间层里保持复杂推理或结构理解
不散架的那台“混合器”。</p>

  <hr />

  <h2 id="i4-i2-"><code class="language-plaintext highlighter-rouge">I_4</code>：与 <code class="language-plaintext highlighter-rouge">I_2</code> 互补的另一条指代模式</h2>

  <p><code class="language-plaintext highlighter-rouge">I_4</code> 和 <code class="language-plaintext highlighter-rouge">I_2</code> 有相似之处，但在代词类场景里又带着不同的方向偏好。</p>

  <p>在这份报告里，放大 <code class="language-plaintext highlighter-rouge">I_4</code> 会把输出往 <code class="language-plaintext highlighter-rouge">her</code> 这类形式推，而不是中性或其他续写。更通俗的读
法是：</p>

  <blockquote>
    <p><code class="language-plaintext highlighter-rouge">I_4</code> 也是一个对指代敏感的模式，只是它和 <code class="language-plaintext highlighter-rouge">I_2</code> 在“当前到底在说谁”这个问题上，代表了
不同的语篇落点。</p>
  </blockquote>

  <p>这点很重要，因为它说明“代词跟踪”不是一个整块技能。模型可以把这项工作拆成几条彼此相近、
但又不完全相同的内部模式。</p>

  <hr />

  <h2 id="section-6">把整张谱表压成一段话</h2>

  <p>把 <code class="language-plaintext highlighter-rouge">I_0</code> 到 <code class="language-plaintext highlighter-rouge">I_4</code> 合起来看，故事其实很连贯：</p>

  <ul>
    <li><code class="language-plaintext highlighter-rouge">I_0</code> 是强而通用的背景底座</li>
    <li><code class="language-plaintext highlighter-rouge">I_1</code> 负责稳住结构</li>
    <li><code class="language-plaintext highlighter-rouge">I_2</code> 和 <code class="language-plaintext highlighter-rouge">I_4</code> 更明显地参与指代选择</li>
    <li><code class="language-plaintext highlighter-rouge">I_3</code> 更像复杂约束的混合模式</li>
  </ul>

  <p>这就是 Intelliton 视角的价值。它把一大片难以直视的隐藏状态，压缩成一组反复出现的“角色分工”。</p>

  <hr />

  <h2 id="section-7">哪些地方不要过度解读</h2>

  <p>这里有两个很重要的保留意见。</p>

  <ol>
    <li>物种编号只是记账标签。一次运行里的 <code class="language-plaintext highlighter-rouge">I_2</code>，不保证永远和另一次运行里的 <code class="language-plaintext highlighter-rouge">I_2</code> 完全等价。</li>
    <li>动量、自旋、螺旋度这些词都是代理量。它们是在组织证据，不是在证明网络里真的有字面意
义上的粒子。</li>
  </ol>

  <p>最稳妥的读法是：这些标签在帮我们总结反复出现的激活角色。</p>

  <hr />

  <h2 id="section-8">继续阅读</h2>

  <ul>
    <li><a href="/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons.html">为什么不同提示词会点亮不同 Intelliton 模式</a></li>
    <li><a href="/comparison/scaling/alignment/2026/04/03/scaling-alignment-intellitons.html">用 Intelliton 视角看规模扩展与对齐</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="case-study" /><category term="interpretation" /><summary type="html"><![CDATA[Spectrum tables can look intimidating. This article translates a representative Intelliton report into ordinary language and explains what `I_0` to `I_4` are doing without over-reading the physics metaphor.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">What Are Intellitons? A Friendly Guide to the Lattice-Field View</title><link href="https://intellitons.wiki/introduction/theory/2026/04/01/what-are-intellitons.html" rel="alternate" type="text/html" title="What Are Intellitons? A Friendly Guide to the Lattice-Field View" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://intellitons.wiki/introduction/theory/2026/04/01/what-are-intellitons</id><content type="html" xml:base="https://intellitons.wiki/introduction/theory/2026/04/01/what-are-intellitons.html"><![CDATA[<div data-lang="en">

  <h2 id="start-with-the-least-mysterious-version">Start with the least mysterious version</h2>

  <p>The Intelliton project is <strong>not</strong> claiming that a language model secretly contains real physical
particles.</p>

  <p>The core idea is simpler and more useful than that: take the transformer residual stream, write it
in a coordinate system that physicists already know how to reason about, and ask whether stable,
recurrent modes appear.</p>

  <p>At one layer, the residual stream is just a matrix:</p>

  <ul>
    <li><code class="language-plaintext highlighter-rouge">T</code> rows for token positions</li>
    <li><code class="language-plaintext highlighter-rouge">D</code> columns for hidden channels</li>
  </ul>

  <p>You can think of it as a long row of sensors. Each token position has thousands of readings. The
question is not whether any single neuron matters, but whether the whole pattern can be compressed
into a small set of reusable modes.</p>

  <hr />

  <h2 id="the-sensor-analogy">The sensor analogy</h2>

  <p>Imagine a sentence with 20 token positions. At each position, instead of one reading, you have a
vector with thousands of numbers. That is what one layer of the residual stream looks like.</p>

  <p>Now ask four very ordinary questions:</p>

  <ol>
    <li>Along the token axis, is the pattern smooth or rapidly oscillating?</li>
    <li>Inside hidden channels, does it point mostly in one direction or is it a messy mixture?</li>
    <li>As layers get deeper, does the pattern survive or die out quickly?</li>
    <li>Does the pattern’s internal structure stay tied to a preferred propagation direction?</li>
  </ol>

  <p>Those four questions become the project’s four main diagnostics:</p>

  <ul>
    <li><strong>Momentum</strong> answers question 1.</li>
    <li><strong>Spin-like complexity</strong> answers question 2.</li>
    <li><strong>Mass</strong> answers question 3.</li>
    <li><strong>Helicity proxy</strong> answers question 4.</li>
  </ul>

  <p>This is why the physics language is useful. It gives a compact way to talk about four different
facets of the same hidden pattern.</p>

  <hr />

  <h2 id="a-new-coordinate-system-not-a-new-ontology">A new coordinate system, not a new ontology</h2>

  <p>The project rewrites the residual stream in a very specific way:</p>

  <ul>
    <li>the <strong>token axis</strong> is treated like a one-dimensional lattice in space,</li>
    <li>the <strong>layer axis</strong> is treated like discrete time,</li>
    <li>the <strong>hidden channels</strong> are treated like internal degrees of freedom.</li>
  </ul>

  <p>That mapping is the whole point. It does not say that text is literally matter. It says that a
familiar toolkit from lattice field theory can be borrowed to organise activation patterns.</p>

  <p>In code, the main definitions live in <code class="language-plaintext highlighter-rouge">src/lattice_field.py</code>, and the overall orchestration sits in
<code class="language-plaintext highlighter-rouge">src/intelliton_analyzer.py</code>.</p>
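
  <p>The mapping can be pictured as a single three-axis array. The sketch below is illustrative only and
is not taken from <code class="language-plaintext highlighter-rouge">src/lattice_field.py</code>: the axis order (layer, token, channel), the toy sizes, and
the helper names <code class="language-plaintext highlighter-rouge">layer_slice</code> and <code class="language-plaintext highlighter-rouge">site_history</code> are assumptions.</p>

```python
import numpy as np

# Hypothetical toy sizes: 6 layers, 20 token positions, 64 hidden channels.
L, T, D = 6, 20, 64
rng = np.random.default_rng(0)

# field[l, t, :] is the hidden vector at layer l ("time") and token t ("lattice site").
field = rng.standard_normal((L, T, D))

def layer_slice(field, layer):
    """Return the T x D residual-stream matrix at one layer (one time step)."""
    return field[layer]

def site_history(field, token):
    """Return the L x D history of one token position across all layers."""
    return field[:, token, :]

print(layer_slice(field, 0).shape)   # (20, 64)
print(site_history(field, 3).shape)  # (6, 64)
```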

  <hr />

  <h2 id="what-momentum-means-here">What momentum means here</h2>

  <p>Momentum in this project is just a Fourier description of how a mode varies across token positions.</p>

  <ul>
    <li>If the dominant momentum is near <code class="language-plaintext highlighter-rouge">k = 0</code>, the pattern is broad and smooth across the sequence.</li>
    <li>If the dominant momentum is large, the pattern flips more sharply from one token to the next.</li>
  </ul>

  <p>An everyday analogy is an audio equalizer:</p>

  <ul>
    <li>low frequency means slow, smooth variation,</li>
    <li>high frequency means fast, jagged variation.</li>
  </ul>

  <p>So when a report says a mode is low-momentum, it is usually saying: this is a sequence-scale pattern,
not a tiny local blip tied to one token.</p>
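
  <p>A minimal sketch of that Fourier reading, assuming a single layer is available as a
<code class="language-plaintext highlighter-rouge">T</code> by <code class="language-plaintext highlighter-rouge">D</code> NumPy array. The function names and the choice to sum power over hidden channels
are assumptions for illustration; the project's own definition may differ.</p>

```python
import numpy as np

def momentum_spectrum(layer_matrix):
    """Power per momentum index k for a (T, D) matrix, summed over channels."""
    # Discrete Fourier transform along the token axis (axis 0).
    fk = np.fft.rfft(layer_matrix, axis=0)
    return (np.abs(fk) ** 2).sum(axis=1)

def dominant_momentum(layer_matrix):
    """Index k of the strongest Fourier component along the token axis."""
    return int(np.argmax(momentum_spectrum(layer_matrix)))

T, D = 20, 8
# Smooth pattern: the same value at every token position.
smooth = np.ones((T, D))
# Jagged pattern: the sign flips from each token to the next.
jagged = np.tile(((-1.0) ** np.arange(T))[:, None], (1, D))

print(dominant_momentum(smooth))  # 0
print(dominant_momentum(jagged))  # 10, the highest index available for T = 20
```

  <p>The constant pattern puts all of its power at <code class="language-plaintext highlighter-rouge">k = 0</code>, while the token-to-token sign flip lands at
the top of the spectrum, matching the low-frequency and high-frequency ends of the equalizer analogy.</p>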

  <hr />

  <h2 id="what-spin-like-complexity-means-here">What spin-like complexity means here</h2>

  <p>This is the term most likely to confuse readers, because it is <strong>not</strong> literal particle spin.</p>

  <p>The project uses SVD to split a layer into dominant modes. In plain language, SVD asks:</p>

  <blockquote>
    <p>Can this complicated activation matrix be explained mostly by one or two big patterns, or do we
need many equally important patterns?</p>
  </blockquote>

  <p>If one mode dominates, the internal structure is simple and concentrated. If energy is spread across
many directions, the structure is more mixed and complex. The blog and code call that a spin-like
quantity, but the safer mental model is simply <strong>internal complexity</strong>.</p>
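
  <p>One way to turn "one big pattern or many equal ones" into a number is the participation ratio of the
singular-value energies. This is a sketch under that assumption: the source does not say which
formula the project uses for its spin-like score, and <code class="language-plaintext highlighter-rouge">internal_complexity</code> is a hypothetical name.</p>

```python
import numpy as np

def internal_complexity(layer_matrix):
    """Participation ratio of singular-value energy for a (T, D) matrix.

    Returns about 1 when one mode dominates and about r when the energy
    is spread evenly over r modes.
    """
    s = np.linalg.svd(layer_matrix, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)  # fraction of total energy carried by each mode
    return float(1.0 / np.sum(p ** 2))

rng = np.random.default_rng(0)
T, D = 20, 8

# One dominant pattern: outer product of a token profile and a channel direction.
concentrated = np.outer(rng.standard_normal(T), rng.standard_normal(D))

# Evenly mixed: orthonormal columns give D equal singular values.
mixed, _ = np.linalg.qr(rng.standard_normal((T, D)))

print(round(internal_complexity(concentrated), 3))  # 1.0
print(round(internal_complexity(mixed), 3))         # 8.0
```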

  <hr />

  <h2 id="what-mass-means-here">What mass means here</h2>

  <p>Mass is the most intuitive part of the analogy.</p>

  <p>The layer axis is treated like discrete time, and the analysis tracks whether a mode’s strength
fades quickly or persists through many layers.</p>

  <ul>
    <li>a <strong>light</strong> mode survives for a long depth range,</li>
    <li>a <strong>heavy</strong> mode dies out quickly.</li>
  </ul>

  <p>So mass in this framework is really a measure of <strong>how quickly a pattern fades as it propagates
through the network</strong>: the lighter the mode, the deeper it reaches. It has nothing to do with weight
in any everyday sense.</p>
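
  <p>A sketch of how such a decay rate could be estimated, assuming a mode's amplitude falls off roughly
exponentially with layer index, <code class="language-plaintext highlighter-rouge">A(l) ~ A0 * exp(-m * l)</code>. The exponential form, the log-linear fit,
and the name <code class="language-plaintext highlighter-rouge">fit_mass</code> are assumptions for illustration; the amplitudes below are synthetic, not
measurements from any model.</p>

```python
import numpy as np

def fit_mass(amplitudes):
    """Fit A(l) ~ A0 * exp(-m * l) by a straight line in log space; return m."""
    layers = np.arange(len(amplitudes))
    slope, _intercept = np.polyfit(layers, np.log(amplitudes), 1)
    return float(-slope)

layers = np.arange(12)
light = 3.0 * np.exp(-0.05 * layers)  # fades slowly, survives many layers
heavy = 3.0 * np.exp(-0.80 * layers)  # fades quickly, gone after a few layers

print(round(fit_mass(light), 3))  # 0.05
print(round(fit_mass(heavy), 3))  # 0.8
```

  <p>Both synthetic modes start at the same amplitude; only the fitted rate separates the light one from
the heavy one.</p>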

  <hr />

  <h2 id="what-helicity-means-here">What helicity means here</h2>

  <p>Helicity is also a proxy, not a literal high-energy-physics observable.</p>

  <p>The simplified question is: if a mode has a preferred direction on the token lattice, does its
internal structure stay aligned with that direction across layers?</p>

  <p>If yes, the mode has a stable directional signature. If not, that signature is being scrambled as the mode passes through the layers.</p>

  <p>This is useful because two modes can have similar amplitude but very different directional stability.</p>

  <hr />

  <h2 id="why-this-framing-helps">Why this framing helps</h2>

  <p>Once the residual stream is written this way, the project can ask practical questions that are hard
to state cleanly in raw neuron space:</p>

  <ul>
    <li>Which patterns are global versus local across the sequence?</li>
    <li>Which patterns are internally simple versus heavily mixed?</li>
    <li>Which patterns are shallow noise versus deep, persistent carriers?</li>
    <li>Which patterns stay stable across prompts, tasks, and generation steps?</li>
  </ul>

  <p>That is the value of Intellitons. They are a compact language for recurring activation patterns.
They are useful if they organise observations better than a giant pile of raw activations.</p>

  <hr />

  <h2 id="the-shortest-correct-summary">The shortest correct summary</h2>

  <p>If you want the plainest possible version, it is this:</p>

  <blockquote>
    <p>Intellitons are recurring residual-stream modes described in a physics-inspired coordinate system.
DFT tells you how they vary across tokens, SVD tells you how internally concentrated they are,
propagator decay tells you how far they travel across layers, and helicity tells you whether their
internal structure keeps a stable directional signature.</p>
  </blockquote>

  <p>The next article makes that concrete by showing how to read a spectrum report and what <code class="language-plaintext highlighter-rouge">I_0</code> to
<code class="language-plaintext highlighter-rouge">I_4</code> sound like in ordinary language.</p>

  <hr />

  <h2 id="continue-reading">Continue reading</h2>

  <ul>
    <li><a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">How to Read <code class="language-plaintext highlighter-rouge">I_0</code> to <code class="language-plaintext highlighter-rouge">I_4</code></a></li>
    <li><a href="/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons.html">Why Different Prompts Light Up Different Intellitons</a></li>
  </ul>

</div>

<div data-lang="zh">

  <h2 id="section">先从最不神秘的版本开始</h2>

  <p>Intelliton 项目<strong>不是</strong>在说语言模型里真的藏着物理粒子。</p>

  <p>更准确、更实用的说法是：把变换器的残差流换到一套物理学家已经很熟悉的坐标系里，再去看
里面会不会出现稳定、反复出现、可以跨层追踪的模式。</p>

  <p>对某一层来说，残差流不过是一个矩阵：</p>

  <ul>
    <li>行数 <code class="language-plaintext highlighter-rouge">T</code> 表示 token 位置</li>
    <li>列数 <code class="language-plaintext highlighter-rouge">D</code> 表示 hidden channels</li>
  </ul>

  <p>你可以把它想成一排传感器。每个 token 位置上都有成千上万个读数。项目真正关心的，不是某
一个神经元是否重要，而是整块信号能不能被少数几个可重复使用的主模式概括出来。</p>

  <hr />

  <h2 id="section-1">最通俗的类比：一排传感器</h2>

  <p>想象一句话有 20 个 token 位置。每个位置上不是一个数字，而是一整个上千维的读数向量。这
就是某一层残差流的大致样子。</p>

  <p>现在问四个很朴素的问题：</p>

  <ol>
    <li>沿着 token 轴，这个模式是平滑变化，还是快速振荡？</li>
    <li>在 hidden channels 里，它更像单一方向，还是复杂混合？</li>
    <li>随着层数加深，它能持续很久，还是很快消失？</li>
    <li>它的内部结构，是否一直和某个传播方向绑定在一起？</li>
  </ol>

  <p>这四个问题，正好对应项目里的四个主诊断量：</p>

  <ul>
    <li><strong>动量</strong> 对应第 1 个问题</li>
    <li><strong>类自旋复杂度</strong> 对应第 2 个问题</li>
    <li><strong>质量</strong> 对应第 3 个问题</li>
    <li><strong>螺旋度代理量</strong> 对应第 4 个问题</li>
  </ul>

  <p>这就是为什么物理语言在这里有用。它把同一批隐藏模式的四个不同侧面，用一套紧凑的词汇串
了起来。</p>

  <hr />

  <h2 id="section-2">这是一套新坐标系，不是一套新本体论</h2>

  <p>项目把残差流这样重写：</p>

  <ul>
    <li><strong>token 轴</strong> 看成一维晶格上的空间</li>
    <li><strong>layer 轴</strong> 看成离散时间</li>
    <li><strong>hidden channels</strong> 看成内部自由度</li>
  </ul>

  <p>重点就在这一步。它不是说文本真的变成了物质，而是说可以借用晶格场论里熟悉的工具，来整
理模型内部的激活模式。</p>

  <p>在代码里，主要定义集中在 <code class="language-plaintext highlighter-rouge">src/lattice_field.py</code>，总流程由 <code class="language-plaintext highlighter-rouge">src/intelliton_analyzer.py</code>
串起来。</p>

  <hr />

  <h2 id="section-3">这里的“动量”到底是什么意思</h2>

  <p>在这个项目里，动量只是描述模式沿 token 位置如何变化的一种傅里叶坐标。</p>

  <ul>
    <li>如果主导动量接近 <code class="language-plaintext highlighter-rouge">k = 0</code>，说明这个模式在整个序列上比较平滑、比较全局。</li>
    <li>如果主导动量较大，说明它在相邻 token 之间切换更快、振荡更强。</li>
  </ul>

  <p>最容易懂的比喻是音频均衡器：</p>

  <ul>
    <li>低频意味着缓慢、平滑的变化</li>
    <li>高频意味着尖锐、快速的起伏</li>
  </ul>

  <p>所以当报告说某个模式是低动量，它通常不是在说“速度慢”，而是在说：这更像一个覆盖整段序
列的大尺度模式，而不是绑在某个 token 上的小噪声。</p>

  <hr />

  <h2 id="section-4">这里的“自旋”为什么其实是在看复杂度</h2>

  <p>这个词最容易让人误会，因为它<strong>不是</strong>粒子物理里的严格自旋。</p>

  <p>项目用 SVD 把某一层拆成若干个主模式。人话版的问题其实是：</p>

  <blockquote>
    <p>这一层看起来很复杂，但它是不是主要由一两个大模式支配，还是说必须靠很多差不多重要的
模式一起才能解释？</p>
  </blockquote>

  <p>如果一个模式特别突出，说明内部结构更集中、更简单。如果能量分散在许多方向上，说明内部
结构更混合、更复杂。博客和代码把这个量借用物理语言叫成 spin-like，但更稳妥的理解就是
<strong>内部复杂度</strong>。</p>

  <hr />

  <h2 id="section-5">这里的“质量”为什么就是跨层能活多久</h2>

  <p>质量是整套类比里最直观的一步。</p>

  <p>项目把 layer 轴当成离散时间，然后看一个模式的强度会不会在更深的层里迅速衰减。</p>

  <ul>
    <li><strong>轻模式</strong> 能持续很多层</li>
    <li><strong>重模式</strong> 很快就消失</li>
  </ul>

  <p>所以这里的质量，实质上是在衡量一个模式<strong>穿透网络深度的能力</strong>，而不是日常意义上的“有多
重”。</p>

  <hr />

  <h2 id="section-6">这里的“螺旋度”为什么只是方向性代理量</h2>

  <p>螺旋度在这里也只是代理量，不是高能物理里那种严格可观测量。</p>

  <p>更简单的问法是：如果某个模式在 token 晶格上有偏好的传播方向，它的内部结构会不会在跨层
传播时一直和这个方向绑在一起？</p>

  <p>如果会，说明这个模式的方向性签名更稳定。如果不会，说明它在层间被打散了。</p>

  <p>这很有用，因为两个模式即使振幅差不多，方向稳定性也可能完全不同。</p>

  <hr />

  <h2 id="section-7">为什么这套说法有帮助</h2>

  <p>一旦把残差流写成这种形式，项目就能提出一些用原始神经元空间很难直接说清的问题：</p>

  <ul>
    <li>哪些模式是全局的，哪些更局部？</li>
    <li>哪些模式内部很集中，哪些高度混合？</li>
    <li>哪些模式只是浅层噪声，哪些能一路传到深层？</li>
    <li>哪些模式能跨提示词、跨任务、跨生成步骤保持稳定？</li>
  </ul>

  <p>Intelliton 的价值就在这里。它提供了一套压缩语言，去描述那些反复出现的激活模式。只要这套
语言比一大堆原始激活更能组织观察结果，它就是有用的。</p>

  <hr />

  <h2 id="section-8">一句话总结这件事</h2>

  <p>如果只保留最通俗也最准确的一句话，那就是：</p>

  <blockquote>
    <p>Intelliton 是用物理启发坐标系描述出来的残差流重复模式。DFT 看它沿 token 怎么变化，SVD
看它内部是否集中，传播子衰减看它能走多深，螺旋度看它的内部结构是否保留稳定的方向性。</p>
  </blockquote>

  <p>下一篇文章会把这件事落到更具体的谱表上，直接教你怎么看 <code class="language-plaintext highlighter-rouge">I_0</code> 到 <code class="language-plaintext highlighter-rouge">I_4</code>。</p>

  <hr />

  <h2 id="section-9">继续阅读</h2>

  <ul>
    <li><a href="/case-study/interpretation/2026/04/02/inside-qwen-intelliton-spectrum.html">怎么看 <code class="language-plaintext highlighter-rouge">I_0</code> 到 <code class="language-plaintext highlighter-rouge">I_4</code></a></li>
    <li><a href="/applications/tasks/interpretation/2026/04/05/why-different-prompts-light-up-different-intellitons.html">为什么不同提示词会点亮不同 Intelliton 模式</a></li>
  </ul>

</div>]]></content><author><name>Intellitons Project</name></author><category term="introduction" /><category term="theory" /><summary type="html"><![CDATA[Intellitons are not claims that language models literally contain particles. They are a practical way to redescribe the residual stream using a lattice-field coordinate system that makes recurring modes easier to see, compare, and talk about.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" /><media:content medium="image" url="https://intellitons.wiki/assets/icons/android-chrome-512x512.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>