# Chapter Summary

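This chapter walked through the Transformer's text preprocessing pipeline and every core component other than the attention mechanism, explaining why each piece is needed and the design reasoning behind it.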

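**Tokenization** is the first step in how the model processes text. Subword segmentation algorithms such as BPE strike a balance between character-level and word-level granularity, which addresses the out-of-vocabulary problem and improves coverage.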
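The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the toy corpus, the function names, and the tie-breaking rule are assumptions made for this example, and real implementations add details such as end-of-word markers or byte-level fallback.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    words = {tuple(w): f for w, f in word_freqs.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Assumed tie-break: highest count, then lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Frequent fragments such as `est` become single tokens after a few merges, while rare words remain decomposable into smaller known pieces, which is why no word is ever truly out of vocabulary.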

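**Word embeddings** map discrete token indices to dense, continuous vectors, which makes mathematical operations on tokens possible. The static embeddings of earlier models evolved into the initial representation used in a Transformer, and scaling (for example, multiplying by $$\sqrt{d_{\text{model}}}$$) helps keep lexical information from being drowned out by the positional encoding.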
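The effect of the scaling factor can be checked numerically. The sketch below assumes a small hypothetical embedding table initialized from $$\mathcal{N}(0, 1/d_{\text{model}})$$ and the sinusoidal positional encoding; both the initialization and the sizes are assumptions for illustration, not something this chapter prescribes.

```python
import math
import random

d_model = 8      # assumed toy width
vocab_size = 10  # assumed toy vocabulary
rng = random.Random(0)

# Hypothetical embedding table with N(0, 1/d_model) entries,
# so each row has an L2 norm of roughly 1.
embedding = [
    [rng.gauss(0.0, d_model ** -0.5) for _ in range(d_model)]
    for _ in range(vocab_size)
]

def embed(token_ids, scale=True):
    """Look up embeddings, optionally multiplying by sqrt(d_model)."""
    factor = math.sqrt(d_model) if scale else 1.0
    return [[factor * v for v in embedding[t]] for t in token_ids]

def sinusoidal_pe(pos):
    """Sinusoidal positional encoding for one position."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

raw = embed([3], scale=False)[0]
scaled = embed([3], scale=True)[0]
pe = sinusoidal_pe(5)

print("unscaled embedding norm:", l2_norm(raw))
print("scaled embedding norm:  ", l2_norm(scaled))
print("positional encoding norm:", l2_norm(pe))
```

Each sine and cosine pair contributes exactly 1 to the squared norm, so the positional encoding always has norm $$\sqrt{d_{\text{model}}/2}$$. Under the assumed initialization, the unscaled embedding is noticeably smaller than that, and the scaling brings the two to a comparable magnitude.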

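**Positional encoding** compensates for the permutation equivariance of the attention mechanism. Because self-attention on its own cannot tell apart different orderings of the same tokens, positional information has to be injected explicitly through an external encoding.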
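Permutation equivariance is easy to verify on a toy example. The sketch below uses single-head self-attention with identity query, key, and value projections (an assumption to keep the code short; learned projections applied per token do not change the conclusion) and no positional information.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections.

    X is a list of token vectors; no positional information is used.
    """
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
perm = [2, 0, 1]                      # an arbitrary reordering of the tokens
X_perm = [X[i] for i in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)
expected = [Y[i] for i in perm]       # the same reordering applied to the output

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(Y_perm, expected)
    for a, b in zip(row_a, row_b)
)
print("max difference:", max_diff)
```

Shuffling the input rows only shuffles the output rows in the same way: each token receives the same vector regardless of where it sits, so word order carries no signal unless a positional encoding is added.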

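**Feed-forward networks** supply position-wise nonlinear transformation and channel mixing beyond what the attention layer offers, and they are also where a number of factual associations are comparatively easy to localize in analysis. Their "expand, then project back down" structure (wide in the middle, i.e. an inverted bottleneck) enlarges representational capacity while keeping input and output dimensions consistent.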
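A minimal sketch of the position-wise feed-forward network follows. The toy sizes, random weights, and ReLU activation are assumptions for illustration; the 4x expansion ratio matches the original Transformer, but many models use other ratios and activations.

```python
import random

d_model, d_ff = 4, 16   # assumed toy sizes with a 4x expansion
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical weights; a trained model would learn these.
W1, b1 = rand_matrix(d_model, d_ff), [0.0] * d_ff
W2, b2 = rand_matrix(d_ff, d_model), [0.0] * d_model

def linear(x, W, b):
    """Compute x @ W + b for a single vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn_token(x):
    """Expand to d_ff, apply ReLU, project back to d_model."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def ffn(X):
    """Apply the same network to every position independently."""
    return [ffn_token(x) for x in X]

X = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4], [2.0, -1.0, 0.0, 1.0]]
Y = ffn(X)
print(len(Y), len(Y[0]))
```

The output for a given position depends only on that position's input vector; mixing across positions is left entirely to attention.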

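**Residual connections** build a "highway" for gradients, mitigating the degradation problem of deep networks and allowing Transformers to be stacked to dozens or even more than a hundred layers. Because the sublayer output is added to its input, the two must have the same shape, which also explains why every layer's output dimension stays at $$d_{\text{model}}$$.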
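The "highway" intuition can be illustrated with a scalar toy model. With a residual connection each layer computes `y = x + f(x)`, so its derivative is `1 + f'(x)` rather than `f'(x)`. The sketch below multiplies these per-layer derivatives through a stack; the depth and the value of the local derivative are hypothetical numbers chosen only to make the contrast visible.

```python
def grad_through_stack(local_grads, residual):
    """Chain-rule product of per-layer derivatives for a scalar toy network.

    Without a residual connection a layer contributes f'(x);
    with one it contributes 1 + f'(x).
    """
    total = 1.0
    for g in local_grads:
        total *= (1.0 + g) if residual else g
    return total

# Hypothetical setting: 50 layers whose sublayers each have a small derivative.
local_grads = [0.01] * 50

plain = grad_through_stack(local_grads, residual=False)
skip = grad_through_stack(local_grads, residual=True)

print("without residual:", plain)
print("with residual:   ", skip)
```

Without the identity path the gradient reaching the first layer collapses toward zero, while the residual version stays on the order of 1. Real networks are not scalar and their derivatives vary by layer, so this is only an intuition aid.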

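**Layer normalization** stabilizes the numerical distribution of activations during training. LayerNorm is chosen over BatchNorm because of the nature of sequence tasks: variable-length inputs and small batches make BatchNorm a poor fit. Modern models commonly use the Pre-Norm configuration together with the cheaper RMSNorm.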
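The difference between LayerNorm and RMSNorm, and the shape of a Pre-Norm block, can be sketched as follows. Learnable gain and bias parameters are omitted for brevity, and `pre_norm_block` and its `sublayer` argument are hypothetical names introduced for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation (one token)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; no mean subtraction."""
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]

def pre_norm_block(x, sublayer, norm=rms_norm):
    """Pre-Norm: normalize first, apply the sublayer, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)

# Hypothetical sublayer standing in for attention or the feed-forward network.
halve = lambda h: [0.5 * v for v in h]
y = pre_norm_block(x, halve)

print("LayerNorm:", ln)
print("RMSNorm:  ", rn)
print("Pre-Norm block output:", y)
```

Both functions use only the features of a single token, so the result does not depend on batch size or on the length of other sequences in the batch, which is the property BatchNorm lacks. RMSNorm skips the mean computation, which is where its savings come from.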

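**Three architecture variants** (encoder-only, decoder-only, and encoder-decoder) each have their own use cases. Thanks to its good scalability and generality, the decoder-only architecture has become the mainstream choice in the era of large language models.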
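At the level of self-attention, the most visible difference between the variants is the attention mask. The sketch below builds the two basic patterns; `attention_mask` is a hypothetical helper written for this example. Encoder-only models use the bidirectional pattern, decoder-only models use the causal one, and encoder-decoder models combine a bidirectional encoder with a causal decoder plus cross-attention (not shown here).

```python
def attention_mask(seq_len, causal):
    """Return a mask where mask[i][j] is True if position i may attend to j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

def show(mask):
    return "\n".join(" ".join("1" if m else "0" for m in row) for row in mask)

bidirectional = attention_mask(3, causal=False)  # encoder-style
causal = attention_mask(3, causal=True)          # decoder-style

print("bidirectional:")
print(show(bidirectional))
print("causal:")
print(show(causal))
```

The causal mask is lower triangular: each position sees only itself and earlier positions, which is what allows a decoder to be trained on next-token prediction over all positions in parallel.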

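The next chapter takes a closer look at the different positional encoding schemes and their design philosophies, one of the most actively evolving areas of the Transformer architecture in recent years.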


