# Chapter Summary

This chapter examined the attention mechanism, the core component of the Transformer. The key takeaways are:

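**The intuition behind Query, Key, and Value**: Attention works like information retrieval. The Query expresses "what am I looking for", the Key expresses "what label do I offer for matching", and the Value expresses "what content do I provide". The three separate projections give every position three independent representations, each serving a distinct role.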
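
The retrieval analogy can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the learned projections are omitted, the Query, Key, and Value vectors are passed in directly, and the toy data and the helper names `softmax` and `attention` are assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no learned projections)."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Match the query against every key ("what am I looking for" vs. "what label is offered").
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Return a weighted average of the values ("what content is provided").
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return outputs

# Hypothetical toy data: the query aligns with the second key only.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
queries = [[0.0, 5.0]]
result = attention(queries, keys, values)
print(result)
```

Most of the weight lands on the second value because its key has the largest dot product with the query. The other two values receive small, equal weights.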

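**Why the scaling factor** $$\sqrt{d_k}$$ **is necessary**: When the components of the query and key vectors are independent with zero mean and unit variance, the variance of their dot product grows linearly with the dimension $$d_k$$. In high dimensions this pushes Softmax into its saturated region, where gradients approach zero. Dividing by $$\sqrt{d_k}$$ rescales the variance of the dot product back to 1. This is not an empirically tuned constant; it follows directly from the statistical analysis.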
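
The variance argument can be checked empirically. The sketch below assumes that query and key components are drawn independently from a standard normal distribution. The sample count, seed, and function names are illustrative choices.

```python
import math
import random

def sample_variance(xs):
    """Unbiased sample variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def dot_product_variances(d_k, num_samples=4000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) for random q and k."""
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(num_samples):
        # Assumption: independent components with zero mean and unit variance.
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(qi * ki for qi, ki in zip(q, k))
        raw.append(dot)
        scaled.append(dot / math.sqrt(d_k))
    return sample_variance(raw), sample_variance(scaled)

raw_var, scaled_var = dot_product_variances(d_k=64)
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Under these assumptions the raw variance should land near $$d_k$$ and the scaled variance near 1, up to sampling noise.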

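**Advantages of multi-head attention**: By splitting the representation space into several subspaces that are processed in parallel, multi-head attention can capture different kinds of relationships (syntactic, semantic, positional, and so on) at the same time. Because each head works in a lower-dimensional subspace (the model dimension divided by the number of heads), the total computation stays roughly the same as for a single full-dimension head. After training, different heads spontaneously specialize in different functions.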
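
The following sketch illustrates the subspace split and the cost accounting. It is a simplification: the learned input and output projections are left out, and the helper names (`attend`, `split_heads`, `multi_head_attention`, `score_multiply_adds`) and the toy input are assumptions made for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_rows, k_rows, v_rows):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(k_rows[0])
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in k_rows]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, v_rows))
                    for j in range(len(v_rows[0]))])
    return out

def split_heads(rows, num_heads):
    """Split each row of size d_model into num_heads chunks of size d_model // num_heads."""
    d_head = len(rows[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in rows] for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    """Run attention independently in each subspace, then concatenate the results."""
    per_head = [
        attend(qh, kh, vh)
        for qh, kh, vh in zip(split_heads(q, num_heads),
                              split_heads(k, num_heads),
                              split_heads(v, num_heads))
    ]
    return [[x for head in per_head for x in head[i]] for i in range(len(q))]

def score_multiply_adds(seq_len, d_model, num_heads):
    """Multiply-adds needed to compute all attention scores across all heads."""
    d_head = d_model // num_heads
    return num_heads * seq_len * seq_len * d_head

# Hypothetical toy input: 3 positions, d_model = 4, used as Q, K, and V.
x = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, x, x, num_heads=2)
print(score_multiply_adds(128, 512, 1), score_multiply_adds(128, 512, 8))
```

The two printed counts are equal: using more heads shrinks each head's dimension by the same factor, so the score computation costs the same in total.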

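**The design logic of the causal mask**: In autoregressive generation, the model must not "see the future". The causal mask sets the attention scores of future positions to $$-\infty$$ before Softmax, so their weights become zero and information can only flow from left to right.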
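
A small sketch of the masking step, assuming a square matrix of raw attention scores in which row `i` is the query position and column `j` is the key position. The function names and the uniform toy scores are illustrative.

```python
import math

NEG_INF = float("-inf")

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def apply_causal_mask(scores):
    """Replace scores of future positions (j > i) with -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)]

# Hypothetical uniform scores for a sequence of length 3.
scores = [[0.0] * 3 for _ in range(3)]
weights = [softmax(row) for row in apply_causal_mask(scores)]
for row in weights:
    print(row)
```

Each row distributes its weight only over the current and earlier positions, so the first position attends only to itself and the last position attends to the whole prefix.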

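**Quadratic complexity is the central trade-off**: Self-attention pays $$O(n^2)$$ in computation and memory and in return gets an $$O(1)$$ information path length between any two positions and full parallelism across positions. This trade-off is very favorable for short to medium-length sequences, but it becomes a bottleneck for very long sequences and has motivated a large body of follow-up optimization work.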
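
To make the quadratic growth concrete, the sketch below counts the entries of one attention score matrix and the memory needed to store it. It assumes a naive implementation that stores the full matrix for a single head and uses 2 bytes per entry (for example, 16-bit floats). Both the function names and these figures are illustrative assumptions.

```python
def attention_matrix_entries(seq_len):
    """Number of query-key scores for one head: one per pair of positions."""
    return seq_len * seq_len

def attention_matrix_bytes(seq_len, bytes_per_entry=2):
    """Memory for one full score matrix; 2 bytes per entry is an assumption."""
    return attention_matrix_entries(seq_len) * bytes_per_entry

for n in (1024, 4096, 32768):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {attention_matrix_entries(n):>13,} scores, {mib:,.0f} MiB")
```

Doubling the sequence length quadruples both the number of scores and the memory needed to hold them.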

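The next chapter continues with the other core components of the Transformer: word embeddings, feed-forward networks, residual connections, and layer normalization. They work together with the attention mechanism to form the complete Transformer architecture.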


