# Chapter Summary

This chapter examined the underlying logic of the key techniques used in Transformer training.

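**Cross-entropy loss** measures the gap between the model's predicted distribution and the true distribution; label smoothing softens the target distribution to keep the model from becoming overconfident. The **Adam optimizer** maintains bias-corrected estimates of the first and second moments of the gradient, which gives every parameter its own adaptive learning rate and addresses the limitations of SGD on complex loss landscapes. **AdamW** decouples weight decay from the adaptive learning rate, so the regularization strength is uniform across parameters instead of being rescaled by each parameter's gradient statistics.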
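
The two ideas in this paragraph can be sketched in plain Python on scalars and short lists. This is a minimal illustration, not the chapter's reference implementation: the function names `smoothed_cross_entropy` and `adamw_step` and all default hyperparameters are assumptions chosen for the example, and the smoothing convention (spreading `eps` uniformly over all classes, including the true one) is one common choice among several.

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    k = len(logits)
    # Assumed convention: (1 - eps) on the true class plus eps / k on every class.
    q = [eps / k + (1.0 - eps) * (i == target) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (t starts at 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction applies to both moments, since both start at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay: applied directly to the parameter,
    # not added to the gradient, so it is not rescaled by sqrt(v_hat).
    param = param - lr * weight_decay * param
    # Adaptive step: each parameter is scaled by its own gradient statistics.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

if __name__ == "__main__":
    print(smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, eps=0.0))
    print(smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, eps=0.1))
    print(adamw_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1))
```

With smoothing enabled, a very confident correct prediction is penalized more than without it, which is the "anti-overconfidence" effect described above. In `adamw_step`, the decay term depends only on `lr`, `weight_decay`, and the parameter itself, which is what decoupling means in practice.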

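**Learning-rate warmup followed by decay** is not ad hoc tuning but a necessary design rooted in Transformer training dynamics. During warmup the learning rate grows linearly, which gives the optimizer's moment estimates time to stabilize while the model is still close to its random initialization. During decay the rate falls slowly, in proportion to the inverse square root of the step count, so that training can settle into a fine-grained optimum. Modern strategies include cosine annealing and the three-phase warmup-stable-decay (WSD) schedule.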
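
The schedules named above can be written as small functions of the step count. The sketch below is illustrative only: the function names, the default `d_model` and `warmup_steps` values, and the linear shape of the WSD decay phase are assumptions made for the example rather than settings taken from this chapter.

```python
import math

def inverse_sqrt_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def wsd_lr(step, peak_lr, warmup_steps, decay_start, total_steps, min_lr=0.0):
    """Warmup-stable-decay: linear warmup, constant plateau, then decay.

    The decay phase is linear here, which is an assumption of this sketch.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return peak_lr + (min_lr - peak_lr) * progress

if __name__ == "__main__":
    for s in (1, 2000, 4000, 16000):
        print(s, inverse_sqrt_lr(s))
    for s in (0, 99, 500, 900, 1000):
        print(s, warmup_cosine_lr(s, 1e-3, 100, 1000), wsd_lr(s, 1e-3, 100, 800, 1000))
```

The inverse-square-root schedule peaks exactly at `warmup_steps` and halves each time the step count quadruples. The WSD variant keeps the rate flat for most of training, which makes it easier to extend a run without re-planning the whole schedule.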

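For **regularization and training stability**, Dropout prevents co-adaptation between neurons (although very large models usually omit it), gradient clipping prevents exploding gradients from crashing training (especially important for deep networks and long sequences), and weight decay keeps parameters within a reasonable range. **Gradient accumulation** makes a large effective batch size possible under a fixed memory budget and is standard practice in modern large-scale training.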
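
Gradient clipping and gradient accumulation are both short computations, sketched below on plain Python lists. The helper names and the toy one-parameter model `y = w * x` are hypothetical; the accumulation helper also assumes the batch divides evenly into micro-batches, which is what makes the accumulated gradient match the full-batch gradient exactly.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale never exceeds 1, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into one effective-batch gradient.

    Assumes len(xs) is a multiple of micro_batch_size.
    """
    num_micro = len(xs) // micro_batch_size
    acc = 0.0
    for i in range(num_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Divide by the number of micro-batches so the sum is a mean, not a total.
        acc += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return acc

if __name__ == "__main__":
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(mse_grad(0.5, xs, ys), accumulated_grad(0.5, xs, ys, micro_batch_size=2))
```

Clipping by the global norm preserves the direction of the update and only shrinks its length. In the accumulation loop, only one micro-batch needs to be resident at a time, which is where the memory saving comes from.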

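Choosing the **batch size and sequence length** is a trade-off between efficiency and quality: larger batches raise throughput but can hurt generalization, while sequence packing and dynamic-length strategies improve training efficiency. Memory analysis shows that the optimizer states (Adam's first and second moments) are the largest consumer among the model states, which motivated distributed optimization techniques such as ZeRO.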
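
A back-of-the-envelope calculation makes the memory claim concrete. The sketch below assumes a common mixed-precision accounting: 2 bytes per parameter for the half-precision weights, 2 bytes for the gradients, and 12 bytes of optimizer state (an FP32 copy of the weights plus Adam's two FP32 moments). Activation memory is deliberately left out, and the function name, the 7-billion-parameter example, and the `shard_optimizer` flag (a stand-in for partitioning optimizer states across data-parallel ranks in the spirit of ZeRO) are illustrative assumptions.

```python
def model_state_bytes(num_params, dp_degree=1, shard_optimizer=False):
    """Estimate per-device model-state memory in bytes (activations excluded)."""
    weights = 2 * num_params    # half-precision working weights
    grads = 2 * num_params      # half-precision gradients
    optimizer = 12 * num_params # FP32 master weights + Adam first and second moments
    if shard_optimizer:
        # Assumed: optimizer states are split evenly across data-parallel ranks.
        optimizer //= dp_degree
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }

if __name__ == "__main__":
    gib = 1024 ** 3
    for shard in (False, True):
        est = model_state_bytes(7_000_000_000, dp_degree=8, shard_optimizer=shard)
        print(shard, {k: round(val / gib, 1) for k, val in est.items()})
```

Under these assumptions the optimizer states account for 12 of every 16 bytes per parameter, so sharding them is the first place to look for savings.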

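The next chapter discusses how to scale these training techniques to large-scale distributed training across multiple GPUs and even multiple nodes.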


