# Chapter Summary

This chapter examined the core paradigms of Transformer pretraining and the design logic behind them.

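**Autoregressive language models** use the seemingly simple objective of "predict the next token" to force the model to learn syntax, semantics, commonsense knowledge, and reasoning. This general-purpose unsupervised learning signal is the dominant pretraining approach for today's large language models.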
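The next-token objective can be sketched in a few lines. This is a minimal illustration using only the standard library, not a training implementation: the function name `next_token_loss` and the toy inputs are assumptions made for this example. The point it shows is the shift by one position, where the scores at position t are compared against the token at position t+1.

```python
import math

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per input position (hypothetical toy input)
    token_ids: the token ids of the same sequence, same length as logits
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form one prediction target")
    total = 0.0
    count = 0
    # Position t predicts token t+1, so the last position has no target.
    for row, target in zip(logits[:-1], token_ids[1:]):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]  # negative log-probability of the target
        count += 1
    return total / count

# Toy case: 3-token vocabulary, uniform scores, so every target has probability 1/3.
logits = [[0.0, 0.0, 0.0]] * 3
tokens = [0, 1, 2]
loss = next_token_loss(logits, tokens)
print(f"{loss:.4f}")  # 1.0986, i.e. ln(3)
```

Every position except the last contributes one term to the loss, which is why the autoregressive objective yields a dense training signal.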

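**BERT's masked language model (MLM)** randomly masks tokens and predicts them, which enables bidirectional context modeling. The three-way masking strategy (80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged) reduces the model's over-reliance on the mask token. However, the main prediction loss is computed only at the masked positions, so the training signal is sparser than in token-by-token autoregressive training.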
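The 80/10/10 rule can be sketched as follows. This is an illustrative sketch, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the `mask_id` value are hypothetical, and the 15% selection rate is the commonly cited BERT default rather than something stated in this summary.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for MLM; labels[i] is None where no loss is computed."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected: input unchanged, no loss here
        labels[i] = tok  # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

# Hypothetical toy setup: vocabulary of 1000 ids, with id 999 standing in for [MASK].
rng = random.Random(0)
tokens = [rng.randrange(999) for _ in range(20)]
inputs, labels = mask_tokens(tokens, vocab_size=1000, mask_id=999, rng=rng)
selected = sum(label is not None for label in labels)
print(f"{selected} of {len(tokens)} positions contribute to the loss")
```

Only the selected positions carry a label, which makes the sparser training signal concrete: with a 15% selection rate, roughly 85% of positions contribute nothing to the prediction loss.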

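**Encoder-decoder pretraining** (for example, T5 and BART) unifies understanding and generation, but at very large scale it has been overtaken by the simpler decoder-only architecture.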

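**Scaling laws** reveal a power-law relationship between model performance and both parameter count and data volume. The Chinchilla principle states that, under a fixed training compute budget, parameters and data should grow together. Data quality and governance (deduplication, filtering, domain mixing, contamination control, and license tracking) matter as much to model capability as scale does.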
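A rough worked example makes "grow together" concrete. The sketch below rests on two widely used approximations that are assumptions here, not results from this chapter: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a Chinchilla-style rule of thumb of about 20 training tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a compute budget into (parameters, tokens) under two rough assumptions.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, which together give
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (5.88e21, 5.88e23):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.2e} FLOPs -> N~{n:.2e} params, D~{d:.2e} tokens")
```

Under these assumptions, a 100x larger budget raises both the parameter count and the token count by 10x, rather than putting all the extra compute into model size. For a budget of about 5.88e23 FLOPs the sketch returns roughly 70 billion parameters and 1.4 trillion tokens, in line with the published Chinchilla configuration. The exact ratio depends on the fitted scaling-law coefficients, so treat the numbers as an order-of-magnitude guide.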

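The next chapter covers the concrete techniques used during training: loss functions, optimizers, learning-rate schedules, and more. Understanding this "underlying logic" is essential to training large models successfully.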


