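# 6.2 Learning Rate Scheduling: Why Warm Up First, Then Decay

The learning rate is arguably the most critical hyperparameter in Transformer training. The "warm up, then decay" schedule proposed in the original paper is not just an empirical trick; it is a necessary design choice driven by the specific training dynamics of Transformers.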

## 6.2.1 The Original Schedule Formula

The Transformer learning rate schedule is:

$$\text{lr} = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup\_steps}^{-1.5})$$

The min function switches between two regimes, and the transition is continuous. At step = warmup\_steps the two terms are equal:

$$\text{warmup\_steps}^{-0.5} = \text{warmup\_steps} \cdot \text{warmup\_steps}^{-1.5}$$

Therefore:

* **Warmup phase** (step < warmup\_steps): $\text{step} \cdot \text{warmup\_steps}^{-1.5}$ is the smaller term, so the learning rate grows linearly: $\text{lr} \propto \text{step}$
* **Decay phase** (step ≥ warmup\_steps): $\text{step}^{-0.5}$ is the smaller term, so the learning rate decays gradually: $\text{lr} \propto \text{step}^{-0.5}$

The original paper uses warmup\_steps = 4000. The peak occurs at step = warmup\_steps, where the learning rate is $d_{\text{model}}^{-0.5} \cdot \text{warmup\_steps}^{-0.5}$. For d\_model = 512 and warmup\_steps = 4000, the peak is about $512^{-0.5} \cdot 4000^{-0.5} \approx 6.99 \times 10^{-4}$. Note that $512^{-0.5} \approx 0.0442$ is only the model-dimension scaling factor, not the final peak learning rate.
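To check the numbers above, the formula can be written as a small function. This is a minimal sketch; the name `transformer_lr` is chosen here for illustration, and `step` is assumed to start at 1.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from the original schedule formula; `step` starts at 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)
print(f"step  4000 (peak):   {peak:.3e}")
print(f"step  2000 (warmup): {transformer_lr(2000):.3e}")   # linear ramp: half the peak
print(f"step 16000 (decay):  {transformer_lr(16000):.3e}")  # 4x the steps: half the peak
```

Halfway through warmup the rate is half the peak, and because the decay follows $\text{step}^{-0.5}$, it takes four times as many steps as the warmup to fall back to half the peak.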

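## 6.2.2 Why Warmup Is Needed

Early in training the parameters are randomly initialized, and the gradient direction is **highly unstable**: it may point in a completely wrong direction. Starting with a large learning rate can make the model take large, undirected jumps through parameter space, which leads to:

1. **Inaccurate Adam estimates**: early on, the second-moment estimate $v_t$ is built from too few samples to be reliable, and a large learning rate amplifies that error
2. **Unstable layer normalization**: extreme gradient updates early in training can make the statistics computed by the normalization layers fluctuate wildly
3. **Degenerate attention distributions**: overly large updates can collapse the attention weights into a near-uniform distribution or concentrate them on a few positions

Warmup uses a very small learning rate at the start so the model can first build basic feature representations and stable gradient statistics; the learning rate is then raised gradually to speed up training.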
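The first point in the list above can be illustrated with a scalar version of the Adam update. The sketch below is an illustration, not the optimizer's actual implementation: `adam_updates` is a hypothetical helper, and the hyperparameters are the commonly used Adam defaults. On the very first step the bias-corrected estimates rest on a single gradient sample, so the update size is roughly the learning rate itself regardless of how large or small that gradient is.

```python
import math

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam update applied at each step for a scalar parameter."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

tiny = adam_updates([1e-4])   # a very small first gradient
huge = adam_updates([1e+2])   # a very large first gradient
print(f"first update, tiny gradient: {tiny[0]:.6f}")
print(f"first update, huge gradient: {huge[0]:.6f}")
```

Both first updates come out close to `lr`, which is why the learning rate alone determines how far a single, possibly unrepresentative, early gradient can move the parameters. Warmup keeps that distance small until $v_t$ has accumulated more samples.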

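## 6.2.3 Why Decay Is Needed

Late in training the model is already close to a local optimum of the loss. If the learning rate stays large, the model **oscillates** around that optimum and cannot converge. Gradually lowering the learning rate makes each update finer, so the model can settle smoothly into a good solution.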
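The oscillation can be reproduced on a toy problem. The sketch below runs plain gradient descent on $f(x) = |x|$, which is an assumption made purely for illustration and is far simpler than a Transformer loss. It compares a constant step size with an inverse-square-root decay; `descend_abs` is a hypothetical helper name.

```python
import math

def descend_abs(lr_fn, steps=1000, x0=1.0):
    """Run gradient descent on f(x) = |x| and return the final x."""
    x = x0
    for t in range(1, steps + 1):
        grad = (x > 0) - (x < 0)   # subgradient of |x|, i.e. sign(x)
        x -= lr_fn(t) * grad
    return x

x_const = descend_abs(lambda t: 0.3)                  # constant learning rate
x_decay = descend_abs(lambda t: 0.3 / math.sqrt(t))   # inverse-square-root decay
print(f"constant lr: |x| = {abs(x_const):.4f}")
print(f"decaying lr: |x| = {abs(x_decay):.4f}")
```

With the constant rate, the iterate keeps bouncing between two points a fixed distance from the minimum at 0. With the decaying rate, the bounce amplitude is bounded by the current step size, so the iterate ends up much closer to the minimum.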

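## 6.2.4 Modern Scheduling Strategies

Beyond the original inverse-square-root decay, schedules commonly used for modern large-model training include:

**Cosine annealing**: the learning rate falls smoothly from its peak to near zero along a cosine curve. It stays relatively high through the middle of training, which helps the model escape poor local optima.

**Linear decay**: the learning rate falls linearly from its peak to zero. It is simple, and usually performs no worse than cosine annealing.

**Warmup-Stable-Decay (WSD)**: a three-phase schedule used by models such as DeepSeek-V3: warm up, train at a constant peak learning rate for a long stretch, then decay quickly at the end. It has worked very well in large-scale training.

Whichever strategy is chosen, **a warmup phase is essential**; this has become a broad consensus in Transformer training.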
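Linear decay is the simplest of these to write down. The function below is a minimal sketch that pairs it with a linear warmup; the name `linear_decay_lr`, the 1-indexed `step`, and the default values are assumptions for illustration rather than settings taken from any particular model.

```python
def linear_decay_lr(step: int, peak_lr: float = 1e-3,
                    warmup_steps: int = 2000, total_steps: int = 20000) -> float:
    """Linear warmup to `peak_lr`, then linear decay to zero at `total_steps`."""
    if step <= warmup_steps:
        # Warmup: ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fraction of the post-warmup budget that is still left.
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

for s in (1, 2000, 11000, 20000):
    print(f"step {s:>5}: lr = {linear_decay_lr(s):.2e}")
```

The `max(0.0, ...)` guard keeps the rate at zero if training runs past `total_steps` instead of letting it go negative.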

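## 6.2.5 Visualizing Learning Rate Schedules

The schedules differ markedly in how the learning rate evolves over training. The code below plots three mainstream strategies (the original inverse square root, cosine annealing, and three-phase WSD) on one figure so their behavior can be compared directly: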

```python
import math
import os

import matplotlib.pyplot as plt
import torch

total_steps = 20000
warmup_steps = 2000
d_model = 512
peak_lr = 1e-3

# 1-indexed step numbers, shared as the x-axis for all three curves
steps = torch.arange(1, total_steps + 1).float()

# Strategy 1: original Transformer inverse-square-root schedule
lr_original = d_model ** (-0.5) * torch.minimum(
    steps ** (-0.5),
    steps * warmup_steps ** (-1.5)
)

# Strategy 2: cosine annealing (with linear warmup)
lr_cosine = torch.zeros(total_steps)
for s in range(total_steps):
    if s < warmup_steps:
        lr_cosine[s] = peak_lr * (s + 1) / warmup_steps
    else:
        progress = (s - warmup_steps) / (total_steps - warmup_steps)
        lr_cosine[s] = peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Strategy 3: three-phase WSD (warmup, stable, decay)
decay_start = int(total_steps * 0.8)   # stable phase ends at 80% of total steps
lr_wsd = torch.zeros(total_steps)
for s in range(total_steps):
    if s < warmup_steps:
        lr_wsd[s] = peak_lr * (s + 1) / warmup_steps
    elif s < decay_start:
        lr_wsd[s] = peak_lr
    else:
        # Cosine-shaped decay over the final 20% of training
        progress = (s - decay_start) / (total_steps - decay_start)
        lr_wsd[s] = peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

plt.figure(figsize=(10, 5))
plt.plot(steps.numpy(), lr_original.numpy(),
         label="Inverse square root (original Transformer)", linewidth=1.5)
plt.plot(steps.numpy(), lr_cosine.numpy(), label="Cosine annealing", linewidth=1.5)
plt.plot(steps.numpy(), lr_wsd.numpy(), label="Three-phase WSD", linewidth=1.5)
plt.axvline(x=warmup_steps, color="gray", linestyle=":", alpha=0.5,
            label=f"End of warmup (step {warmup_steps})")
plt.xlabel("Training step")
plt.ylabel("Learning rate")
plt.title("Comparison of three learning rate schedules")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
os.makedirs("_images", exist_ok=True)   # savefig does not create missing directories
plt.savefig("_images/lr_schedule_comparison.png", dpi=150)
plt.show()
```

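![Comparison of learning rate curves under different schedules](/files/QWNA3mC8dWtHHHBQgZWN)

Figure 6-1: Learning rate curves for the three schedules

The figure shows the distinct character of each strategy: **inverse square root** decays most gently, falling slowly after warmup; **cosine annealing** keeps the learning rate relatively high through mid-training and drops to near zero late in training; **three-phase WSD** holds the peak learning rate for most of training and decays quickly only in the last 20%. All three share the same warmup phase (marked by the gray line), which again underscores how important warmup is for Transformer training.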

