# Chapter Summary

**SFT** trains on instruction-response data to teach the model to converse in the format people expect. Data quality matters far more than quantity.

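**RLHF** trains a reward model from human preference annotations, then uses PPO to optimize the language model so that it generates responses humans prefer. It is effective, but the pipeline is complex and training is unstable.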
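The reward-model step can be illustrated with a pairwise loss. The sketch below assumes the commonly used Bradley-Terry formulation, in which the loss is the negative log-sigmoid of the reward gap between the preferred and the rejected response. The function names and the scalar rewards are hypothetical and chosen for illustration only; a real reward model would produce these scores from a neural network.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss for one preference pair (assumed formulation).

    The loss shrinks as the reward model scores the human-preferred
    response higher than the rejected one.
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


if __name__ == "__main__":
    # Hypothetical scalar rewards for a (chosen, rejected) pair.
    print(pairwise_reward_loss(2.0, 0.0))  # correct ranking: small loss
    print(pairwise_reward_loss(0.0, 2.0))  # wrong ranking: large loss
```

Minimizing this loss over many annotated pairs pushes the reward model to rank responses the way annotators did; PPO then uses those scores as its training signal.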

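**DPO** rewrites the RLHF objective under the standard KL-regularized preference optimization assumptions. This removes the explicit reward model and PPO and optimizes directly on preference data, which makes it simpler and more efficient. Methods such as GRPO, DAPO, GSPO, KTO, ORPO, and Constitutional AI/RLAIF extend the design space of alignment techniques in three directions: RL for reasoning, preference optimization losses, and AI feedback.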
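As a rough illustration of how DPO works without an explicit reward model, the sketch below computes the DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The function and argument names and the value `beta=0.1` are assumptions made for this example, not taken from a specific library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the KL regularization
) -> float:
    """DPO loss for one preference pair.

    The implicit reward of a response is beta * (log pi - log pi_ref);
    the loss is -log(sigmoid(reward_chosen - reward_rejected)).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return softplus(-margin)  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy raises the chosen response's log-probability: lower loss.
    print(dpo_loss(-9.0, -12.0, -10.0, -12.0))
```

The log-ratio against the reference model plays the role that the learned reward plays in RLHF, which is why only preference data and two forward passes per response are needed.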

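**Parameter-efficient fine-tuning (LoRA and similar methods)** exploits the low-rank structure of fine-tuning updates: training only a very small number of parameters achieves results close to full-parameter fine-tuning, which greatly lowers the resource barrier to fine-tuning.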
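To make the parameter savings concrete, here is a minimal pure-Python sketch of a LoRA-style layer. It assumes the usual formulation, where a frozen weight `W` is augmented by a low-rank product `B @ A` scaled by `alpha / r`, with `B` initialized to zero. The helper names and the 4096 x 4096 layer size are hypothetical choices for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable, shape (r, d_in)
    B: trainable, shape (d_out, r), conventionally initialized to zero
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, r):
    """Return (full fine-tuning params, LoRA trainable params) for one layer."""
    return d_in * d_out, r * (d_in + d_out)


if __name__ == "__main__":
    full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
    print(full, lora, f"{100 * lora / full:.2f}%")

    W = [[1.0, 2.0], [3.0, 4.0]]
    A = [[1.0, 1.0]]    # r = 1
    B = [[0.0], [0.0]]  # zero init: output equals the frozen layer
    print(lora_forward(W, A, B, [1.0, 1.0], alpha=2.0, r=1))
```

For the assumed 4096 x 4096 layer with rank 8, the trainable parameters drop from 16,777,216 to 65,536, under half a percent of the full matrix, while the zero-initialized `B` guarantees that training starts from the pretrained behavior.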

This concludes Part II, "Training". The next part, "Inference and Deployment", looks at how to make a trained model serve users efficiently.

***

> 📝 **Found an error or have a suggestion?** Feel free to open an [Issue](https://github.com/yeasy/llm_internals/issues) or a [PR](https://github.com/yeasy/llm_internals/pulls).

