# Chapter Summary

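**A first-principles analysis of the inference bottleneck** shows that the generation (decode) phase is memory-bound: the GPU spends most of its time waiting for data to be loaded rather than computing.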
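
As a rough illustration of why decoding is memory-bound, the sketch below compares the arithmetic intensity of one decode step with a hardware "ridge point" (peak compute divided by memory bandwidth). It counts weight traffic only, and the hardware figures (300 TFLOP/s, 2 TB/s) are hypothetical values chosen for the example, not measurements of any specific GPU.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one decode step.

    Simplifying assumption: only weight traffic is counted. Each weight is
    loaded once per step and takes part in one multiply-add (2 FLOPs) for
    every sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """FLOPs per byte above which a kernel becomes compute-bound."""
    return peak_flops / mem_bandwidth


def is_memory_bound(batch_size: int, bytes_per_param: float,
                    peak_flops: float, mem_bandwidth: float) -> bool:
    intensity = decode_arithmetic_intensity(batch_size, bytes_per_param)
    return intensity < ridge_point(peak_flops, mem_bandwidth)


if __name__ == "__main__":
    PEAK_FLOPS = 300e12      # hypothetical accelerator: 300 TFLOP/s
    MEM_BANDWIDTH = 2e12     # hypothetical accelerator: 2 TB/s
    for batch in (1, 16, 256):
        bound = is_memory_bound(batch, 2.0, PEAK_FLOPS, MEM_BANDWIDTH)
        print(f"batch={batch:4d} FP16 -> memory-bound: {bound}")
```

Under these assumptions a batch of 1 in FP16 performs about 1 FLOP per byte, far below the ridge point of 150, which is why techniques that reduce bytes moved matter more than techniques that reduce FLOPs.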

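**The KV cache** avoids recomputing the Key/Value of previous tokens, and **GQA** shrinks the cache by letting multiple query heads share one KV head. **PagedAttention** borrows from virtual memory management to reduce GPU memory fragmentation, and **MLA** goes further by compressing KV into a low-dimensional latent vector. **Flash Attention** uses an IO-aware blockwise algorithm to avoid storing the full $$n \times n$$ attention matrix in HBM, and **Ring Attention** extends this blockwise approach across multiple devices for distributed inference over very long sequences.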
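
To make the GQA saving concrete, here is a minimal sketch of the usual KV cache size estimate. The model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, FP16) is a hypothetical configuration chosen for the arithmetic, not one taken from this book.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both Key and Value.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape, FP16 cache.
    shape = dict(num_layers=32, head_dim=128, seq_len=4096)
    mha = kv_cache_bytes(num_kv_heads=32, **shape)  # one KV head per query head
    gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 4 query heads share a KV head
    print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")
```

In this estimate the cache shrinks in direct proportion to the number of KV heads, so sharing each KV head among four query heads cuts the cache to a quarter.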

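**Model quantization** (INT8/INT4) eases the memory-access bottleneck by reducing the bit width of each parameter, so fewer bytes are read per generated token. **Pruning and distillation** shrink the model by removing parameters and by training a smaller model to mimic a larger one, respectively.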
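
The idea behind INT8 quantization can be sketched with symmetric per-tensor absmax scaling, which is one of the simplest schemes and not necessarily the one a given inference engine uses. The function names are illustrative, and plain Python lists stand in for tensors.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def weight_bytes(num_params: int, bits: int) -> float:
    """Storage needed for the weights at a given bit width."""
    return num_params * bits / 8


if __name__ == "__main__":
    w = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(w)
    print(q, dequantize(q, scale))
    # Hypothetical 7e9-parameter model: FP16 vs INT4 weight footprint.
    print(weight_bytes(7_000_000_000, 16) / 1e9, "GB vs",
          weight_bytes(7_000_000_000, 4) / 1e9, "GB")
```

The rounding error per weight is at most half a quantization step, while the bytes read per token drop in proportion to the bit width.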

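**Speculative decoding** breaks the token-by-token generation bottleneck with a "draft first, verify later" scheme. When the draft model is well matched to the target, batch sizes are low, and the sampling configuration is suitable, it can reach the 2-3x speedup reported in the papers while leaving the target distribution unchanged. Production gains still need to be measured against actual traffic and latency targets.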
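
The claim that the target distribution is preserved rests on the standard accept/reject rule for speculative sampling: accept a drafted token `x` with probability `min(1, p(x)/q(x))`, and on rejection resample from the normalized positive part of `p - q`. The sketch below applies that rule to a single token over a toy three-token vocabulary. The distributions `p` (target) and `q` (draft) are made-up values for illustration.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        return list(p)  # p == q: a rejection cannot actually occur
    return [d / total for d in diff]


def speculative_sample(p, q, rng):
    """Draft proposes x ~ q; the target accepts it or resamples.

    Returns (token, accepted). The returned token is distributed as p.
    """
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    return rng.choices(vocab, weights=residual_distribution(p, q))[0], False


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy target distribution (assumed)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (assumed)
    rng = random.Random(0)
    n = 50_000
    counts = [0, 0, 0]
    accepted = 0
    for _ in range(n):
        token, ok = speculative_sample(p, q, rng)
        counts[token] += 1
        accepted += ok
    print([c / n for c in counts], "acceptance rate:", accepted / n)
```

The empirical token frequencies should track `p` even though every proposal came from `q`. The acceptance rate, which drives the achievable speedup, depends on how closely the draft matches the target.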

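The next chapter discusses how to integrate these optimization techniques into a complete inference engine and a production deployment.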
