# Part III: Inference and Deployment

- [Chapter 9: Decoding Strategies: How Models Generate Text](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding.md)
- [9.1 Autoregressive Decoding: The Mechanics of Token-by-Token Generation](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding/9.1_autoregressive_decode.md)
- [9.2 Greedy Search and Beam Search: Deterministic and Approximate Search](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding/9.2_greedy_beam.md)
- [9.3 Sampling Strategies: The Design Intuition Behind Temperature, Top-k, and Top-p](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding/9.3_sampling.md)
- [9.4 Structured Output and Constrained Decoding](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding/9.4_constrained.md)
- [9.5 Inference-Time Scaling on the Decoding Side: Generation, Search, and Verification](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding/9.5_test_time_scaling.md)
- [Chapter Summary](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/09_decoding/summary.md)
- [Chapter 10: Inference Optimization: A First-Principles Analysis](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization.md)
- [10.1 Inference Bottleneck Analysis: Compute-Bound or Memory-Bound](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/10.1_bottleneck.md)
- [10.2 KV Cache: Why It Avoids Redundant Computation](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/10.2_kv_cache.md)
- [10.3 Flash Attention: IO-Aware Algorithm Design](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/10.3_flash_attention.md)
- [10.4 Model Quantization: Representing Weights and Activations with Fewer Bits](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/10.4_quantization.md)
- [10.5 Pruning and Knowledge Distillation: Two Paths to a Slimmer Model](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/10.5_pruning_distillation.md)
- [10.6 Speculative Decoding: Why "Guess First, Verify Later" Speeds Up Inference](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/10.6_speculative_decoding.md)
- [Chapter Summary](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/10_inference_optimization/summary.md)
- [Chapter 11: Inference Engines and Production Deployment](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving.md)
- [11.1 Inference Engine Architecture Overview](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving/11.1_engines_overview.md)
- [11.2 Continuous Batching and PagedAttention](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving/11.2_continuous_batching.md)
- [11.3 Disaggregated Prefill-Decode Architecture](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving/11.3_disaggregated_serving.md)
- [11.4 Hardware Selection: GPUs, TPUs, and Specialized Accelerators](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving/11.4_hardware.md)
- [11.5 Best Practices for Production Deployment](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving/11.5_best_practices.md)
- [Chapter Summary](https://yeasy.gitbook.io/llm_internals/di-san-bu-fen-tui-li-yu-bu-shu-pian/11_serving/summary.md)

