A.4 Recommended Reading and References
Core Papers
Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. (The scaled dot-product attention it introduces is sketched after this list.)
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
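The scaled dot-product attention at the core of the Vaswani et al. paper is compact enough to state in a few lines: softmax(QK^T / √d_k)V. Below is a minimal NumPy sketch of that formula; the function name and toy shapes are illustrative, not from the paper.

```python
# Minimal sketch of scaled dot-product attention ("Attention Is All You Need").
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```

In a full Transformer this runs once per head, with learned projections producing Q, K, and V, plus a causal mask during decoding.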
Architecture Improvements
Su, J., et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864. (A minimal RoPE sketch follows this list.)
Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
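RoFormer's rotary position embedding (RoPE) rotates each consecutive pair of query/key dimensions by an angle proportional to the token's position, so relative position falls out of the q·k dot product. A minimal NumPy sketch, assuming the paper's standard base of 10000; names are illustrative.

```python
# Minimal sketch of rotary position embedding (RoPE) from RoFormer.
import numpy as np

def apply_rope(x):
    """x: (seq_len, d) with d even; returns position-rotated features."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1) positions
    inv_freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    angles = pos * inv_freq                          # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Applied to queries and keys before the attention scores are computed, the logit between positions m and n then depends only on the offset m − n.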
Training and Alignment
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
Hu, E., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. (A minimal adapted-layer sketch follows this list.)
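LoRA freezes the pretrained weight and trains a low-rank update BA alongside it, which can be merged back at inference. A minimal PyTorch sketch of one adapted linear layer; the class name, arguments, and initialization scale are illustrative, not any library's API.

```python
# Minimal sketch of a LoRA-adapted linear layer (Hu et al.).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_features, r))        # up-projection
        self.scaling = alpha / r

    def forward(self, x):
        # frozen base output plus the scaled low-rank correction x A^T B^T
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Zero-initializing B makes the adapter a no-op at step zero, so fine-tuning starts exactly from the pretrained model.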
Inference Optimization
Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
Leviathan, Y., et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. (Its accept/reject rule is sketched after this list.)
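Speculative decoding rests on one accept/reject rule: a cheap draft model proposes a token from its distribution q, the target model accepts it with probability min(1, p/q), and a rejection is resampled from the normalized residual max(0, p − q), which makes the output exactly distributed as p. A toy NumPy sketch of a single step, with made-up three-token distributions:

```python
# Toy sketch of the accept/reject rule from speculative decoding
# (Leviathan et al.); p is the target distribution, q the draft's.
import numpy as np

def speculative_step(p, q, rng):
    x = rng.choice(len(q), p=q)                # draft model's proposal
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)          # correction distribution
    return rng.choice(len(p), p=residual / residual.sum())

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                  # target probabilities
q = np.array([0.2, 0.5, 0.3])                  # draft probabilities
samples = [speculative_step(p, q, rng) for _ in range(10_000)]
# The empirical frequencies of `samples` converge to p, not q.
```

In practice the draft model proposes several tokens at once and the target model verifies them in a single parallel forward pass.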
Frontier Architectures
Gu, A. & Dao, T. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024. (A naive recurrence sketch follows this list.)
DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
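Mamba replaces attention with a selective state-space recurrence in which the step size and the B/C projections depend on the current input, which is what lets the state decide what to remember. The sketch below is the naive sequential reference for a single scalar channel, assuming a softplus step size, a diagonal transition, and a simplified (Euler-style) discretization of B; it omits the hardware-aware parallel scan, and all names are illustrative.

```python
# Naive sketch of the selective state-space recurrence behind Mamba.
import numpy as np

def selective_ssm(x, A, w_delta, W_B, W_C):
    """x: (seq_len,) scalar channel; A: (n,) diagonal transition (negative);
    w_delta (scalar), W_B, W_C ((n,)): input-dependent parameter projections."""
    h = np.zeros(A.shape[0])
    ys = np.empty_like(x)
    for t, xt in enumerate(x):
        delta = np.log1p(np.exp(w_delta * xt))  # softplus keeps step size > 0
        B, C = W_B * xt, W_C * xt               # input-dependent projections
        A_bar = np.exp(delta * A)               # zero-order-hold discretization
        h = A_bar * h + (delta * B) * xt        # selective state update
        ys[t] = C @ h                           # readout
    return ys

rng = np.random.default_rng(0)
x = rng.normal(size=16)
A = -np.exp(rng.normal(size=4))                 # negative => decaying memory
y = selective_ssm(x, A, 0.5, rng.normal(size=4), rng.normal(size=4))
```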
Tutorials and Visualizations
Jay Alammar. The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
Lilian Weng. The Transformer Family. Lil'Log.
Hugging Face. Transformers Documentation. https://huggingface.co/docs/transformers
Recommended Books
Jurafsky, D. & Martin, J. H. Speech and Language Processing (3rd ed.). Chapter 10 on Transformers.
邱锡鹏 (Qiu Xipeng). 《神经网络与深度学习》 (Neural Networks and Deep Learning). Chapter 15: Attention Mechanism and Transformer.