# 14.6 Continuous Iteration and Improvement

Going live is not the finish line; it is where the data feedback loop begins.

## 14.6.1 Evaluation Framework

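How do we know whether the system's answers are any good? Human intuition alone is not enough. We use the **RAGAS** framework for automated, quantitative evaluation. The main metrics are:

* **Faithfulness**: Is the answer grounded in the retrieved context? (Guards against hallucination.)
* **Answer Relevance**: Does the answer actually address the user's question?
* **Context Precision**: Is the retrieved content genuinely relevant to the question?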

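The companion lab for this chapter ships a minimal evaluation set. It starts with two baseline metrics: whether the expected source was hit, and whether the answer contains the key facts: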

```bash
python3 examples/enterprise_know/enterprise_know.py --eval
python3 -m unittest examples/enterprise_know/test_enterprise_know.py
```

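The first command checks that the evaluation set passes; the second checks that the core pipeline passes its regression tests. In the minimal local lab, the evaluation output should contain `passed: true`, with both `source_hit_rate` and `answer_term_hit_rate` equal to `1.0`, and the unit tests should end with `OK`. If you later extend the evaluation set, do not look only at the aggregate score: keep each failing case's input, expected source, actual source, and missing keywords as the starting point for the next round of debugging.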
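To make the two baseline metrics concrete, here is a minimal sketch of how they could be computed while keeping the failing cases. It is not the code in `enterprise_know.py`; the names `EvalCase`, `EvalReport`, `evaluate`, and `fake_pipeline` are hypothetical, the data is a toy example, and counting a case as a term hit only when every expected term appears is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    question: str
    expected_source: str        # document the answer is expected to cite
    expected_terms: list        # key facts the answer must mention


@dataclass
class EvalReport:
    source_hit_rate: float
    answer_term_hit_rate: float
    passed: bool
    failures: list = field(default_factory=list)


def evaluate(cases, answer_fn, threshold=1.0):
    """Score cases with answer_fn(question) -> (answer_text, source_ids)."""
    if not cases:
        raise ValueError("evaluation set is empty")
    source_hits = term_hits = 0
    failures = []
    for case in cases:
        answer, sources = answer_fn(case.question)
        source_ok = case.expected_source in sources
        missing = [t for t in case.expected_terms
                   if t.lower() not in answer.lower()]
        source_hits += int(source_ok)
        term_hits += int(not missing)
        if not source_ok or missing:
            # Keep everything needed to start the next debugging round.
            failures.append({
                "input": case.question,
                "expected_source": case.expected_source,
                "actual_sources": list(sources),
                "missing_terms": missing,
            })
    source_rate = source_hits / len(cases)
    term_rate = term_hits / len(cases)
    passed = source_rate >= threshold and term_rate >= threshold
    return EvalReport(source_rate, term_rate, passed, failures)


# Toy data and a stand-in pipeline, for illustration only.
CASES = [
    EvalCase("How many days of annual leave?", "hr_policy.md", ["10 days"]),
    EvalCase("Who approves travel expenses?", "finance.md", ["department head"]),
]


def fake_pipeline(question):
    if "leave" in question:
        return "New hires get 10 days of annual leave.", ["hr_policy.md"]
    # Correct wording, but cites the wrong document.
    return "The department head approves them.", ["hr_policy.md"]


report = evaluate(CASES, fake_pipeline)
print(report.passed, report.source_hit_rate, report.answer_term_hit_rate)
for failure in report.failures:
    print(failure)
```

In this toy run the second case contains the right key term but cites the wrong document, so it lands in `failures` with its expected and actual sources even though the term metric alone looks perfect.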

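Production evaluation should not stop at a single aggregate score. Stratify the evaluation set by department, document type, question intent, permission level, and high-risk topic, and set a separate threshold for each slice. Whenever the index, prompt, model, or reranking strategy changes, record the evaluation set version, the failing examples, the rollback conditions, and the conclusions of manual spot checks.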
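Stratified thresholds can be sketched as a small gate over per-case results. The slice names, threshold values, and the `slice_report` and `release_gate` helpers below are hypothetical placeholders chosen for illustration, not recommended values.

```python
from collections import defaultdict

# Hypothetical thresholds: high-risk topics get a stricter gate.
THRESHOLDS = {"default": 0.85, "high_risk": 1.0}


def slice_report(results, slice_key, thresholds=THRESHOLDS):
    """Group per-case results by slice and compare each pass rate to its gate.

    results: list of dicts that contain slice_key and a boolean "passed".
    """
    buckets = defaultdict(list)
    for result in results:
        buckets[result[slice_key]].append(bool(result["passed"]))
    report = {}
    for name, outcomes in buckets.items():
        pass_rate = sum(outcomes) / len(outcomes)
        required = thresholds.get(name, thresholds["default"])
        report[name] = {
            "n": len(outcomes),
            "pass_rate": pass_rate,
            "threshold": required,
            "ok": pass_rate >= required,
        }
    return report


def release_gate(report):
    """A change may ship only if every slice meets its own threshold."""
    return all(entry["ok"] for entry in report.values())


# Toy results, for illustration only.
results = [
    {"topic": "high_risk", "passed": True},
    {"topic": "high_risk", "passed": False},
    {"topic": "hr", "passed": True},
]
by_topic = slice_report(results, "topic")
print(by_topic)
print("ship:", release_gate(by_topic))
```

Here the overall pass rate is two out of three, but the `high_risk` slice sits at 0.5 against a gate of 1.0, so the change is blocked. A single aggregate score would hide exactly this kind of failure.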

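## 14.6.2 Feedback Loop

Add thumbs-up/thumbs-down (👍/👎) buttons to the interface.

* Collect users' negative feedback (bad cases).
* Analyze the root cause manually: is the knowledge base content wrong, is it a chunking or indexing problem, did retrieval miss the right content, did reranking fail, is the prompt or output schema unclear, or did the LLM misunderstand?
* Fix the knowledge base, index, retrieval/reranking, prompts, tool parameters, and evaluation set first. Consider fine-tuning the embedding model or the LLM only when the problem is a stable pattern of style, classification, or behavior, and only after data quality, privacy authorization, and labeling consistency have passed review.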
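The triage step above can be sketched as a labeled bad-case record plus a tally by root cause, which shows where to spend the next round of fixes. The `RootCause` categories mirror the list above; the `BadCase` fields and the `prioritize` helper are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RootCause(Enum):
    KNOWLEDGE_BASE = "knowledge base content error"
    CHUNKING_INDEXING = "chunking or indexing problem"
    RETRIEVAL = "retrieval missed the right content"
    RERANKING = "reranking failure"
    PROMPT_SCHEMA = "unclear prompt or output schema"
    LLM_UNDERSTANDING = "LLM misunderstanding"


@dataclass
class BadCase:
    trace_id: str          # hypothetical link back to the recorded trace
    question: str
    root_cause: RootCause  # assigned by a human reviewer


def prioritize(bad_cases):
    """Return (root_cause, count) pairs, most frequent first."""
    return Counter(case.root_cause for case in bad_cases).most_common()


# Toy thumbs-down records, for illustration only.
bad_cases = [
    BadCase("t-001", "What is the travel policy?", RootCause.RETRIEVAL),
    BadCase("t-002", "Who signs off on contracts?", RootCause.KNOWLEDGE_BASE),
    BadCase("t-003", "How do I reset my VPN token?", RootCause.RETRIEVAL),
]
for cause, count in prioritize(bad_cases):
    print(f"{count} x {cause.value}")
```

Keeping the reviewer's label next to the trace identifier also makes it straightforward to promote a fixed bad case into the evaluation set.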

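## 14.6.3 End-to-End Monitoring

Integrate an observability tool (such as LangSmith or LangFuse) and record the full trace of every call, subject to the tool version and your privacy compliance requirements. Metrics to monitor:

* Token consumption and cost
* P99 latency
* Average relevance score after reranking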
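As a rough sketch of how these three metrics could be derived from exported trace records: the record fields (`prompt_tokens`, `completion_tokens`, `latency_ms`, `rerank_score`) and the price are assumptions made for illustration, not the schema or pricing of LangSmith, LangFuse, or any model provider. P99 uses the nearest-rank definition.

```python
from statistics import mean


def p99(values):
    """Nearest-rank 99th percentile: smallest value with >= 99% at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def summarize(traces, price_per_1k_tokens):
    """Aggregate token cost, tail latency, and rerank quality over traces."""
    if not traces:
        raise ValueError("no traces to summarize")
    total_tokens = sum(t["prompt_tokens"] + t["completion_tokens"]
                       for t in traces)
    return {
        "total_tokens": total_tokens,
        "cost": total_tokens / 1000 * price_per_1k_tokens,
        "p99_latency_ms": p99([t["latency_ms"] for t in traces]),
        "mean_rerank_score": mean(t["rerank_score"] for t in traces),
    }


# Synthetic traces, for illustration only: latencies of 1..100 ms.
traces = [
    {"prompt_tokens": 100, "completion_tokens": 50,
     "latency_ms": i, "rerank_score": 0.5}
    for i in range(1, 101)
]
summary = summarize(traces, price_per_1k_tokens=0.002)  # placeholder price
print(summary)
```

Tracking the tail (P99) rather than the mean latency keeps slow outliers visible, and a falling mean rerank score can be an early signal that the index or the query mix has drifted.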

Only through continuous observation and iteration can the knowledge base system truly become the enterprise's "super brain".

