# 13.1 Harness Evaluation Methodology

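> **State of the industry**
>
> Survey data from LangChain's *State of Agent Engineering* report (1,300+ practitioners):
>
> * **52.4%** of organizations run offline evaluations on test sets
> * **37.3%** run online evaluations
> * **22.8%** of organizations with agents in production run no evaluations at all
> * **The key gap**: observability adoption is 89%, while evaluation adoption is 52%
>
> Most teams can "see" what their agents are doing, yet lack a systematic way to judge how well they are doing it. Evaluation is currently the most undervalued area of investment.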

## 13.1.1 The Three Levels of Evaluation

Agent quality needs to be evaluated at three levels, from the finest granularity to the coarsest:

```mermaid
graph LR
    A["<b>Step level</b><br/>Tool-call accuracy"] -->|aggregates into| B["<b>Trajectory level</b><br/>Trajectory efficiency"]
    B -->|aggregates into| C["<b>Task level</b><br/>Task completion rate"]
```

Figure 13-1: The three-level agent evaluation architecture

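### Level 1: Step-Level Evaluation

**Definition**: evaluates whether an individual tool call is correct.

**Key questions**:

* Was the right tool selected?
* Were the arguments correct (do the values match what was expected)?
* Did the tool execute successfully, without errors?

**Metrics**:

| Metric | Definition | Formula |
| ----- | ---------- | -------------------------------- |
| Tool accuracy | Share of calls that selected the correct tool | correct\_tools / total\_calls |
| Parameter accuracy | Share of parameters with correct values | correct\_params / total\_params |
| Execution success rate | Share of tool calls that completed without error | successful\_calls / total\_calls |

**Worked example** (the code reports a stricter `call_accuracy`, which counts a call only when both the tool and all of its arguments match, rather than per-parameter accuracy):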

```python
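from typing import List

def evaluate_step_level(trajectory: List["ToolCall"],
                       ground_truth: List["ToolCall"]) -> dict:
    """Step-level evaluation: compare each call with the reference call at the same position."""

    correct_tools = 0
    correct_calls = 0
    successful = 0
    total = len(trajectory)

    # An empty trajectory has nothing to score; avoid dividing by zero
    if total == 0:
        return {"tool_accuracy": 0.0, "call_accuracy": 0.0,
                "execution_success_rate": 0.0}

    for i, predicted_call in enumerate(trajectory):
        # Calls beyond the end of the reference have no counterpart and count as wrong
        expected = ground_truth[i] if i < len(ground_truth) else None

        # Was the right tool selected?
        tool_ok = expected is not None and predicted_call.tool_name == expected.tool_name
        if tool_ok:
            correct_tools += 1

        # Is the whole call correct (tool + arguments)?
        if tool_ok and predicted_call.args == expected.args:
            correct_calls += 1

        # Did the call execute successfully?
        if predicted_call.success:
            successful += 1

    return {
        "tool_accuracy": correct_tools / total,
        "call_accuracy": correct_calls / total,
        "execution_success_rate": successful / total,
    }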
```
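
The metrics table defines parameter accuracy as `correct_params / total_params`, which is finer-grained than an all-or-nothing comparison of `args`. One way to compute it is sketched below. The `ToolCall` dataclass, the function name `parameter_accuracy`, and the rule that parameters earn no credit when the wrong tool was called are all assumptions made for illustration, not part of the framework described here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """Hypothetical record of one tool call (fields mirror those used in this section)."""
    tool_name: str
    args: Dict[str, Any] = field(default_factory=dict)
    success: bool = True


def parameter_accuracy(trajectory: List[ToolCall],
                       ground_truth: List[ToolCall]) -> float:
    """correct_params / total_params, counted over the expected parameters."""
    correct = 0
    total = 0
    # zip pairs calls by position and ignores any unmatched tail
    for predicted, expected in zip(trajectory, ground_truth):
        total += len(expected.args)
        # Assumption: parameters earn no credit when the wrong tool was called
        if predicted.tool_name != expected.tool_name:
            continue
        for name, value in expected.args.items():
            if name in predicted.args and predicted.args[name] == value:
                correct += 1
    return correct / total if total > 0 else 0.0


expected = [
    ToolCall("search", {"query": "weather", "limit": 5}),
    ToolCall("fetch", {"url": "https://example.com"}),
]
predicted = [
    ToolCall("search", {"query": "weather", "limit": 10}),  # 1 of 2 parameters correct
    ToolCall("fetch", {"url": "https://example.com"}),      # 1 of 1 parameters correct
]
accuracy = parameter_accuracy(predicted, expected)  # 2 correct out of 3 expected
print(round(accuracy, 3))
```

Here the first call would score 0 under `call_accuracy` but still earns partial credit, which makes it easier to tell "wrong tool" apart from "right tool, one bad argument".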

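### Level 2: Trajectory-Level Evaluation

**Definition**: evaluates the efficiency and correctness of a sequence of tool calls.

**Key questions**:

* Did the agent take detours (make unnecessary calls)?
* Was the call order efficient (could the task have been finished in fewer steps)?
* Did the agent self-correct (how did it react to errors)?

**Metrics**:

| Metric | Definition | Formula |
| ------ | -------------- | ------------------------------------- |
| Trajectory length efficiency | Optimal number of calls relative to the actual number | optimal\_steps / actual\_steps |
| Error recovery rate | Share of tool errors the agent went on to recover from | recovered\_errors / total\_errors |
| Duplicate call rate | Share of calls that invoke the same tool as the immediately preceding call | duplicate\_calls / total\_calls |
| Average call depth | Average number of steps needed to complete a task | sum(trajectory\_lengths) / num\_tasks |

**Worked example**: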

```python
from typing import List

def evaluate_trajectory_level(trajectory: List["ToolCall"],
                              optimal_trajectory: List["ToolCall"]) -> dict:
    """Trajectory-level evaluation"""

    actual_length = len(trajectory)
    optimal_length = len(optimal_trajectory)

    # Call efficiency: 1.0 means the agent used exactly the optimal number of calls
    efficiency = optimal_length / actual_length if actual_length > 0 else 0

    # Error recovery: a failed call counts as recovered when the very next call succeeds
    errors = sum(1 for call in trajectory if not call.success)
    recovered = sum(1 for i, call in enumerate(trajectory)
                   if not call.success and i+1 < len(trajectory) and trajectory[i+1].success)
    # Reported as 0 when there were no errors to recover from
    recovery_rate = recovered / errors if errors > 0 else 0

    # Duplicate calls: adjacent calls that use the same tool
    duplicates = sum(1 for i in range(len(trajectory)-1)
                    if trajectory[i].tool_name == trajectory[i+1].tool_name)
    duplicate_rate = duplicates / actual_length if actual_length > 0 else 0

    return {
        "efficiency": efficiency,
        "error_recovery_rate": recovery_rate,
        "duplicate_rate": duplicate_rate,
        "trajectory_length": actual_length,
        "optimal_length": optimal_length,
    }
```
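
The function above scores a single trajectory, while the average call depth in the metrics table is defined across tasks. A minimal sketch of that cross-task aggregation follows. The helper names `average_call_depth` and `mean_efficiency` and the sample step counts are hypothetical.

```python
from typing import Sequence


def average_call_depth(trajectory_lengths: Sequence[int]) -> float:
    """sum(trajectory_lengths) / num_tasks; returns 0.0 for an empty input."""
    if not trajectory_lengths:
        return 0.0
    return sum(trajectory_lengths) / len(trajectory_lengths)


def mean_efficiency(actual: Sequence[int], optimal: Sequence[int]) -> float:
    """Average of optimal_steps / actual_steps across tasks, skipping tasks with no calls."""
    ratios = [o / a for a, o in zip(actual, optimal) if a > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical run of three tasks: calls the agent made vs. calls a reference solution needs
actual_steps = [4, 6, 5]
optimal_steps = [4, 3, 5]

depth = average_call_depth(actual_steps)            # (4 + 6 + 5) / 3 = 5.0
eff = mean_efficiency(actual_steps, optimal_steps)  # (1.0 + 0.5 + 1.0) / 3
print(depth, round(eff, 3))
```

Averaging the per-task ratios weights every task equally; dividing total optimal steps by total actual steps would instead weight long tasks more heavily. Which one is appropriate depends on whether long tasks should dominate the metric.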

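### Level 3: Task-Level Evaluation

**Definition**: evaluates whether the user's end goal was achieved.

**Key questions**:

* Is the final answer correct?
* Is the task success rate high enough?
* Are the resources consumed (tokens, time, cost) acceptable?

**Metrics**:

| Metric | Definition | Unit / range |
| ------- | -------------- | ------ |
| Task success rate | Share of tasks completed | 0-100% |
| Execution time | Average time to complete a task | seconds |
| Token efficiency | Average tokens consumed per task | tokens |
| Cost efficiency | Average API cost per task | USD |
| User satisfaction | User rating of the result | 1-5 |

**Worked example**: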

```python
from datetime import timedelta
from typing import List

def evaluate_task_level(results: List["TaskResult"]) -> dict:
    """Task-level evaluation"""

    total = len(results)
    if total == 0:
        raise ValueError("results must contain at least one task")

    successful = sum(1 for r in results if r.success)

    # Task success rate
    success_rate = successful / total

    # Execution time: sum() needs a timedelta start value,
    # because its default start of 0 cannot be added to a timedelta
    execution_times = [r.end_time - r.start_time for r in results]
    avg_time = sum(execution_times, timedelta()) / total

    # Token efficiency
    tokens = [r.tokens_used for r in results]
    avg_tokens = sum(tokens) / total

    # Cost efficiency (assumes USD 0.001 per 1,000 tokens)
    costs = [t * 0.001 / 1000 for t in tokens]
    total_cost = sum(costs)
    avg_cost = total_cost / total

    return {
        "success_rate": success_rate,
        "avg_execution_time_sec": avg_time.total_seconds(),
        "avg_tokens_per_task": avg_tokens,
        "total_cost_usd": total_cost,
        "avg_cost_per_task": avg_cost,
    }
```
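
The metrics table also lists user satisfaction, and the example hardcodes a token price. The sketch below shows one way both could be handled. The `TaskResult` dataclass, its optional `satisfaction` field, the `usd_per_1k_tokens` parameter, and the sample numbers are assumptions for illustration; real prices vary by model and provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TaskResult:
    """Hypothetical task record; `satisfaction` is an assumed optional 1-5 rating."""
    success: bool
    start_time: datetime
    end_time: datetime
    tokens_used: int
    satisfaction: Optional[int] = None


def summarize_cost_and_satisfaction(results: List[TaskResult],
                                    usd_per_1k_tokens: float = 0.001) -> dict:
    """Average cost per task and mean rating over the tasks that were rated."""
    if not results:
        return {"avg_cost_per_task": 0.0, "avg_satisfaction": None, "rated_tasks": 0}

    total_cost = sum(r.tokens_used / 1000 * usd_per_1k_tokens for r in results)
    ratings = [r.satisfaction for r in results if r.satisfaction is not None]
    return {
        "avg_cost_per_task": total_cost / len(results),
        # Unrated tasks are excluded rather than treated as a score of 0
        "avg_satisfaction": sum(ratings) / len(ratings) if ratings else None,
        "rated_tasks": len(ratings),
    }


t0 = datetime(2024, 1, 1, 12, 0, 0)
results = [
    TaskResult(True, t0, t0 + timedelta(seconds=30), tokens_used=2000, satisfaction=5),
    TaskResult(False, t0, t0 + timedelta(seconds=90), tokens_used=4000, satisfaction=2),
    TaskResult(True, t0, t0 + timedelta(seconds=60), tokens_used=3000),  # not rated
]
summary = summarize_cost_and_satisfaction(results, usd_per_1k_tokens=0.002)
print(summary)
```

Reporting `rated_tasks` alongside the average matters because satisfaction ratings are usually sparse, and an average over two ratings should not be read the same way as one over two hundred.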

## 13.1.2 The Metric System

Combining the three levels gives a complete set of evaluation metrics:

```python
from dataclasses import dataclass

@dataclass
class EvaluationMetrics:
    """The full set of evaluation metrics"""

    # ======== Step level ========
    tool_accuracy: float              # tool selection accuracy
    parameter_accuracy: float         # parameter accuracy
    execution_success_rate: float     # execution success rate

    # ======== Trajectory level ========
    trajectory_efficiency: float      # trajectory efficiency (optimal / actual)
    error_recovery_rate: float        # error recovery rate
    duplicate_rate: float             # duplicate call rate

    # ======== Task level ========
    task_success_rate: float          # task success rate
    avg_execution_time: float         # average execution time (seconds)
    avg_tokens_per_task: float        # average tokens consumed
    avg_cost_per_task: float          # average cost (USD)

    # ======== Composite ========
    # Composite quality score (0-100); defaults to 0.0 until set from compute_overall_score()
    overall_quality_score: float = 0.0

    def compute_overall_score(self) -> float:
        """
        Compute the composite quality score.

        Weights:
        - Task success rate: 40% (most critical)
        - Trajectory efficiency: 30% (critical)
        - Step accuracy: 20% (foundation)
        - Cost efficiency: 10% (practical consideration)
        """
        score = (
            self.task_success_rate * 0.4 * 100 +
            self.trajectory_efficiency * 0.3 * 100 +
            ((self.tool_accuracy + self.parameter_accuracy) / 2) * 0.2 * 100 +
            max(0, (1 - self.avg_cost_per_task / 0.1)) * 0.1 * 100  # assumes USD 0.10 per task as the baseline
        )
        return min(100, max(0, score))
```
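
To make the weighting concrete, the sketch below restates the same formula as a standalone function and works through one set of values. The function name `overall_score`, the `cost_baseline_usd` parameter, and the sample inputs are hypothetical; the weights and the USD 0.10 baseline are those of `compute_overall_score`.

```python
def overall_score(task_success_rate: float,
                  trajectory_efficiency: float,
                  tool_accuracy: float,
                  parameter_accuracy: float,
                  avg_cost_per_task: float,
                  cost_baseline_usd: float = 0.1) -> float:
    """Weighted 0-100 score: 40% success, 30% efficiency, 20% step accuracy, 10% cost."""
    step_accuracy = (tool_accuracy + parameter_accuracy) / 2
    # Cost term is 1.0 for a free task and falls to 0.0 at or above the baseline
    cost_term = max(0.0, 1 - avg_cost_per_task / cost_baseline_usd)
    score = 100 * (0.4 * task_success_rate
                   + 0.3 * trajectory_efficiency
                   + 0.2 * step_accuracy
                   + 0.1 * cost_term)
    return min(100.0, max(0.0, score))


# Hypothetical values:
#   success 0.90                          -> 36 points
#   efficiency 0.80                       -> 24 points
#   step accuracy (0.95 + 0.85) / 2 = 0.9 -> 18 points
#   cost USD 0.02 -> cost term 0.8        ->  8 points
score = overall_score(0.90, 0.80, 0.95, 0.85, 0.02)
print(round(score, 2))  # 86.0
```

Because the cost term is clamped at zero, a task that costs more than the baseline loses at most 10 points, so cost alone cannot dominate the score.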

## 13.1.3 Evaluation Framework Architecture

A core implementation of the evaluation framework:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List
import logging

logger = logging.getLogger(__name__)

class Evaluator(ABC):
    """Base class for evaluators"""

    @abstractmethod
    def evaluate(self, prediction: Any, ground_truth: Any) -> Dict[str, float]:
        """Evaluate a single sample"""
        pass

    @abstractmethod
    def aggregate(self, results: List[Dict[str, float]]) -> Dict[str, float]:
        """Aggregate the per-sample results into summary metrics for this level"""
        pass

class StepLevelEvaluator(Evaluator):
    """Step-level evaluator"""

    def evaluate(self, predicted_call: "ToolCall", ground_truth_call: "ToolCall") -> Dict[str, float]:
        """Evaluate a single tool call"""
        tool_correct = predicted_call.tool_name == ground_truth_call.tool_name
        params_correct = predicted_call.args == ground_truth_call.args
        success = predicted_call.success

        return {
            "tool_correct": 1.0 if tool_correct else 0.0,
            "params_correct": 1.0 if params_correct else 0.0,
            "execution_success": 1.0 if success else 0.0,
        }

    def aggregate(self, results: List[Dict[str, float]]) -> Dict[str, float]:
        """Aggregate per-call results into rates"""
        if not results:
            return {"tool_accuracy": 0.0, "param_accuracy": 0.0, "success_rate": 0.0}

        tool_sum = sum(r["tool_correct"] for r in results)
        param_sum = sum(r["params_correct"] for r in results)
        success_sum = sum(r["execution_success"] for r in results)
        total = len(results)

        return {
            "tool_accuracy": tool_sum / total,
            "param_accuracy": param_sum / total,
            "success_rate": success_sum / total,
        }

class TrajectoryLevelEvaluator(Evaluator):
    """Trajectory-level evaluator"""

    def evaluate(self, trajectory: List["ToolCall"],
                optimal_trajectory: List["ToolCall"]) -> Dict[str, float]:
        """Evaluate a sequence of tool calls"""

        actual_len = len(trajectory)
        optimal_len = len(optimal_trajectory)
        efficiency = optimal_len / actual_len if actual_len > 0 else 0

        # A failed call counts as recovered when the very next call succeeds
        errors = sum(1 for call in trajectory if not call.success)
        if errors > 0:
            recovery = sum(1 for i in range(len(trajectory)-1)
                         if not trajectory[i].success and trajectory[i+1].success)
            recovery_rate = recovery / errors
        else:
            recovery_rate = 0

        # Adjacent calls that use the same tool
        duplicates = sum(1 for i in range(len(trajectory)-1)
                        if trajectory[i].tool_name == trajectory[i+1].tool_name)
        duplicate_rate = duplicates / actual_len if actual_len > 0 else 0

        return {
            "efficiency": efficiency,
            "recovery_rate": recovery_rate,
            "duplicate_rate": duplicate_rate,
        }

    def aggregate(self, results: List[Dict[str, float]]) -> Dict[str, float]:
        """Average the per-trajectory results"""
        if not results:
            return {"avg_efficiency": 0, "avg_recovery_rate": 0, "avg_duplicate_rate": 0}

        avg_efficiency = sum(r["efficiency"] for r in results) / len(results)
        avg_recovery = sum(r["recovery_rate"] for r in results) / len(results)
        avg_duplicate = sum(r["duplicate_rate"] for r in results) / len(results)

        return {
            "avg_efficiency": avg_efficiency,
            "avg_recovery_rate": avg_recovery,
            "avg_duplicate_rate": avg_duplicate,
        }

class TaskLevelEvaluator(Evaluator):
    """Task-level evaluator"""

    def evaluate(self, result: "TaskResult", ground_truth: Any = None) -> Dict[str, float]:
        """Evaluate a single task (ground_truth is unused; it keeps the base-class signature)"""
        success = 1.0 if result.success else 0.0
        time_seconds = (result.end_time - result.start_time).total_seconds()
        cost = result.tokens_used * 0.001 / 1000  # assumes USD 0.001 per 1,000 tokens

        return {
            "success": success,
            "time_seconds": time_seconds,
            "tokens": float(result.tokens_used),
            "cost": cost,
        }

    def aggregate(self, results: List[Dict[str, float]]) -> Dict[str, float]:
        """Aggregate per-task results"""
        if not results:
            # Same keys as the non-empty case so callers can rely on them
            return {"success_rate": 0, "avg_time_seconds": 0, "avg_tokens": 0, "avg_cost_usd": 0}

        success_sum = sum(r["success"] for r in results)
        success_rate = success_sum / len(results)

        avg_time = sum(r["time_seconds"] for r in results) / len(results)
        avg_tokens = sum(r["tokens"] for r in results) / len(results)
        avg_cost = sum(r["cost"] for r in results) / len(results)

        return {
            "success_rate": success_rate,
            "avg_time_seconds": avg_time,
            "avg_tokens": avg_tokens,
            "avg_cost_usd": avg_cost,
        }
```
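
All three `aggregate` methods follow one pattern: average each numeric field across the per-sample dictionaries. As a design note, that pattern could be written once. The helper below, `mean_by_key`, is a hypothetical sketch and not part of the framework above; unlike the methods above, it returns an empty dict for empty input instead of zeroed keys.

```python
from typing import Dict, List, Optional


def mean_by_key(results: List[Dict[str, float]],
                rename: Optional[Dict[str, str]] = None) -> Dict[str, float]:
    """Average every key across per-sample result dicts.

    `rename` maps per-sample keys to aggregate names, e.g. "tool_correct" -> "tool_accuracy".
    Assumes every dict has the same keys as the first one.
    """
    if not results:
        return {}
    rename = rename or {}
    count = len(results)
    return {
        rename.get(key, key): sum(r[key] for r in results) / count
        for key in results[0]
    }


# Per-sample outputs shaped like those of StepLevelEvaluator.evaluate
step_results = [
    {"tool_correct": 1.0, "params_correct": 1.0, "execution_success": 1.0},
    {"tool_correct": 1.0, "params_correct": 0.0, "execution_success": 1.0},
    {"tool_correct": 0.0, "params_correct": 0.0, "execution_success": 0.0},
    {"tool_correct": 1.0, "params_correct": 1.0, "execution_success": 1.0},
]
summary = mean_by_key(step_results, rename={
    "tool_correct": "tool_accuracy",
    "params_correct": "param_accuracy",
    "execution_success": "success_rate",
})
print(summary)  # {'tool_accuracy': 0.75, 'param_accuracy': 0.5, 'success_rate': 0.75}
```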

## 13.1.4 Visualizing Evaluation Results

The following code renders the evaluation results as charts:

```python
import logging

import matplotlib.pyplot as plt

logger = logging.getLogger(__name__)

def visualize_evaluation(metrics: "EvaluationMetrics", output_file: str = "eval_report.png"):
    """Render the evaluation results as a 2x3 grid of bar charts"""

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle("Agent System Evaluation Report", fontsize=16)

    # 1. Step-level accuracy
    ax = axes[0, 0]
    accuracies = [metrics.tool_accuracy, metrics.parameter_accuracy]
    ax.bar(["Tool accuracy", "Parameter accuracy"], accuracies)
    ax.set_ylim([0, 1])
    ax.set_ylabel("Accuracy")
    ax.set_title("Step-level evaluation")

    # 2. Trajectory efficiency
    ax = axes[0, 1]
    ax.bar(["Efficiency"], [metrics.trajectory_efficiency])
    ax.set_ylim([0, 1])
    ax.set_ylabel("Efficiency (optimal / actual)")
    ax.set_title("Trajectory-level evaluation")

    # 3. Task success rate
    ax = axes[0, 2]
    ax.bar(["Success rate"], [metrics.task_success_rate])
    ax.set_ylim([0, 1])
    ax.set_ylabel("Success rate")
    ax.set_title("Task-level evaluation")

    # 4. Error recovery rate
    ax = axes[1, 0]
    ax.bar(["Recovery rate"], [metrics.error_recovery_rate])
    ax.set_ylim([0, 1])
    ax.set_ylabel("Recovery rate")
    ax.set_title("Error handling")

    # 5. Execution time
    ax = axes[1, 1]
    ax.bar(["Avg. execution time"], [metrics.avg_execution_time])
    ax.set_ylabel("Seconds")
    ax.set_title("Execution efficiency")

    # 6. Composite score
    ax = axes[1, 2]
    ax.bar(["Composite score"], [metrics.overall_quality_score])
    ax.set_ylim([0, 100])
    ax.set_ylabel("Score")
    ax.set_title("Composite quality score")

    plt.tight_layout()
    plt.savefig(output_file, dpi=300, bbox_inches='tight')
    plt.close(fig)  # release the figure once it has been written to disk
    logger.info(f"Evaluation report saved: {output_file}")
```
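
Charts suit human readers, but a machine-readable report is easier to diff between runs or to check in CI. A minimal standard-library sketch of a JSON export follows. The `MetricsSnapshot` dataclass is a trimmed, hypothetical stand-in for `EvaluationMetrics`, and the function names, file name, and sample values are assumptions.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class MetricsSnapshot:
    """Hypothetical, trimmed-down stand-in for EvaluationMetrics."""
    tool_accuracy: float
    trajectory_efficiency: float
    task_success_rate: float
    overall_quality_score: float


def save_metrics_json(metrics: MetricsSnapshot, path: str) -> None:
    """Write the metrics to `path` as pretty-printed JSON with sorted keys."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(metrics), f, indent=2, sort_keys=True)


def load_metrics_json(path: str) -> dict:
    """Read a report written by save_metrics_json back into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


snapshot = MetricsSnapshot(0.95, 0.80, 0.90, 86.0)
report_path = os.path.join(tempfile.gettempdir(), "eval_report.json")
save_metrics_json(snapshot, report_path)
loaded = load_metrics_json(report_path)
print(loaded["overall_quality_score"])  # 86.0
```

Sorting the keys keeps the file layout stable from run to run, so a textual diff shows only the values that changed.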

***

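**Section summary**: Evaluating the quality of a Harness system requires working at three levels (step, trajectory, and task), each with its own metrics and calculations. The composite score balances task success, efficiency, accuracy, and cost.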

