
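# 8.2 State Machines and Workflow Engines

State machines are the mathematical foundation of workflow orchestration: they describe system behavior with a finite set of states and transition rules. This section covers FSM principles, Claude Code's workflow implementation, OpenClaw's Lobster engine, and a complete Python state machine implementation.

## 8.2.1 Core Principles of Finite State Machines

A finite state machine (FSM) describes how a system's behavior evolves under different conditions in terms of states and the transitions between them.

In agent orchestration, an FSM defines:

1. **A finite set of states**: each stage of the workflow (such as "pending validation", "executing", "completed")
2. **Input symbols**: the events that trigger transitions (such as "approve", "error", "timeout")
3. **A transition function**: δ(state, event) → new\_state
4. **An initial state** and **accepting states**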
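The four components above can be sketched as a lookup table plus a small driver. This is a minimal illustration, not any engine's real API: the state names, event names, and the helpers `delta` and `accepts` are assumptions chosen for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical states and events, chosen only for illustration.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("pending_validation", "approve"): "executing",
    ("pending_validation", "error"): "failed",
    ("executing", "done"): "completed",
    ("executing", "error"): "failed",
    ("executing", "timeout"): "failed",
}
INITIAL_STATE = "pending_validation"
ACCEPTING_STATES = {"completed"}


def delta(state: str, event: str) -> Optional[str]:
    """Transition function: return the next state, or None if undefined."""
    return TRANSITIONS.get((state, event))


def accepts(events: List[str]) -> bool:
    """Feed events from the initial state; True if we end in an accepting state."""
    state = INITIAL_STATE
    for event in events:
        next_state = delta(state, event)
        if next_state is None:
            return False  # no transition defined for this (state, event) pair
        state = next_state
    return state in ACCEPTING_STATES


print(accepts(["approve", "done"]))     # True
print(accepts(["approve", "timeout"]))  # False
```

Keeping δ as plain data makes the machine easy to inspect and test; the engines discussed below add conditions, actions, and side effects on top of this core.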

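## 8.2.2 How Claude Code Executes Workflows

In Claude Code, FSMs are implemented through the agent's Task system, which supports defining a state machine via the Workflow task type. Key characteristics:

* **Code-first**: states and transitions are defined in Python code
* **Flexible conditions**: any Python expression can serve as a transition condition
* **Built-in context**: context variables and execution history are maintained automatically
* **Error handling**: deeply integrated with the agent's error recovery mechanism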

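## 8.2.3 Overview of the OpenClaw Lobster Engine

**Lobster** is OpenClaw's deterministic workflow engine. Its core characteristics:

* **Declarative definitions**: workflows are defined in YAML, with no coding required
* **Deterministic execution**: the same input is guaranteed to produce the same execution path and output
* **Pause before side effects**: execution pauses before a side effect and waits for human approval
* **Automatic recovery**: interrupted executions can resume from a checkpoint
* **Auditable**: complete execution history and decision log

The Lobster engine executes a workflow as follows: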

```
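1. Parse the YAML → build the internal FSM representation
2. Initialize the execution context
3. Loop:
   - Evaluate the conditions on the current state's outgoing edges
   - Select a transition whose condition holds
   - Mark side effects (approval required) or run pure computation
   - Transition to the next state
   - Check whether a terminal state has been reached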
```

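The core loop of the state machine provides a general execution framework. So that non-technical users can also define and configure workflows, the system provides a standard YAML syntax for describing a workflow's structure and behavior declaratively.

## 8.2.4 YAML Workflow Definition Syntax

### Basic Structure

The basic structure of a YAML workflow definition is as follows: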

```yaml
version: "1.0"
name: "workflow name"
description: "workflow description"

# Global variables
variables:
  max_retries: 3
  timeout: 300

# State definitions
states:
  start:
    type: initial

  validate:
    type: normal
    actions:
      - id: validation_check
        type: tool_call
        tool: validator
        params:
          data: "{{ context.input }}"

  execute:
    type: normal
    actions:
      - id: main_execution
        type: tool_call
        tool: executor
        side_effect: true  # requires approval

  retry:
    type: normal
    actions:
      - id: retry_execution
        type: tool_call
        tool: executor

  error:
    type: error

  complete:
    type: final

# Transition definitions
transitions:
  - from: start
    to: validate

  - from: validate
    to: execute
    condition: "{{ result.validation_check.success }}"

  - from: validate
    to: error
    condition: "{{ not result.validation_check.success }}"

  - from: execute
    to: complete
    condition: "{{ result.main_execution.success }}"

  - from: execute
    to: retry
    condition: "{{ attempt < variables.max_retries }}"

  # Fallbacks so that neither execute nor retry is a dead end
  - from: execute
    to: error

  - from: retry
    to: complete
    condition: "{{ result.retry_execution.success }}"

  - from: retry
    to: error

# Error handling
error_handlers:
  - on_state: "*"
    action: log_error
    fallback_state: error
```

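A `tool_call` action describes workflow intent; it does not mean the orchestration layer may bypass execution governance. Actual execution must go through the runtime / tool executor's path for permission checks, parameter validation, auditing, and tracing, especially for actions marked `side_effect: true`.

### State Types in Detail

| State type | Meaning        | Characteristics                                           |
| ---------- | -------------- | --------------------------------------------------------- |
| initial    | Initial state  | The state the workflow starts in; there is exactly one    |
| normal     | Normal state   | Runs its actions and evaluates transition conditions      |
| final      | Final state    | The workflow completed successfully; there may be several |
| error      | Error state    | The workflow failed or was aborted abnormally             |
| wait       | Waiting state  | Waits for external input or an asynchronous result        |
| parallel   | Parallel state | Runs multiple branches at the same time                   |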
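The constraints in the table can be checked mechanically before a workflow runs. The sketch below assumes the YAML has already been parsed into a plain dictionary (for example by a YAML library) and that `final` and `error` states are both terminal. `validate_workflow` is a hypothetical helper, not part of Lobster.

```python
from typing import Any, Dict, List


def validate_workflow(definition: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems: List[str] = []
    states = definition.get("states", {})
    transitions = definition.get("transitions", [])

    # Exactly one initial state is allowed.
    initial = [name for name, spec in states.items() if spec.get("type") == "initial"]
    if len(initial) != 1:
        problems.append(f"expected exactly one initial state, found {len(initial)}")

    # Assumption: both final and error states end the workflow.
    terminal_types = {"final", "error"}
    if not any(spec.get("type") in terminal_types for spec in states.values()):
        problems.append("no final or error state defined")

    # Every transition must connect two declared states.
    sources = set()
    for transition in transitions:
        for key in ("from", "to"):
            if transition.get(key) not in states:
                problems.append(
                    f"transition references unknown state: {transition.get(key)!r}"
                )
        sources.add(transition.get("from"))

    # A non-terminal state without outgoing transitions is a dead end.
    for name, spec in states.items():
        if spec.get("type") not in terminal_types and name not in sources:
            problems.append(f"state {name!r} has no outgoing transition")
    return problems
```

Running a check like this at load time surfaces dead-end states and misspelled state names before any tool is called.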

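### Conditional Branching Example

A YAML definition of conditional branching looks like this. The unconditional default branch is listed last, which matters if the engine takes the first transition whose condition holds: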

```yaml
states:
  classify:
    type: normal
    actions:
      - id: classify_task
        type: tool_call
        tool: classifier
        params:
          input: "{{ context.data }}"

transitions:
  # Conditional branches based on the classification result
  - from: classify
    to: path_a
    condition: "{{ result.classify_task.category == 'typeA' }}"

  - from: classify
    to: path_b
    condition: "{{ result.classify_task.category == 'typeB' }}"

  - from: classify
    to: path_c
    condition: "{{ result.classify_task.category == 'typeC' }}"

  # Default branch
  - from: classify
    to: unknown
    condition: "{{ true }}"
```
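Why the default branch has to come last can be shown in a few lines. The sketch below assumes first-match semantics (the same rule the Python executor later in this section uses) and represents an unconditional branch as `None`. The helper names `first_match` and `shadowed_targets` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A branch is (target_state, condition); a condition of None means "always true".
Branch = Tuple[str, Optional[Callable[[Dict[str, str]], bool]]]


def first_match(branches: List[Branch], result: Dict[str, str]) -> Optional[str]:
    """Return the target of the first branch whose condition holds."""
    for target, condition in branches:
        if condition is None or condition(result):
            return target
    return None


def shadowed_targets(branches: List[Branch]) -> List[str]:
    """Targets listed after an unconditional branch can never be selected."""
    for index, (_, condition) in enumerate(branches):
        if condition is None:
            return [target for target, _ in branches[index + 1:]]
    return []


branches: List[Branch] = [
    ("path_a", lambda r: r["category"] == "typeA"),
    ("path_b", lambda r: r["category"] == "typeB"),
    ("path_c", lambda r: r["category"] == "typeC"),
    ("unknown", None),  # default branch, must stay last
]

print(first_match(branches, {"category": "typeB"}))  # path_b
print(first_match(branches, {"category": "other"}))  # unknown
print(shadowed_targets(branches))                    # []
```

If the default branch were moved to the top, `shadowed_targets` would report all three typed paths as unreachable, which is a cheap lint to run when loading a workflow.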

### Loops and Retries

A YAML definition of a retry loop looks like this:

```yaml
states:
  retry_loop:
    type: normal
    variables:
      attempt: 0
    actions:
      - id: attempt_operation
        type: tool_call
        tool: operation
        params:
          data: "{{ context.data }}"

transitions:
  # On success, continue
  - from: retry_loop
    to: process_result
    condition: "{{ result.attempt_operation.success }}"

  # On failure with retries remaining, retry
  - from: retry_loop
    to: retry_loop
    condition: "{{ not result.attempt_operation.success and state.attempt < variables.max_retries }}"
    on_transition:
      - action: increment_variable
        variable: state.attempt

  # On failure with no retries remaining, go to error handling
  - from: retry_loop
    to: handle_error
    condition: "{{ not result.attempt_operation.success and state.attempt >= variables.max_retries }}"
```
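The three transitions above amount to a bounded loop. The sketch below mirrors them in plain Python, assuming that `state.attempt` persists across self-transitions rather than being reset to 0 on re-entry; `run_retry_loop` is a hypothetical helper written for this illustration.

```python
from typing import Any, Callable, Dict


def run_retry_loop(
    operation: Callable[[], Dict[str, Any]], max_retries: int
) -> Dict[str, Any]:
    """Call operation until it succeeds or the retry budget is spent."""
    attempt = 0  # mirrors state.attempt in the YAML
    while True:
        result = operation()
        if result.get("success"):
            # retry_loop -> process_result
            return {"next_state": "process_result", "calls": attempt + 1}
        if attempt < max_retries:
            attempt += 1  # on_transition: increment_variable
            continue      # retry_loop -> retry_loop
        # retry_loop -> handle_error
        return {"next_state": "handle_error", "calls": attempt + 1}


calls = {"n": 0}


def flaky() -> Dict[str, Any]:
    """Fails twice, then succeeds."""
    calls["n"] += 1
    return {"success": calls["n"] >= 3}


print(run_retry_loop(flaky, max_retries=3))
# {'next_state': 'process_result', 'calls': 3}
```

Note that under these conditions `max_retries: 3` allows four calls in total: the first attempt plus three retries.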

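## 8.2.5 Python State Machine Implementation

The Python implementation consists of data structure definitions, the execution engine logic, and a concrete usage example. It is presented in four parts:

### Part 1: Data Structures and Enum Types

First, define the data structures that represent states, actions, and transitions: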

```python
from typing import Dict, List, Callable, Any, Optional, Tuple
from dataclasses import dataclass, field
from enum import Enum
import json
from datetime import datetime

class ActionType(Enum):
    """Action types."""
    TOOL_CALL = "tool_call"
    CONDITIONAL = "conditional"
    SET_VARIABLE = "set_variable"
    LOG = "log"
    PARALLEL = "parallel"

class StateType(Enum):
    """State types."""
    INITIAL = "initial"
    NORMAL = "normal"
    FINAL = "final"
    ERROR = "error"
    WAIT = "wait"
    PARALLEL = "parallel"

@dataclass
class Action:
    """Definition of an action."""
    action_id: str
    action_type: ActionType
    side_effect: bool = False
    params: Dict[str, Any] = field(default_factory=dict)
    result: Optional[Dict[str, Any]] = None
    executed: bool = False

@dataclass
class State:
    """Definition of a state."""
    state_id: str
    state_type: StateType
    actions: List[Action] = field(default_factory=list)
    variables: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Transition:
    """Definition of a transition."""
    from_state: str
    to_state: str
    condition: Optional[Callable[[Dict[str, Any]], bool]] = None
    on_transition: List[Callable] = field(default_factory=list)
```

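**Design note**: Keeping ActionType and StateType as separate enums makes the code more type-safe. Every Action carries a side-effect flag, which is essential for supporting operations that require human approval.

### Part 2: The Workflow Execution Engine

The execution engine is the core of the state machine. It is responsible for the logic of state transitions and action execution: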

```python
class WorkflowExecutor:
    """Workflow execution engine."""

    def __init__(self):
        self.states: Dict[str, State] = {}
        self.transitions: List[Transition] = []
        self.current_state: Optional[str] = None
        self.execution_history: List[Dict] = []
        self.context: Dict[str, Any] = {}
        self.execution_paused = False
        self.pending_approvals: Dict[str, Action] = {}

    def add_state(self, state: State) -> None:
        """Register a state."""
        self.states[state.state_id] = state

    def add_transition(self, transition: Transition) -> None:
        """Register a transition."""
        self.transitions.append(transition)

    def initialize(self, initial_state: str, context: Dict[str, Any]) -> None:
        """Initialize the workflow."""
        if initial_state not in self.states:
            raise ValueError(f"Initial state {initial_state} does not exist")
        self.current_state = initial_state
        self.context = context
        self.execution_history = []
        self.pending_approvals = {}
        self.execution_paused = False

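    def evaluate_condition(self, condition: Callable) -> bool:
        """Evaluate a transition condition; a failing condition counts as False."""
        try:
            return condition(self.context)
        except Exception as e:
            print(f"Condition evaluation failed: {e}")
            return False

    def find_next_state(self) -> Optional[str]:
        """Find the next state from the current state and the transition conditions."""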
        applicable_transitions = [
            t for t in self.transitions if t.from_state == self.current_state
        ]
        for transition in applicable_transitions:
            if transition.condition is None or self.evaluate_condition(transition.condition):
                return transition.to_state
        return None
```

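**Design note**: `find_next_state` takes the first transition whose condition holds, so the order in which transitions are registered matters. A real application could add priorities or a more sophisticated selection strategy.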
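One way to make selection independent of registration order is an explicit priority. The following is a standalone sketch under that assumption: the `priority` field and the `PrioritizedTransition` class are hypothetical additions, not part of the `Transition` dataclass defined in Part 1.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class PrioritizedTransition:
    from_state: str
    to_state: str
    priority: int = 0  # hypothetical field: higher values are evaluated first
    condition: Optional[Callable[[Dict[str, Any]], bool]] = None


def find_next_state(
    transitions: List[PrioritizedTransition],
    current_state: str,
    context: Dict[str, Any],
) -> Optional[str]:
    """Pick the highest-priority transition whose condition holds."""
    candidates = [t for t in transitions if t.from_state == current_state]
    # sorted() is stable, so equal priorities keep their registration order.
    for transition in sorted(candidates, key=lambda t: t.priority, reverse=True):
        if transition.condition is None or transition.condition(context):
            return transition.to_state
    return None


transitions = [
    PrioritizedTransition("triage", "fallback"),  # unconditional, priority 0
    PrioritizedTransition(
        "triage", "escalate", priority=10,
        condition=lambda ctx: bool(ctx.get("urgent")),
    ),
]

print(find_next_state(transitions, "triage", {"urgent": True}))   # escalate
print(find_next_state(transitions, "triage", {"urgent": False}))  # fallback
```

Here the unconditional fallback is registered first but is still evaluated last, because its priority is lower.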

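### Part 3: Action Execution and Approval Handling

This part handles the execution of individual actions, including pausing on side effects and human approval: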

```python
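from datetime import datetime
from typing import Any, Dict


class WorkflowExecutor(WorkflowExecutor):
    # Extends the WorkflowExecutor from Part 2 (paste this after it), so __init__,
    # state registration, and transition selection are inherited; only the
    # execution/approval methods are shown here.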

    def execute_action(self, action: Action) -> bool:
        """Execute a single action; return False if it did not run."""
        if action.executed:
            return True

        if action.action_type == ActionType.TOOL_CALL:
            if action.side_effect:
                self.pending_approvals[action.action_id] = action
                self.execution_paused = True
                print(f"[approval] Action {action.action_id} requires approval")
                return False
            else:
                action.result = self._simulate_tool_call(action.params)
                action.executed = True
                return True

        elif action.action_type == ActionType.CONDITIONAL:
            result = self.evaluate_condition(action.params.get("condition"))
            action.result = {"value": result}
            action.executed = True
            return True

        elif action.action_type == ActionType.SET_VARIABLE:
            var_name = action.params.get("name")
            var_value = action.params.get("value")
            self.context[var_name] = var_value
            action.executed = True
            return True

        elif action.action_type == ActionType.LOG:
            message = action.params.get("message", "")
            print(f"[log] {message}")
            action.executed = True
            return True
        return False

    def _simulate_tool_call(self, params: Dict[str, Any]) -> Dict[str, Any]:
        """Simulate a tool call."""
        return {"success": True, "output": f"Executed with {params}"}

    def execute_state(self) -> bool:
        """Execute all actions of the current state; False means execution paused."""
        if self.current_state is None:
            return False
        state = self.states[self.current_state]
        print(f"\n[entering state] {self.current_state}")
        for action in state.actions:
            if not self.execute_action(action):
                if self.execution_paused:
                    return False
        return True

    def approve_action(self, action_id: str) -> bool:
        """Approve a pending side-effecting action and execute it."""
        if action_id not in self.pending_approvals:
            return False
        action = self.pending_approvals[action_id]
        action.result = self._simulate_tool_call(action.params)
        action.executed = True
        del self.pending_approvals[action_id]
        if not self.pending_approvals:
            self.execution_paused = False
        return True

    def step(self) -> bool:
        """Run one step: execute the current state, then move to the next state."""
        if self.current_state is None:
            return False
        if not self.execute_state():
            return False
        self.execution_history.append({
            "timestamp": datetime.now().isoformat(),
            "state": self.current_state,
            "context": dict(self.context),
        })
        next_state = self.find_next_state()
        if next_state is None:
            return False
        self.current_state = next_state
        if self.states[self.current_state].state_type == StateType.FINAL:
            return False
        return True

    def run(self) -> Dict[str, Any]:
        """Run the workflow until it finishes, pauses, or cannot proceed."""
        max_iterations = 100
        iterations = 0
        while iterations < max_iterations:
            if self.execution_paused:
                break
            if not self.step():
                break
            iterations += 1
        return {
            "final_state": self.current_state,
            "context": self.context,
            "history": self.execution_history,
            "paused": self.execution_paused,
            "pending_approvals": list(self.pending_approvals.keys()),
        }

    def get_status(self) -> Dict[str, Any]:
        """Return the current workflow status."""
        return {
            "current_state": self.current_state,
            "is_paused": self.execution_paused,
            "pending_approvals": list(self.pending_approvals.keys()),
            "context": self.context,
            "history_length": len(self.execution_history),
        }
```

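**Design note**: `approve_action` lets an external party (such as a user or administrator) approve a pending action while the workflow is paused; this implementation has no corresponding reject path. `execute_action` must first skip actions that have already run. Otherwise, calling `run()` again after approval would put the same side-effecting action back into the pending queue, and the workflow could never move on to the next state.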
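The executor above has no method for rejecting a pending action. A standalone sketch of how approval and rejection could share one gate follows; `ApprovalGate`, its method names, and the `approved` / `rejected` / `rejection_reason` context keys are assumptions made for this illustration, not part of the executor.

```python
from typing import Any, Dict


class ApprovalGate:
    """Tracks side-effecting actions that are waiting for a human decision."""

    def __init__(self, context: Dict[str, Any]) -> None:
        self.context = context  # shared with the workflow's transition conditions
        self.pending: Dict[str, Dict[str, Any]] = {}

    def request(self, action_id: str, params: Dict[str, Any]) -> None:
        """Queue an action for review."""
        self.pending[action_id] = params

    def approve(self, action_id: str) -> bool:
        """Approve a pending action; False if the ID is unknown."""
        if self.pending.pop(action_id, None) is None:
            return False
        self.context["approved"] = True
        return True

    def reject(self, action_id: str, reason: str = "") -> bool:
        """Reject a pending action without running it; False if the ID is unknown."""
        if self.pending.pop(action_id, None) is None:
            return False
        self.context["rejected"] = True
        self.context["rejection_reason"] = reason
        return True

    @property
    def paused(self) -> bool:
        """The workflow stays paused while any decision is outstanding."""
        return bool(self.pending)
```

Recording the decision in the shared context means a transition such as `reviewing → rejected` can be expressed as an ordinary condition (`lambda ctx: ctx.get("rejected", False)`) instead of special-case code in the engine.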

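### Part 4: Usage Example

The following is a complete approval workflow example that shows how to build and run the state machine. It first defines the states, then the transitions, and finally runs the workflow: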

```python
# Build the approval workflow
executor = WorkflowExecutor()

# Define the states
executor.add_state(State(
    state_id="submitted",
    state_type=StateType.INITIAL,
    actions=[Action(action_id="validate_input", action_type=ActionType.TOOL_CALL)],
))
executor.add_state(State(
    state_id="reviewing",
    state_type=StateType.NORMAL,
    actions=[Action(
        action_id="apply_changes",
        action_type=ActionType.TOOL_CALL,
        side_effect=True,  # requires human approval
        params={"target": "production"},
    )],
))
executor.add_state(State(state_id="approved", state_type=StateType.FINAL))
executor.add_state(State(state_id="rejected", state_type=StateType.FINAL))

# Define the transitions
executor.add_transition(Transition(
    from_state="submitted",
    to_state="reviewing",
    condition=lambda ctx: ctx.get("valid", False),
))
executor.add_transition(Transition(
    from_state="submitted",
    to_state="rejected",
    condition=lambda ctx: not ctx.get("valid", False),
))
executor.add_transition(Transition(
    from_state="reviewing",
    to_state="approved",
    condition=lambda ctx: ctx.get("approved", False),
))

# Initialize and run
executor.initialize("submitted", context={"valid": True})
result = executor.run()
print(result)
# {"final_state": "reviewing", "paused": True, "pending_approvals": ["apply_changes"], ...}

# Simulate human approval
executor.context["approved"] = True
executor.approve_action("apply_changes")
result = executor.run()
print(result)
# {"final_state": "approved", "paused": False, ...}
```

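## 8.2.6 Error Handling and Recovery

### Error Handling Strategies

A workflow can fail at several stages:

1. **Validation errors**: the input data is invalid
2. **Execution errors**: a tool call fails
3. **Timeout errors**: task execution times out
4. **Approval rejections**: human approval is denied

Handlers can be declared per state and per error type: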
```yaml
error_handlers:
  - on_state: validate
    error_type: validation_error
    action: notify_requester
    fallback_state: rejected

  - on_state: execute
    error_type: tool_failure
    action: retry_with_backoff
    max_retries: 3
    fallback_state: manual_intervention

  - on_state: "*"
    error_type: timeout
    action: kill_workflow
    fallback_state: error_state
```
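How an engine picks a handler from a list like this is not specified above. A plausible rule, assumed here, is that a handler for the exact state wins over a wildcard (`"*"`) handler for the same error type. `resolve_handler` is a hypothetical helper, and the handler list simply mirrors the YAML.

```python
from typing import Any, Dict, List, Optional

# Mirrors the YAML error_handlers above as plain dictionaries.
HANDLERS: List[Dict[str, Any]] = [
    {"on_state": "validate", "error_type": "validation_error",
     "action": "notify_requester", "fallback_state": "rejected"},
    {"on_state": "execute", "error_type": "tool_failure",
     "action": "retry_with_backoff", "max_retries": 3,
     "fallback_state": "manual_intervention"},
    {"on_state": "*", "error_type": "timeout",
     "action": "kill_workflow", "fallback_state": "error_state"},
]


def resolve_handler(
    handlers: List[Dict[str, Any]], state: str, error_type: str
) -> Optional[Dict[str, Any]]:
    """Prefer a handler for the exact state; fall back to a wildcard handler."""
    exact: Optional[Dict[str, Any]] = None
    wildcard: Optional[Dict[str, Any]] = None
    for handler in handlers:
        if handler["error_type"] != error_type:
            continue
        if handler["on_state"] == state and exact is None:
            exact = handler
        elif handler["on_state"] == "*" and wildcard is None:
            wildcard = handler
    return exact if exact is not None else wildcard


print(resolve_handler(HANDLERS, "execute", "tool_failure")["fallback_state"])
# manual_intervention
print(resolve_handler(HANDLERS, "validate", "timeout")["action"])
# kill_workflow
```

An error with no matching handler returns `None`, which leaves the caller to decide between failing the workflow outright and routing to a global error state.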

### Checkpoints and Recovery

By saving its execution state, a workflow can resume from the point of interruption:

```python
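import hashlib
import json
from datetime import datetime

# In-memory stand-in for persistent storage (an assumption for this example);
# replace it with a file or database in practice.
_CHECKPOINT_STORE: dict = {}

# The two functions below are methods of WorkflowExecutor. Pending approvals and
# per-action `executed` flags are not captured by this checkpoint.
def save_checkpoint(self) -> str:
    """Save a checkpoint and return its ID."""
    checkpoint = {
        "current_state": self.current_state,
        "context": self.context,
        "execution_history": self.execution_history,
        "timestamp": datetime.now().isoformat(),
    }
    # Assumes the context is JSON-serializable; default=str is a lossy fallback.
    payload = json.dumps(checkpoint, sort_keys=True, default=str)
    checkpoint_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    _CHECKPOINT_STORE[checkpoint_id] = payload
    return checkpoint_id

def restore_from_checkpoint(self, checkpoint_id: str) -> bool:
    """Restore from a checkpoint; return False if it does not exist."""
    payload = _CHECKPOINT_STORE.get(checkpoint_id)
    if payload is None:
        return False
    checkpoint = json.loads(payload)

    self.current_state = checkpoint["current_state"]
    self.context = checkpoint["context"]
    self.execution_history = checkpoint["execution_history"]
    return True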
```

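## 8.2.7 Code as Workflow: Deep Agents' Interpreter Skills

The preceding sections showed two ways of injecting determinism into a workflow: Claude Code's Task state machine defined in code (8.2.2), and OpenClaw's Lobster engine with declarative YAML (8.2.3). Both **fix the execution path in advance**. The mainstream design of modern agent harnesses is the opposite: let the model decide the next step from the current context. This raises a recurring engineering question:

> How do you preserve the model's autonomy while ensuring that a critical part of the process is **always executed in the prescribed way**?

One experimental answer, from LangChain in Deep Agents, is [**Interpreter Skills**](https://www.langchain.com/blog/interpreter-skills). It combines the Agent Skills standard (`SKILL.md`; for a systematic introduction to the open specification see section 4.4 of 《智能体 AI 权威指南》, and it is also touched on in sections 1.3 and 10.2 of this book) with an **interpreter**: a TypeScript runtime that runs alongside the harness, in which the agent can write and execute code directly. The harness still controls what that code can reach: files, network, tools, and sub-agents all require explicit authorization, which gives the harness a single place to allow, meter, and review access.

An Interpreter Skill has two layers, instructions plus a module:

* `SKILL.md` declares the skill's metadata in frontmatter and uses `metadata.module` to point to a code module, telling the model **when** to use the behavior;
* `index.ts` exports workflows or helper functions that define **how** the behavior executes.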

```markdown
---
name: github-triage
description: Use this skill to triage GitHub issues, pull requests, and discussions.
metadata:
  module: ./index.ts
---

Use this skill when the user asks for repository triage.
Import the module through the interpreter and call `triage(repo, options)`.
```

When the skill applies, the model decides when to call it and with which arguments, and the interpreter executes the module code:

```ts
const { triage } = await import("@/skills/github-triage");

const result = await triage("langchain-ai/deepagents", {
  issues: true,
  prs: true,
  discussions: true,
});

result.toMarkdown();   // returns a structured result that can be rendered as a model-friendly report
```

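### An Example of an Orchestration Workflow

Taking repository triage as the example, `triage()` is internally a fixed pipeline:

1. Fetch all open issues / PRs / discussions from GitHub according to the arguments;
2. Spawn a sub-agent for **each item** to produce a compressed description;
3. Put the sub-agents' results into a queue;
4. Consume the queue one item at a time, with a sub-agent deciding whether the item belongs to an existing cluster or starts a new one.

This is exactly the multi-agent orchestration that 8.3 will develop, except that here it is packaged as code that is **reusable, testable, and versionable** rather than as step-by-step instructions scattered through a prompt. And because it runs inside the interpreter rather than in an external script, the module can take part in the harness loop directly: spawning sub-agents, scheduling and awaiting a task graph, handling partial failures, and handing control back to the model only after everything has finished.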
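The shape of that pipeline can be sketched without any agent framework. The real module is TypeScript running in the Deep Agents interpreter; the version below uses Python only to stay consistent with the rest of this section, and both `summarize` and `assign_cluster` are trivial stand-ins for sub-agents, invented for this illustration.

```python
import asyncio
from collections import deque
from typing import Any, Deque, Dict, List


async def summarize(item: Dict[str, str]) -> Dict[str, str]:
    """Stand-in for a sub-agent that compresses one item (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real sub-agent call would
    if not item["title"]:
        raise ValueError("nothing to summarize")
    return {"id": item["id"], "summary": item["title"].lower()}


def assign_cluster(summary: Dict[str, str]) -> str:
    """Stand-in for the clustering sub-agent: group by the first word."""
    return summary["summary"].split()[0]


async def triage(items: List[Dict[str, str]]) -> Dict[str, Any]:
    # Step 2: one sub-agent per item, run concurrently; failures are captured.
    results = await asyncio.gather(
        *(summarize(item) for item in items), return_exceptions=True
    )
    # Step 3: queue the successful results and record the partial failures.
    queue: Deque[Dict[str, str]] = deque()
    failed: List[str] = []
    for item, result in zip(items, results):
        if isinstance(result, Exception):
            failed.append(item["id"])
        else:
            queue.append(result)
    # Step 4: consume the queue one item at a time and cluster.
    clusters: Dict[str, List[str]] = {}
    while queue:
        summary = queue.popleft()
        clusters.setdefault(assign_cluster(summary), []).append(summary["id"])
    return {"clusters": clusters, "failed": failed}


items = [
    {"id": "1", "title": "Crash on startup"},
    {"id": "2", "title": "crash when saving"},
    {"id": "3", "title": "Docs typo"},
    {"id": "4", "title": ""},
]
print(asyncio.run(triage(items)))
# {'clusters': {'crash': ['1', '2'], 'docs': ['3']}, 'failed': ['4']}
```

The caller sees a single call and a single structured result; the fan-out, the queue, and the handling of the one failed item all stay inside the routine.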

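### Why a "Skill" Rather Than a Script or Tool?

| Packaging             | Good for                                                                                                       | Limitations                                                                                                                                            |
| --------------------- | -------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Tool**              | Crossing external boundaries: fetching data, reading and writing files, creating tickets, calling a classifier | Turning every parse / join / filter into a tool **inflates the action surface**: the model must make more model-mediated decisions among more small actions |
| **Script**            | One-off external computation that communicates through command-line arguments / stdout                         | Cannot naturally take part in the harness loop (spawning sub-agents, scheduling a task graph, handling partial failures)                               |
| **Interpreter Skill** | Packaging a "known to work" deterministic sub-process as code, with the model deciding when to call it         | Still depends on the model correctly judging when it applies; currently an experimental feature specific to Deep Agents                                |

Putting determinism into a code module also improves **stability over long processes** and **evaluability**:

* **Easing context anxiety**: repository triage is not one decision but a chain of hundreds of small ones. If the whole thing rests on the prompt and the context window, the model tends to over-compress the process and take shortcuts as it nears the edge of its working context, which is the "context anxiety" discussed in sections 8.3.4 and 2.2 of this book. Once the process is handed to code, the model only needs to call the routine once; the code creates the hundreds of subtasks, collects the results, and classifies them, so the model no longer carries a working-memory burden for every intermediate step.
* **Clearer evaluation signals**: for a pure-prompt process you can only ask "did the model broadly follow the instructions?"; for an Interpreter Skill you can ask more specific questions: **was the expected function called? Were the expected arguments passed? Does the returned structure match expectations?**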
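Those three questions translate directly into assertions over a recorded call. The sketch below assumes a hypothetical trace format (a dictionary with `function`, `args`, and `result` keys) and hypothetical result fields; Deep Agents' actual trace schema may differ.

```python
from typing import Any, Dict, Set


def evaluate_skill_call(
    trace: Dict[str, Any],
    expected_function: str,
    expected_args: Dict[str, Any],
    required_result_keys: Set[str],
) -> Dict[str, bool]:
    """Answer the three evaluation questions for one recorded call."""
    args = trace.get("args", {})
    result = trace.get("result", {})
    return {
        "called_expected_function": trace.get("function") == expected_function,
        "passed_expected_args": all(
            args.get(name) == value for name, value in expected_args.items()
        ),
        "result_has_expected_shape": required_result_keys.issubset(result),
    }


# Hypothetical trace of one interpreter call.
trace = {
    "function": "triage",
    "args": {"repo": "langchain-ai/deepagents", "issues": True, "prs": True},
    "result": {"clusters": [], "failed": []},
}

report = evaluate_skill_call(
    trace,
    expected_function="triage",
    expected_args={"repo": "langchain-ai/deepagents", "issues": True},
    required_result_keys={"clusters", "failed"},
)
print(report)
# {'called_expected_function': True, 'passed_expected_args': True, 'result_has_expected_shape': True}
```

Each check is a boolean that can be aggregated across many runs, which is a sharper signal than grading free-form adherence to a prompt.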

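### Engineering Takeaway: Distill the "Best Known Routine" into a Library

This section has now presented three ways of injecting determinism into a workflow, which can be compared side by side:

| Approach           | Source of determinism         | Form                   | Representative system |
| ------------------ | ----------------------------- | ---------------------- | --------------------- |
| Task workflow      | State machine defined in code | Python code            | Claude Code (8.2.2)   |
| Lobster engine     | Declarative FSM               | YAML configuration     | OpenClaw (8.2.3)      |
| Interpreter Skills | Reusable code module          | TS module + `SKILL.md` | LangChain Deep Agents |

Deep Agents is one of the harness reference systems established in section 1.4 of this book. The design orientation of Interpreter Skills can be summed up in one line: **"discretion on the outside, determinism on the inside"**. The model judges whether and where to use a routine; the code decides exactly how it runs.

> **Note**: Interpreter Skills are an **experimental extension** proposed by LangChain in Deep Agents (the original post explicitly says "We're experimenting with"), not an industry standard. They are a parallel design choice alongside the state machine approaches of 8.2.2 / 8.2.3, not a replacement: when a process is naturally a finite set of state transitions, a declarative FSM is usually more intuitive; when it looks more like "a subroutine that needs to complete atomically", a code module fits better.

## 8.2.8 Section Summary

State machines offer an elegant way to describe workflow behavior. OpenClaw's Lobster engine, with declarative YAML definitions and deterministic execution, makes complex workflows concise and reliable to write and maintain; combined with an execution engine implemented in Python, it can handle conditional branching, loops, retries, and error recovery. LangChain Deep Agents' Interpreter Skills represent a different path: packaging determinism as reusable code modules and letting the model decide when to call them, which gives "discretion on the outside, determinism on the inside". The next section explores how multiple agents coordinate to carry out more complex tasks.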