The Layer That Turns a Model Into a Product: Engineering Orkas's Agent Harness

How Orkas turns model calls into a reliable desktop agent runtime: streaming run loops, tool routing, context compaction, provider abstraction, memory, and crash-safe sessions.

Anyone who has shipped an agent product knows the feeling: getting a demo working is fast, but turning it into something a user trusts every day — something that runs all day long on their own machine without falling over — is hard. And the hard part isn't wiring up the model. It's the whole layer that sits around the model.

That layer goes by a few names; the one I'll use is the Agent Harness. It sits between the "big model" and the "product features," and it's the real runtime: it turns a single user request into round after round of conversation with the model, threading tool calls in between, feeding results back, compacting context before it overflows, retrying through network hiccups, and recovering the conversation even after the process crashes. The model does the thinking; the harness makes that thinking land as a reliable sequence of actions.

Orkas is a desktop agent app that runs on the user's own machine, and its harness lives entirely on the client. This article walks through how that layer is built: how it's split into layers, what the run loop looks like, how tools and models are abstracted, and how memory and sessions are handled. Code details have been scrubbed and generalized, but the engineering structure is real.

The layers

Lay an agent product out flat and you get roughly these layers, stacked bottom to top:

┌─────────────────────────────────────────┐
│  Product features  (chat / skills / connectors / sync)  │
├─────────────────────────────────────────┤
│  Agent Harness  (run loop / tools / session)  │
├─────────────────────────────────────────┤
│  Provider abstraction  (unify many LLM vendors)  │
├─────────────────────────────────────────┤
│  Infrastructure  (types / errors / logging / config)  │
└─────────────────────────────────────────┘

There's one consequential choice baked in here: all model inference happens on the client. The desktop app is not a thin client — it holds the harness itself and calls the model directly. The server only handles accounts, multi-device sync, and billing; it doesn't run the agent at all. This decision shaped almost everything downstream: sessions land on the local disk, tools operate directly on the user's working directory, and sensitive data never leaves the machine.

The harness itself breaks into a few pieces: the run loop (runner), the session, the tools, the Provider layer, and memory. Let's take them one at a time.

The run loop: a streaming generator

The heart of the harness is the runner. In one sentence, what it does is: talk to the model over and over until the model says "I'm done."

It's implemented as an async generator, and that choice matters. A single agent run is far more than "send a request, wait for a result." A lot happens in between — the model is emitting tokens, it wants to call a tool, the tool finished, the context got long enough to trigger compaction, the network failed and we're retrying. With callbacks or plain Promises, those intermediate states are hard to surface cleanly to the caller. As a generator, they all become a stream of yield-ed events:

type AgentRunEvent =
  | { type: "text_delta"; text: string }              // model emitting tokens
  | { type: "tool_start"; name: string; input: unknown } // a tool starts executing
  | { type: "tool_end"; name: string; result: string }   // a tool finished
  | { type: "compaction"; tokensBefore: number; tokensAfter: number } // context compacted
  | { type: "retry"; attempt: number; reason: string }   // error, retrying
  | { type: "done"; result: AgentRunResult }             // terminal

The UI subscribes to this event stream and paints the model's output and the tool execution in real time. The non-streaming entry point is internally just "consume the stream to the end, take the final done" — both entry points share one implementation, so there's no second code path to drift out of sync.
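As a minimal sketch of that shared implementation (not Orkas's actual code), the non-streaming entry point can simply drain the generator. The names `runStream` and `run`, and the trimmed-down event types, are assumptions made for illustration.

```typescript
// Hypothetical, trimmed-down versions of the event and result types.
type AgentRunResult = { text: string };
type AgentRunEvent =
  | { type: "text_delta"; text: string }
  | { type: "done"; result: AgentRunResult };

// Stand-in for the streaming runner: emits deltas, then one terminal event.
async function* runStream(prompt: string): AsyncGenerator<AgentRunEvent> {
  let text = "";
  for (const word of ["echo:", prompt]) {
    const delta = text === "" ? word : " " + word;
    text += delta;
    yield { type: "text_delta", text: delta };
  }
  yield { type: "done", result: { text } };
}

// Non-streaming entry point: consume the same stream, keep only "done".
async function run(prompt: string): Promise<AgentRunResult> {
  for await (const event of runStream(prompt)) {
    if (event.type === "done") return event.result;
  }
  throw new Error("stream ended without a done event");
}
```

A UI would iterate `runStream` directly and render each event; a batch caller awaits `run` and never sees the intermediate events.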

What happens inside a turn

Unrolled, one turn looks roughly like this:

  1. Push the user message (possibly with images) into the session history.
  2. Assemble the system prompt, injecting the currently available tools, skill index, and so on.
  3. Parse the model string and resolve it to a concrete Provider and model ID.
  4. Convert all tools into definitions the model understands, and send them out together with the history.
  5. Consume the model's response stream, yield-ing text token by token while collecting any tool calls the model makes.
  6. When the stream ends, look at the model's stop reason:
     • If it's tool_use, the model wants to call a tool: run the tools, then go back to step 5 and ask the model again.
     • Otherwise the turn is over: assemble the result, yield done, and return.

There's one invariant you must hold here: every tool call the model makes must be immediately followed in the history by a matching tool result. The model API enforces this pairing hard — break it and the next request will either error out or just hang. We'll come back to this when we talk about session self-healing.
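The loop and the pairing rule can be sketched together. This is an illustration under assumptions, not the real runner: the message shapes, the `askModel` and `runTool` callbacks, and the `"end_turn"` stop-reason name are all hypothetical, and streaming is left out to keep the control flow visible.

```typescript
type ToolCall = { id: string; name: string; input: unknown };
type ModelReply = {
  text: string;
  toolCalls: ToolCall[];
  stopReason: "tool_use" | "end_turn"; // "end_turn" is an assumed name
};
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls: ToolCall[] }
  | { role: "tool"; callId: string; content: string };

async function runTurn(
  history: Message[],
  userText: string,
  askModel: (history: Message[]) => Promise<ModelReply>,
  runTool: (call: ToolCall) => Promise<string>,
): Promise<string> {
  history.push({ role: "user", text: userText });
  for (;;) {
    const reply = await askModel(history);
    history.push({ role: "assistant", text: reply.text, toolCalls: reply.toolCalls });
    if (reply.stopReason !== "tool_use") return reply.text;
    // Pairing invariant: every call gets its result before the next request.
    for (const call of reply.toolCalls) {
      history.push({ role: "tool", callId: call.id, content: await runTool(call) });
    }
  }
}
```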

How a tool call gets routed back

The model doesn't execute tools itself; it only says "I'd like to call read_file with these arguments." Once the runner picks up that intent:

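for (const call of toolUseBlocks) {
  yield { type: "tool_start", name: call.name, input: call.input };

  const tool = this.tools.get(call.name);
  const ctx = { workingDir, signal, state: { sandboxEnv } };

  // An unknown tool or a thrown error still becomes a tool result, so the
  // call/result pairing holds and the model can react to the failure.
  // (`isError` is an assumed field on ToolResult.)
  let result: ToolResult;
  try {
    result = tool
      ? await tool.execute(call.input, ctx)
      : { content: `Unknown tool: ${call.name}`, isError: true };
  } catch (err) {
    result = { content: `Tool failed: ${String(err)}`, isError: true };
  }

  // append the result to the session as a tool-result message
  session.addToolResult(call.id, result);

  yield { type: "tool_end", name: call.name, result: result.content };
}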

Tools run sequentially, results are written back to the history in the order the model declared them, and then the model is asked again with those results in hand. Seeing the results, the model might call another tool or just give its final answer. This "ask → call → answer → ask again" loop is exactly what lets an agent complete multi-step tasks.

One detail deserves its own mention: some tools return images — screenshots, generated pictures. But many models don't accept images in the tool-result channel. Orkas handles this by splitting the image into a separate user message placed after the tool result — the model first reads "the tool returned this text," then on the very next turn sees the corresponding image. A small compromise that routes around capability differences between Providers.
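A rough sketch of that split, assuming a tool output carries text plus a list of base64-encoded images. The `ToolOutput` and `OutgoingMessage` shapes and the caption text are hypothetical.

```typescript
type ToolOutput = { text: string; images: string[] }; // images as base64 (assumed)
type OutgoingMessage =
  | { role: "tool"; callId: string; content: string }
  | { role: "user"; text: string; images: string[] };

// The tool result keeps only text; any images travel in a follow-up
// user message placed right after it.
function toolOutputToMessages(callId: string, output: ToolOutput): OutgoingMessage[] {
  const messages: OutgoingMessage[] = [
    { role: "tool", callId, content: output.text },
  ];
  if (output.images.length > 0) {
    messages.push({
      role: "user",
      text: `Image output of tool call ${callId}:`,
      images: output.images,
    });
  }
  return messages;
}
```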

What to do when context is about to overflow

The wall long tasks hit most often is the context window. Orkas doesn't wait until it's full — it sets a 60% watermark: after each round of tools, it estimates how much of the window the current tokens occupy, and once that passes 60% it proactively triggers compaction.

Compaction itself asks the model to summarize the earlier conversation, then replaces the old messages with that summary, keeping only the most recent tail. Sounds simple, but there's a trap: after the swap, the retained tail must not start with an "orphan tool result" — there can't be a "result with no matching call," or you've broken the pairing invariant again. So the compaction logic makes sure the cut lands on a clean boundary.
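One way to sketch the watermark check and the clean cut, assuming each message already carries a token estimate. `shouldCompact`, `findCutIndex`, and `keepLast` are illustrative names, and moving the cut earlier (rather than later) is just one reasonable choice.

```typescript
type Msg = { role: "user" | "assistant" | "tool"; tokens: number };

const WATERMARK = 0.6; // the 60% threshold

function shouldCompact(messages: Msg[], contextWindow: number): boolean {
  const used = messages.reduce((sum, m) => sum + m.tokens, 0);
  return used / contextWindow > WATERMARK;
}

// Index where the retained tail starts. Aim to keep `keepLast` messages,
// then move the cut earlier while the tail would begin with a tool result,
// so the tail starts at (or before) the assistant message that made the call.
function findCutIndex(messages: Msg[], keepLast: number): number {
  let cut = Math.max(0, messages.length - keepLast);
  while (cut > 0 && messages[cut]?.role === "tool") cut--;
  return cut;
}
```

Everything before the returned index is summarized; everything from it onward is kept verbatim.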

There's a more interesting choice worth unpacking here: why the coarse "summarize the whole block at 60%" approach, rather than something more fine-grained — scoring each message and trimming by importance, structured extraction from tool outputs, maintaining a layered memory tree? Those approaches look great in papers, but we deliberately didn't go that way, for three reasons.

First, caching. The model's prompt cache hits by prefix: as long as the history's prefix is unchanged, that span hits the cache, saving both money and latency. Fine-grained compaction constantly rewrites the middle of the history, which means repeatedly shattering the cached prefix — every edit forces a large re-prefill. The "leave it alone, then compact once at the watermark" strategy keeps the prefix stable across the vast majority of turns, with only that single compaction invalidating it. Far friendlier to the cache.

Second, complexity. That "every tool call must be paired" invariant we keep hammering on — the more finely you trim history, the more likely you break it in some corner. A coarse summary only has to protect one clean cut point; there are an order of magnitude fewer places to get it wrong. One fewer class of edge case is one fewer class of production incident.

Third, riding the dividend of better models. Context windows have grown steadily over the past couple of years, and models handle long context better and better. Pouring effort today into an elaborate compaction algorithm is essentially fighting a problem that's shrinking — odds are you finish tuning it just as the next generation doubles its window, and your complexity becomes pure liability. Conversely, handing the summarization to the model itself gets automatically better as the model improves: the better it is at picking out what matters, the higher the summary quality, and we don't change a line. Complexity the model can carry for you is complexity you shouldn't carry yourself.

Token estimation hides one easy-to-miss issue: Chinese. Estimate Chinese using English intuitions (roughly one token per few characters) and you'll badly undercount. Orkas weights CJK characters separately in its estimate; otherwise the watermark for an all-Chinese conversation reads wrong, and compaction won't fire when it should.
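A minimal estimator along these lines might look like the following. The ratios are assumptions chosen for illustration (about four characters per token for Latin-script text, about one token per CJK character), not Orkas's actual weights; real tokenizers vary by model.

```typescript
const CHARS_PER_TOKEN_LATIN = 4; // assumed ratio
const TOKENS_PER_CJK_CHAR = 1;   // assumed ratio

// Covers the common BMP ranges only: CJK ideographs (incl. Extension A),
// Hiragana/Katakana, and Hangul syllables.
function isCjk(code: number): boolean {
  return (
    (code >= 0x3400 && code <= 0x4dbf) ||
    (code >= 0x4e00 && code <= 0x9fff) ||
    (code >= 0x3040 && code <= 0x30ff) ||
    (code >= 0xac00 && code <= 0xd7af)
  );
}

function estimateTokens(text: string): number {
  let cjk = 0;
  let other = 0;
  for (let i = 0; i < text.length; i++) {
    if (isCjk(text.charCodeAt(i))) cjk++;
    else other++;
  }
  return Math.ceil(cjk * TOKENS_PER_CJK_CHAR + other / CHARS_PER_TOKEN_LATIN);
}
```

With these weights a four-character Chinese string counts as four tokens, where an English-only heuristic would have counted one.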

Errors and retries

When you run on the user's machine and depend on an external model API, errors are the norm, not the exception. The runner sorts them into a few classes and treats each differently:

  • Retryable: rate limits, timeouts, dropped connections, 5xx. Exponential backoff with jitter, capped at 30 seconds; if it's a rate limit and the server sent retry-after, honor it.
  • Non-retryable: things like auth failures — no number of retries will help, so error out immediately.
  • Special: context overflow. Try compaction first, retry once after, and only error out if that still fails.
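The classification and delay logic can be sketched as follows. The `ApiError` shape, the error code strings, the one-second base delay, and the use of "full jitter" are assumptions; only the 30-second cap and the retry-after rule come from the list above.

```typescript
type ErrorClass = "retryable" | "fatal" | "context_overflow";

// Hypothetical normalized error; real provider SDKs expose this differently.
type ApiError = { status?: number; code?: string; retryAfterMs?: number };

function classify(err: ApiError): ErrorClass {
  if (err.code === "context_overflow") return "context_overflow";
  if (err.status === 401 || err.status === 403) return "fatal";
  if (err.status === 429) return "retryable";
  if (err.status !== undefined && err.status >= 500) return "retryable";
  if (err.code === "timeout" || err.code === "connection_reset") return "retryable";
  return "fatal";
}

const BASE_DELAY_MS = 1000;  // assumed starting delay
const MAX_DELAY_MS = 30000;  // the 30-second cap

// `attempt` is 0 for the first retry. `random` is injectable for testing.
function retryDelayMs(
  attempt: number,
  err: ApiError,
  random: () => number = Math.random,
): number {
  // A server-provided retry-after on a rate limit wins over our own schedule.
  if (err.status === 429 && err.retryAfterMs !== undefined) return err.retryAfterMs;
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt));
  // Full jitter: pick uniformly between 0 and the exponential ceiling.
  return Math.floor(random() * ceiling);
}
```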

There's one more class: "the tool itself failed." This doesn't sink the whole turn — a tool failure is itself information for the model, which, seeing "that command errored," can perfectly well try a different approach. The harness distinguishes these transient tool errors from real faults: it neither interrupts the flow nor loses them — they show up in after-the-fact stats. (That data later feeds the self-evolution mechanism, which is the subject of the next article.)

The external cancellation signal (AbortSignal) is checked at every key point. The user hits "stop," and the current turn halts immediately — no new retries get kicked off.

Tool abstraction: simple enough to extend

The tool interface is deliberately thin:

interface AgentTool {
  readonly name: string;
  readonly description: string;          // shown to the model
  readonly inputSchema: Record<string, unknown>;  // JSON Schema to constrain inputs
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

A tool is just "a name + a description for the model + an input schema + an execute function." The built-ins — read file, write file, run a shell command, web search and fetch — all implement this interface. The desktop layer stacks a batch of locally-flavored tools on top (knowledge-base search, image generation, calling external connectors), but the interface is the same one.

The payoff of a thin interface is that where a tool comes from is irrelevant to the runner: built-in, user-defined, or loaded from a skill — they're all the same kind of thing, registered into one Map<string, AgentTool> and converted into model-readable definitions each turn.
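A sketch of such a registry, assuming a simplified `ToolContext` and `ToolResult` and a generic `ToolDefinition` wire shape (each vendor names these fields differently, so the Provider layer would adapt them).

```typescript
type ToolResult = { content: string };
type ToolContext = { workingDir: string }; // simplified for illustration

interface AgentTool {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: Record<string, unknown>;
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

// Assumed vendor-neutral definition shape.
type ToolDefinition = {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
};

class ToolRegistry {
  private readonly tools = new Map<string, AgentTool>();

  register(tool: AgentTool): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  get(name: string): AgentTool | undefined {
    return this.tools.get(name);
  }

  // Built-in, user-defined, and skill-loaded tools all come out the same way.
  toDefinitions(): ToolDefinition[] {
    return Array.from(this.tools.values()).map((t) => ({
      name: t.name,
      description: t.description,
      parameters: t.inputSchema,
    }));
  }
}
```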

Side-effecting tools like shell commands go through an isolated executor: timeouts, output-length limits, a command blocklist, and environment variables passed in separately rather than mutating the process's global environment — the latter would leak into a swarm of child processes and, in a multi-process architecture like Electron, can easily break startup.

The Provider layer: flattening many models into one interface

Users' model preferences are all over the map, and a product can't hard-bind to one vendor. Below the harness, Orkas lays down a Provider abstraction that unifies different vendors' models behind a single interface:

interface LLMProvider {
  readonly id: string;
  complete(params: CompletionParams): Promise<CompletionResult>;
  stream(params: CompletionParams): AsyncIterable<StreamEvent>;
  validateAuth(): Promise<boolean>;
}

The runner above only ever talks to this interface; it has no idea which vendor is behind it. A registry handles routing by model string: an explicit provider/model form is split directly; a bare model name is attributed by prefix. Auth (an API key or OAuth token) is managed here too, and an expired OAuth token is refreshed automatically.
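The routing rule can be sketched like this. The prefix table is entirely hypothetical; the real registry's provider IDs and prefixes are not given in this article.

```typescript
type ResolvedModel = { providerId: string; modelId: string };

// Hypothetical prefix table, for illustration only.
const PREFIX_TO_PROVIDER: Array<[string, string]> = [
  ["claude-", "anthropic"],
  ["gpt-", "openai"],
  ["gemini-", "google"],
];

function resolveModel(model: string): ResolvedModel {
  // Explicit "provider/model": split on the first slash.
  const slash = model.indexOf("/");
  if (slash > 0) {
    return { providerId: model.slice(0, slash), modelId: model.slice(slash + 1) };
  }
  // Bare model name: attribute by prefix.
  for (const [prefix, providerId] of PREFIX_TO_PROVIDER) {
    if (model.startsWith(prefix)) return { providerId, modelId: model };
  }
  throw new Error(`cannot infer provider for model "${model}"`);
}
```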

When you flatten many models into one interface, the real headache isn't text completion; it's the corners where vendors' semantics disagree. Here are two examples that bit us.

One is preserving thinking blocks across vendors. Reasoning models emit a span of "thinking" content; some vendors encrypt it and require you to echo it back verbatim, others represent it with a different set of fields. If a user switches from vendor A to vendor B mid-conversation, the signature on that thinking span in the history no longer matches. The fix is to stamp every message in the history with "which model produced this," so the transform layer can decide whether to keep it verbatim: same model, keep it; different model, downgrade per the rules.
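A sketch of the stamp-and-decide step. The article does not spell out the downgrade rules, so dropping foreign thinking blocks is an assumption here, as are the `producedBy` field name and the block shapes.

```typescript
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; text: string; signature?: string };

type StampedMessage = {
  role: "user" | "assistant";
  producedBy?: string; // model that generated this message (assumed field name)
  blocks: Block[];
};

// Same model: keep thinking blocks verbatim. Different model: downgrade,
// which in this sketch means dropping them (one possible rule among several).
function prepareForModel(history: StampedMessage[], targetModel: string): StampedMessage[] {
  return history.map((msg) => {
    if (msg.role !== "assistant" || msg.producedBy === targetModel) return msg;
    return { ...msg, blocks: msg.blocks.filter((b) => b.type !== "thinking") };
  });
}
```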

The other is the prompt cache. Across turns of one session the prefix is highly repetitive, and caching it saves meaningful cost and latency. The implementation passes the session ID as the cache key to vendors that support it, handling each vendor's key-length limits along the way (truncate or hash if it's too long, say).
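For the key-length handling, a minimal sketch using Node's built-in `crypto` module. The function name and the choice of SHA-256 are assumptions; the limit itself is vendor-specific and passed in.

```typescript
import { createHash } from "node:crypto";

// Use the session ID as-is when it fits; otherwise hash it so the key stays
// stable per session while respecting the vendor's length limit.
function cacheKeyFor(sessionId: string, maxLength: number): string {
  if (sessionId.length <= maxLength) return sessionId;
  return createHash("sha256").update(sessionId).digest("hex").slice(0, maxLength);
}
```

Hashing rather than plain truncation avoids collisions between long session IDs that share a common prefix.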

These are all grunt work — but it's precisely this layer of grunt work that lets the runner above pretend "there's only one kind of model."

Memory: two mechanisms, each for its own job

"Memory" in Orkas is actually two parallel mechanisms solving two completely different problems. One is a retrieval-based knowledge base, for the bulk material you "go look up when needed." The other is cross-session memory, for the small set of key facts you "should always have in mind." Many products mash these two together; keeping them separate makes things much clearer.

Knowledge base: hybrid retrieval

The first mechanism targets content that's large but only occasionally relevant — the user's documents, past notes, domain knowledge. This is a local knowledge base with vector retrieval, in two backends: a lightweight pure-memory version (for tests and ephemeral use), and a version persisted to a local database (for production, with full-text indexing and vectors).

Data comes in along this path:

documents → chunk on line boundaries (with overlap) → dual indexing
                                          ├─ full-text index (keywords, no embedding cost)
                                          └─ vector index (if an embedding model is configured)
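The chunking step in that path can be sketched as below. `linesPerChunk` and `overlap` are illustrative parameters; the article does not state the actual sizes.

```typescript
type Chunk = { startLine: number; endLine: number; text: string }; // 1-based, inclusive

function chunkByLines(doc: string, linesPerChunk: number, overlap: number): Chunk[] {
  if (linesPerChunk <= 0 || overlap < 0 || overlap >= linesPerChunk) {
    throw new Error("need linesPerChunk > 0 and 0 <= overlap < linesPerChunk");
  }
  const lines = doc.split("\n");
  const step = linesPerChunk - overlap; // how far each chunk's start advances
  const chunks: Chunk[] = [];
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + linesPerChunk, lines.length);
    chunks.push({
      startLine: start + 1,
      endLine: end,
      text: lines.slice(start, end).join("\n"),
    });
    if (end === lines.length) break; // last chunk reached the end of the doc
  }
  return chunks;
}
```

Recording `startLine` on each chunk is what later makes deduplication by document and start line possible.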

Chunks are cut on line boundaries with a little overlap between them, to avoid slicing a complete piece of meaning down the middle. Retrieval is hybrid: a vector pass (semantically close) and a keyword pass (literal hits), with the two result sets merged via RRF (Reciprocal Rank Fusion):

score = Σ  1 / (k + rank_i)

The higher a result ranks within one pass, the more it contributes; summed across both passes, you both honor semantic relevance and don't lose exact literal matches. The vector and keyword weights are tunable, defaulting to favor semantics. After merging, results are deduplicated by "(document, start line)," keeping only the best one per location, then anything below a threshold is cut, returning the top-K.
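A sketch of the merge, under stated assumptions: `k = 60` is a commonly used RRF constant rather than a value from this article, the 0.6/0.4 default weights merely illustrate "favoring semantics," and keying the map by location both sums a chunk's contributions across passes and collapses duplicates into one entry.

```typescript
type Hit = { docId: string; startLine: number; text: string };
type Scored = Hit & { score: number };
type FuseOptions = { vectorWeight: number; keywordWeight: number; minScore: number; topK: number };

const RRF_K = 60; // assumed constant
const DEFAULTS: FuseOptions = { vectorWeight: 0.6, keywordWeight: 0.4, minScore: 0, topK: 5 };

// Both inputs are ranked best-first.
function fuse(vectorHits: Hit[], keywordHits: Hit[], opts: FuseOptions = DEFAULTS): Scored[] {
  const byLocation = new Map<string, Scored>();
  const add = (hits: Hit[], weight: number): void => {
    hits.forEach((hit, index) => {
      const key = `${hit.docId}:${hit.startLine}`;
      const contribution = weight / (RRF_K + index + 1); // rank is 1-based
      const existing = byLocation.get(key);
      if (existing) existing.score += contribution;
      else byLocation.set(key, { ...hit, score: contribution });
    });
  };
  add(vectorHits, opts.vectorWeight);
  add(keywordHits, opts.keywordWeight);
  return Array.from(byLocation.values())
    .filter((h) => h.score >= opts.minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, opts.topK);
}
```

A chunk found by both passes outranks one found by only a single pass at a similar rank, which is the behavior the formula is after.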

Why not rely on vectors alone? Because vector retrieval often face-plants on proper nouns, code symbols, and exact literal strings — queries that aren't semantically special but where the literal matters a lot — while keyword-only can't catch "same meaning, different phrasing." Running both is a very practical trade-off between retrieval quality and cost.

Cross-session memory: keeping the user in mind

The knowledge base solves "too much material to hold." But there's another class of thing — tiny in volume, yet it must stay in mind at all times: who this user is, what they prefer, what was agreed last time. These shouldn't depend on retrieval to "get lucky and recall" — they should be present every single turn.

For this, Orkas builds a separate layer of cross-session memory, split by content into two parts:

  • User profile: stable facts about the person — role, preferences, communication style, tech stack.
  • Fact notes: durable facts about the work — decisions, milestones, project conventions.

Both are small, each with a hard cap of a few thousand characters, which forces them to keep only what's genuinely useful long-term. They don't go through retrieval; instead they're frozen directly into the system prompt at the start of every turn, meaning the agent simply "knows" these things without having to remember to go look them up. That's exactly the opposite stance from the knowledge base: the knowledge base is "fetch only when needed, gone after," while cross-session memory is "always present, always visible."

Writes go through a dedicated memory tool that the model calls when it judges, mid-conversation, that "this is worth remembering long-term," supporting add, substring-replace, and delete. What to save and what not to is spelled out clearly in the tool's description: user corrections and preferences are top priority; durable decisions and conventions get saved; while the current task's transient state, one-off debug info, and anything easily re-discoverable do not — memory is for "durable facts about the user and the project," not for "where I got to this time."

There's an easy-to-overlook but quite important detail: a security scan runs before every write. This content enters the system prompt verbatim and persists across sessions for a long time — effectively a durable injection surface. So every memory about to be written to disk is first scanned for suspicious patterns — classic prompt-injection phrasing ("ignore all previous instructions" and the like), commands trying to exfiltrate keys, invisible unicode characters hidden in the text — and a match is rejected outright. With deduplication and over-limit trimming on top, this memory layer stays useful without becoming a liability.
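A pre-write scan of this kind might be sketched as follows. The three patterns are illustrative assumptions covering the categories named above; a production list would be longer and tuned against false positives.

```typescript
// Illustrative patterns only, one per category described above.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  // classic prompt-injection phrasing
  /ignore\s+(all\s+)?(previous|prior|above)\s+instructions/i,
  // shell commands that mention secrets
  /(curl|wget)\b[^\n]*(api[_-]?key|secret|token)/i,
  // zero-width and bidirectional control characters
  /[\u200B-\u200F\u202A-\u202E\u2060-\u2064\uFEFF]/,
];

type ScanResult = { ok: true } | { ok: false; reason: string };

function scanMemoryWrite(text: string): ScanResult {
  for (const pattern of SUSPICIOUS_PATTERNS) {
    if (pattern.test(text)) return { ok: false, reason: `matched ${pattern.source}` };
  }
  return { ok: true };
}
```

The memory tool would call `scanMemoryWrite` before persisting and return the rejection to the model as an ordinary tool error.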

Together, the two mechanisms cover both ends — "huge but occasional" and "small but constant": the knowledge base handles the former, cross-session memory the latter. Add to that the agent's understanding of itself (the subject of the next article), and an Orkas agent walks in carrying three kinds of memory at once — about the material, about the user, and about itself.

Sessions: built to crash, built to heal

A session manages the message history. The basic version is just an in-memory array of messages with history trimming and compaction. But anything running on a user's machine has to assume it can be killed at any moment — the user quits the app, the system reboots, a watchdog timeout takes the process out. So production uses a persistent session, written to a local JSONL file, one message per line.

There are two write strategies: appending a new message uses an atomic append; anything that rewrites the whole file (compaction, clearing) uses "write a temp file + atomic rename." That way, even if the power cuts out mid-write, you never leave a half-written, corrupted record behind.
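Both strategies can be sketched with Node's `fs` module. This is an assumption-laden illustration: the function names are hypothetical, the `fsync` before rename is an addition the article does not mention, and how atomic a single append or rename really is depends on the operating system and filesystem.

```typescript
import {
  appendFileSync, closeSync, fsyncSync, openSync,
  readFileSync, renameSync, writeFileSync,
} from "node:fs";

// Append path: one JSON object per line.
function appendMessage(file: string, message: unknown): void {
  appendFileSync(file, JSON.stringify(message) + "\n");
}

// Rewrite path: write everything to a temp file, flush it, then swap it in.
function rewriteSession(file: string, messages: unknown[]): void {
  const tmp = file + ".tmp";
  const fd = openSync(tmp, "w");
  try {
    writeFileSync(fd, messages.map((m) => JSON.stringify(m) + "\n").join(""));
    fsyncSync(fd); // flush to disk before the rename makes it visible
  } finally {
    closeSync(fd);
  }
  renameSync(tmp, file); // readers see the old file or the new one, never a mix
}

function loadSession(file: string): unknown[] {
  return readFileSync(file, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```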

The most interesting piece is healing orphaned tool calls. Back to that pairing invariant: the model makes a tool call, the harness executes it, the result gets written back — interrupt any of those three steps and you leave an orphan on disk, "a call with no result." Load that session next time and send it as-is to the model, and the API will either reject it or hang.

The healing logic runs every time a session is loaded from disk, and it's idempotent:

  1. Scan all assistant messages and collect the tool-call IDs they made.
  2. Look forward for the matching tool results.
  3. For any call missing a matching result, synthesize one marked "interrupted."
  4. Along the way, align the result order to the call declaration order, and drop any orphan results that have no matching call.
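The four steps can be sketched in one pass. The message shapes and the `"[interrupted]"` marker text are assumptions; the sketch also assumes tool results sit directly after the assistant message that made the calls.

```typescript
type ToolCall = { id: string; name: string };
type ToolMessage = { role: "tool"; callId: string; content: string };
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls: ToolCall[] }
  | ToolMessage;

function healSession(messages: Message[]): Message[] {
  const healed: Message[] = [];
  let i = 0;
  while (i < messages.length) {
    const msg = messages[i];
    i++;
    if (msg.role === "tool") continue; // orphan result with no call: drop
    healed.push(msg);
    if (msg.role !== "assistant" || msg.toolCalls.length === 0) continue;

    // Collect the results that directly follow this assistant message.
    const results = new Map<string, ToolMessage>();
    while (i < messages.length) {
      const next = messages[i];
      if (next.role !== "tool") break;
      results.set(next.callId, next);
      i++;
    }
    // Re-emit in declaration order, synthesizing any missing result.
    // Results whose ID matches no call are left out.
    for (const call of msg.toolCalls) {
      const synthesized: ToolMessage = { role: "tool", callId: call.id, content: "[interrupted]" };
      healed.push(results.get(call.id) ?? synthesized);
    }
  }
  return healed;
}
```

Running it on its own output changes nothing, which is the idempotence the load path relies on.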

After this pass, the session is guaranteed to be in a state that satisfies the API's pairing requirement and is safe to send. The mechanism looks unremarkable, but it's the safety net that keeps a user's conversation from locking up permanently because of a single crash.

A few decisions that mattered in hindsight

Stringing this all together, a few decisions look especially valuable after the fact.

Generators as the primary interface. Streaming and non-streaming share one implementation, intermediate state surfaces naturally, and the UI can paint as much detail as it wants. This saved a whole class of inconsistency bugs that "implement non-streaming first, bolt on streaming later" would have created.

Compact at 60%, not when full. It leaves headroom for compaction itself (which also costs a model call) and avoids scrambling at the last moment.

The pairing invariant runs through everything. From the compaction cut point, to writing to disk, to load-time healing, every place that touches the session holds the same rule. With one rule, no spot has to invent its own patch logic.

Grunt work concentrated in the Provider layer. All the cross-vendor awkwardness — thinking blocks, cache keys, capability differences — gets digested in this one layer, in exchange for a clean runner above. Add a new model vendor someday and the change barely spills out.

Wrapping up

Orkas's harness has no stunning algorithm. Its value is in taking "make an agent run reliably in a real environment" and splitting it into a set of modules with clean boundaries, each owning one slice: the runner owns the loop and retries, tools own capability, the Provider layer owns flattening many models, memory owns retrieval, the session owns persistence and healing. None of them is complex on its own; only together do they hold up something people use every day.

If there's anything to take away: make the run loop a streaming generator and intermediate state gets much easier to handle; once a core invariant is set (like "tool calls must be paired"), hold it consistently across compaction, disk writes, and loading — don't let any corner be the exception; concentrate cross-vendor grunt work in one layer and keep it out of business logic; and — most plainly of all — assume your process will be killed at the worst possible moment, and write the healing for that moment ahead of time.

The next article gets into a more interesting part of Orkas: how this agent learns from its own use, distills experience into reusable skills, and slowly makes itself more useful.
