The Layer That Turns a Model Into a Product: Engineering Orkas's Agent Harness

How Orkas turns model calls into a reliable desktop agent runtime: streaming run loops, tool routing, context compaction, provider abstraction, memory, and crash-safe sessions.

Orkas Team · Jun 10, 2026

Anyone who has shipped an agent product knows the feeling: getting a demo working is fast, but turning it into something a user trusts every day — something that runs all day long on their own machine without falling over — is hard. And the hard part isn't wiring up the model. It's the whole layer that sits around the model.

That layer goes by a few names; the one I'll use is the Agent Harness. It sits between the "big model" and the "product features," and it's the real runtime: it turns a single user request into round after round of conversation with the model, threading tool calls in between, feeding results back, compacting context before it overflows, retrying through network hiccups, and recovering the conversation even after the process crashes. The model does the thinking; the harness makes that thinking land as a reliable sequence of actions.

Orkas is a desktop agent app that runs on the user's own machine, and its harness lives entirely on the client. This article walks through how that layer is built: how it's split into layers, what the run loop looks like, how tools and models are abstracted, and how memory and sessions are handled. Code details have been scrubbed and generalized, but the engineering structure is real.

The layers

Lay an agent product out flat and you get roughly these layers, stacked bottom to top:

┌─────────────────────────────────────────────────────────┐
│  Product features  (chat / skills / connectors / sync)  │
├─────────────────────────────────────────────────────────┤
│  Agent Harness  (run loop / tools / session)            │
├─────────────────────────────────────────────────────────┤
│  Provider abstraction  (unify many LLM vendors)         │
├─────────────────────────────────────────────────────────┤
│  Infrastructure  (types / errors / logging / config)    │
└─────────────────────────────────────────────────────────┘

There's one consequential choice baked in here: all model inference happens on the client. The desktop app is not a thin client — it holds the harness itself and calls the model directly. The server only handles accounts, multi-device sync, and billing; it doesn't run the agent at all. This decision shaped almost everything downstream: sessions land on the local disk, tools operate directly on the user's working directory, and sensitive data never leaves the machine.

The harness itself breaks into a few pieces: the run loop (runner), the session, the tools, the Provider layer, and memory. Let's take them one at a time.

The run loop: a streaming generator

The heart of the harness is the runner. In one sentence, what it does is: talk to the model over and over until the model says "I'm done."

It's implemented as an async generator, and that choice matters. A single agent run is far more than "send a request, wait for a result." A lot happens in between — the model is emitting tokens, it wants to call a tool, the tool finished, the context got long enough to trigger compaction, the network failed and we're retrying. With callbacks or plain Promises, those intermediate states are hard to surface cleanly to the caller. As a generator, they all become a stream of yield-ed events:

type AgentRunEvent =
  | { type: "text_delta"; text: string }              // model emitting tokens
  | { type: "tool_start"; name: string; input: unknown } // a tool starts executing
  | { type: "tool_end"; name: string; result: string }   // a tool finished
  | { type: "compaction"; tokensBefore: number; tokensAfter: number } // context compacted
  | { type: "retry"; attempt: number; reason: string }   // error, retrying
  | { type: "done"; result: AgentRunResult }             // terminal

The UI subscribes to this event stream and paints the model's output and the tool execution in real time. The non-streaming entry point is internally just "consume the stream to the end, take the final done" — both entry points share one implementation, so there's no second code path to drift out of sync.
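To make the "one implementation, two entry points" idea concrete, here is a minimal sketch. The names runStreaming, runToCompletion, and fakeModelStream are hypothetical, the event union is trimmed to two variants, and the shape of AgentRunResult is an assumption, since the article doesn't show it.

```typescript
// Hypothetical, simplified shapes; the real AgentRunResult is not shown in the article.
type AgentRunResult = { text: string; turns: number };

type AgentRunEvent =
  | { type: "text_delta"; text: string }
  | { type: "done"; result: AgentRunResult };

// Stand-in for one streamed model response (assumption: a fake model for the demo).
async function* fakeModelStream(): AsyncGenerator<string> {
  yield "Hello";
  yield ", world";
}

// Streaming entry point: every intermediate state is surfaced as a yielded event.
async function* runStreaming(): AsyncGenerator<AgentRunEvent> {
  let text = "";
  for await (const chunk of fakeModelStream()) {
    text += chunk;
    yield { type: "text_delta", text: chunk };
  }
  yield { type: "done", result: { text, turns: 1 } };
}

// Non-streaming entry point: drain the same generator and keep the terminal event.
async function runToCompletion(): Promise<AgentRunResult> {
  for await (const event of runStreaming()) {
    if (event.type === "done") return event.result;
  }
  throw new Error("stream ended without a done event");
}
```

A UI would iterate runStreaming() with for await and render each event as it arrives, while a batch caller would simply await runToCompletion().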

What happens inside a turn

Unrolled, one turn looks roughly like this:

1. Push the user message (possibly with images) into the session history.
2. Assemble the system prompt, injecting the currently available tools, skill index, and so on.
3. Parse the model string and resolve it to a concrete Provider and model ID.
4. Convert all tools into definitions the model understands, and send them out together with the history.
5. Consume the model's response stream, yield-ing text token by token while collecting any tool calls the model makes.
6. When the stream ends, look at the model's stop reason:

   - If it's tool_use, the model wants to call a tool: run the tools, append their results to the history, then go back to step 4 and ask the model again.
   - Otherwise the turn is over: assemble the result, yield done, and return.

There's one invariant you must hold here: every tool call the model makes must be immediately followed in the history by a matching tool result. The model API enforces this pairing hard — break it and the next request will either error out or just hang. We'll come back to this when we talk about session self-healing.

How a tool call gets routed back

The model doesn't execute tools itself; it only says "I'd like to call read_file with these arguments." Once the runner picks up that intent:

for (const call of toolUseBlocks) {
  yield { type: "tool_start", name: call.name, input: call.input };

  const tool = this.tools.get(call.name);
  const ctx = { workingDir, signal, state: { sandboxEnv } };
  const result = await tool.execute(call.input, ctx);

  // append the result to the session as a tool-result message
  session.addToolResult(call.id, result);

  yield { type: "tool_end", name: call.name, result: result.content };
}

Tools run sequentially, results are written back to the history in the order the model declared them, and then the model is asked again with those results in hand. Seeing the results, the model might call another tool or just give its final answer. This "ask → call → answer → ask again" loop is exactly what lets an agent complete multi-step tasks.

One detail deserves its own mention: some tools return images — screenshots, generated pictures. But many models don't accept images in the tool-result channel. Orkas handles this by splitting the image into a separate user message placed after the tool result — the model first reads "the tool returned this text," then on the very next turn sees the corresponding image. A small compromise that routes around capability differences between Providers.
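The image split described above might look roughly like this. The message shapes and the toHistoryMessages helper are hypothetical, not Orkas's actual types, and the sketch handles a single tool call; with several calls in one round, the image message would have to wait until after the last result to keep the pairing intact.

```typescript
// Hypothetical message shapes for illustration; not Orkas's actual types.
type ImagePart = { kind: "image"; mediaType: string; base64: string };
type ToolResult = { content: string; images?: ImagePart[] };
type HistoryMessage =
  | { role: "tool"; toolCallId: string; content: string }
  | { role: "user"; parts: ImagePart[] };

// Turn one tool result into history messages:
// the text goes in the tool-result channel, any images go in a follow-up user message.
function toHistoryMessages(toolCallId: string, result: ToolResult): HistoryMessage[] {
  const messages: HistoryMessage[] = [
    { role: "tool", toolCallId, content: result.content },
  ];
  if (result.images && result.images.length > 0) {
    messages.push({ role: "user", parts: result.images });
  }
  return messages;
}
```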

What to do when context is about to overflow

The wall long tasks hit most often is the context window. Orkas doesn't wait until it's full — it sets a 60% watermark: after each round of tools, it estimates how much of the window the current tokens occupy, and once that passes 60% it proactively triggers compaction.

Compaction itself asks the model to summarize the earlier conversation, then replaces the old messages with that summary, keeping only the most recent tail. Sounds simple, but there's a trap: after the swap, the retained tail must not start with an "orphan tool result" — there can't be a "result with no matching call," or you've broken the pairing invariant again. So the compaction logic makes sure the cut lands on a clean boundary.
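Here is a small sketch of the watermark check and the "clean boundary" rule. The Msg shape, the keepTail parameter, and the choice to move the cut forward past stray tool results are assumptions made for illustration; the summary string stands in for the model's summarization call.

```typescript
// Hypothetical message shape; roles follow the common chat-API convention.
type Msg = { role: "user" | "assistant" | "tool"; content: string };

const WATERMARK = 0.6; // the 60% threshold from the article

function shouldCompact(estimatedTokens: number, contextWindow: number): boolean {
  return estimatedTokens / contextWindow > WATERMARK;
}

// Pick the index where the retained tail starts: aim to keep `keepTail` messages,
// then move the cut forward past any tool results so the tail never opens with an orphan.
function findCutIndex(history: Msg[], keepTail: number): number {
  let cut = Math.max(0, history.length - keepTail);
  while (cut < history.length && history[cut]?.role === "tool") {
    cut++;
  }
  return cut;
}

// Replace everything before the cut with a single summary message.
function compact(history: Msg[], summary: string, keepTail: number): Msg[] {
  const cut = findCutIndex(history, keepTail);
  const summaryMsg: Msg = { role: "user", content: `Summary of earlier conversation: ${summary}` };
  return [summaryMsg, ...history.slice(cut)];
}
```

Moving the cut forward drops a result whose call is being summarized away; moving it backward to include the call would be an equally valid rule.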

There's a more interesting choice worth unpacking here: why the coarse "summarize the whole block at 60%" approach, rather than something more fine-grained — scoring each message and trimming by importance, structured extraction from tool outputs, maintaining a layered memory tree? Those approaches look great in papers, but we deliberately didn't go that way, for three reasons.

First, caching. The model's prompt cache hits by prefix: as long as the history's prefix is unchanged, that span hits the cache, saving both money and latency. Fine-grained compaction constantly rewrites the middle of the history, which means repeatedly shattering the cached prefix — every edit forces a large re-prefill. The "leave it alone, then compact once at the watermark" strategy keeps the prefix stable across the vast majority of turns, with only that single compaction invalidating it. Far friendlier to the cache.

Second, complexity. That "every tool call must be paired" invariant we keep hammering on — the more finely you trim history, the more likely you break it in some corner. A coarse summary only has to protect one clean cut point; there are an order of magnitude fewer places to get it wrong. One fewer class of edge case is one fewer class of production incident.

Third, riding the dividend of better models. Context windows have grown steadily over the past couple of years, and models handle long context better and better. Pouring effort today into an elaborate compaction algorithm is essentially fighting a problem that's shrinking — odds are you finish tuning it just as the next generation doubles its window, and your complexity becomes pure liability. Conversely, handing the summarization to the model itself gets automatically better as the model improves: the better it is at picking out what matters, the higher the summary quality, and we don't change a line. Complexity the model can carry for you is complexity you shouldn't carry yourself.

Token estimation hides one easy-to-miss issue: Chinese. Estimate Chinese using English intuitions (roughly one token per few characters) and you'll badly undercount. Orkas weights CJK characters separately in its estimate; otherwise the watermark for an all-Chinese conversation reads wrong, and compaction won't fire when it should.
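A rough sketch of a CJK-aware estimate follows. The weights are assumptions chosen for illustration (real tokenizers differ by model), and estimateTokens is a hypothetical name.

```typescript
// Assumed weights for illustration only; real tokenizers vary by model.
const CHARS_PER_TOKEN_LATIN = 4; // rough rule of thumb for English text
const TOKENS_PER_CJK_CHAR = 1; // assumption: about one token per CJK character

// Hiragana, Katakana, CJK Unified Ideographs (plus Extension A), and Hangul syllables.
const CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

function estimateTokens(text: string): number {
  let cjkChars = 0;
  let otherChars = 0;
  for (const ch of text) {
    if (CJK.test(ch)) cjkChars++;
    else otherChars++;
  }
  return Math.ceil(cjkChars * TOKENS_PER_CJK_CHAR + otherChars / CHARS_PER_TOKEN_LATIN);
}
```

With these weights, a four-character Chinese string counts as 4 tokens, where a naive "length divided by four" estimate would report 1.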

Errors and retries

Because the harness runs on the user's machine and depends on an external model API, errors are the norm, not the exception. The runner sorts them into a few classes and treats each differently:

Retryable: rate limits, timeouts, dropped connections, 5xx. Exponential backoff with jitter, capped at 30 seconds; if it's a rate limit and the server sent retry-after, honor it.
Non-retryable: things like auth failures — no number of retries will help, so error out immediately.
Special: context overflow. Try compaction first, retry once after, and only error out if that still fails.

There's one more class: "the tool itself failed." This doesn't sink the whole turn — a tool failure is itself information for the model, which, seeing "that command errored," can perfectly well try a different approach. The harness distinguishes these transient tool errors from real faults: it neither interrupts the flow nor loses them — they show up in after-the-fact stats. (That data later feeds the self-evolution mechanism, which is the subject of the next article.)

The external cancellation signal (AbortSignal) is checked at every key point. The user hits "stop," and the current turn halts immediately — no new retries get kicked off.
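The retry policy can be sketched as two pure functions. The status-code mapping, the one-second base delay, and the "full jitter" variant are assumptions; the article only states exponential backoff with jitter, a 30-second cap, and honoring retry-after.

```typescript
// Hypothetical error classification; the status codes map to the classes listed above.
type ErrorClass = "retryable" | "fatal" | "context_overflow";

function classify(status: number, isContextOverflow = false): ErrorClass {
  if (isContextOverflow) return "context_overflow";
  if (status === 429 || status === 408 || status >= 500) return "retryable";
  return "fatal"; // e.g. 401/403 auth failures: retrying will not help
}

const BASE_DELAY_MS = 1000; // assumption: the article does not state the base delay
const MAX_DELAY_MS = 30000; // the 30-second cap from the article

// Exponential backoff with full jitter; a server-provided retry-after wins when present.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  retryAfterSec?: number,
  random: () => number = Math.random,
): number {
  if (retryAfterSec !== undefined) return retryAfterSec * 1000;
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

In a real loop the wait itself would also listen to the AbortSignal, so a user pressing "stop" ends the sleep instead of waiting it out.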

Tool abstraction: simple enough to extend

The tool interface is deliberately thin:

interface AgentTool {
  readonly name: string;
  readonly description: string;          // shown to the model
  readonly inputSchema: Record<string, unknown>;  // JSON Schema to constrain inputs
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

A tool is just "a name + a description for the model + an input schema + an execute function." The built-ins — read file, write file, run a shell command, web search and fetch — all implement this interface. The desktop layer stacks a batch of locally-flavored tools on top (knowledge-base search, image generation, calling external connectors), but the interface is the same one.

The payoff of a thin interface is that where a tool comes from is irrelevant to the runner: built-in, user-defined, or loaded from a skill — they're all the same kind of thing, registered into one Map<string, AgentTool> and converted into model-readable definitions each turn.
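As an illustration, a registry built on that Map could look like the sketch below. ToolContext and ToolResult are simplified stand-ins, and ToolRegistry, ToolDefinition, and echoTool are hypothetical names; each vendor's real tool-definition format differs.

```typescript
// ToolContext and ToolResult are simplified stand-ins for the article's types.
type ToolContext = { workingDir: string };
type ToolResult = { content: string };

interface AgentTool {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: Record<string, unknown>;
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

// Hypothetical wire format; each vendor's real tool-definition shape differs.
type ToolDefinition = { name: string; description: string; input_schema: Record<string, unknown> };

class ToolRegistry {
  private readonly tools = new Map<string, AgentTool>();

  register(tool: AgentTool): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  get(name: string): AgentTool | undefined {
    return this.tools.get(name);
  }

  // Rebuilt each turn, so tools registered mid-session show up on the next request.
  toDefinitions(): ToolDefinition[] {
    return Array.from(this.tools.values(), (t) => ({
      name: t.name,
      description: t.description,
      input_schema: t.inputSchema,
    }));
  }
}

// A trivial tool: the registry does not care whether it is built-in or user-defined.
const echoTool: AgentTool = {
  name: "echo",
  description: "Return the input text unchanged.",
  inputSchema: { type: "object", properties: { text: { type: "string" } }, required: ["text"] },
  async execute(input) {
    return { content: String(input["text"]) };
  },
};
```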

Side-effecting tools like shell commands go through an isolated executor: timeouts, output-length limits, a command blocklist, and environment variables passed in separately rather than mutating the process's global environment — the latter would leak into a swarm of child processes and, in a multi-process architecture like Electron, can easily break startup.

The Provider layer: flattening many models into one interface

Users' model preferences are all over the map, and a product can't hard-bind to one vendor. Below the harness, Orkas lays down a Provider abstraction that unifies different vendors' models behind a single interface:

interface LLMProvider {
  readonly id: string;
  complete(params: CompletionParams): Promise<CompletionResult>;
  stream(params: CompletionParams): AsyncIterable<StreamEvent>;
  validateAuth(): Promise<boolean>;
}

The runner above only ever talks to this interface; it has no idea which vendor is behind it. A registry handles routing by model string: an explicit provider/model form is split directly; a bare model name is attributed by prefix. Auth (an API key or OAuth token) is managed here too, and an expired OAuth token is refreshed automatically.
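The routing rule can be sketched in a few lines. The prefix table, the provider ids, and the resolveModel name are all made up for the example; they don't correspond to real vendors.

```typescript
// Hypothetical prefix table; the provider ids and prefixes are made up for the example.
const PREFIX_TO_PROVIDER: Array<[string, string]> = [
  ["alpha-", "vendor-a"],
  ["beta-", "vendor-b"],
];

type ResolvedModel = { providerId: string; modelId: string };

function resolveModel(modelString: string): ResolvedModel {
  const slash = modelString.indexOf("/");
  if (slash > 0) {
    // Explicit "provider/model": split on the first slash only, since model ids may contain more.
    return {
      providerId: modelString.slice(0, slash),
      modelId: modelString.slice(slash + 1),
    };
  }
  // Bare model name: attribute it to a provider by prefix.
  for (const [prefix, providerId] of PREFIX_TO_PROVIDER) {
    if (modelString.startsWith(prefix)) return { providerId, modelId: modelString };
  }
  throw new Error(`no provider registered for model "${modelString}"`);
}
```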

When flattening many models, the real headache isn't text completion; it's the corners where vendors' semantics disagree. Two examples that bit us.

One is preserving thinking blocks across vendors. Reasoning models emit a span of "thinking" content; some vendors encrypt it and require you to echo it back verbatim, others represent it with a different set of fields. If a user switches from vendor A to vendor B mid-conversation, the signature on that thinking span in the history no longer matches. The fix is to stamp every message in the history with "which model produced this," so the transform layer can decide whether to keep it verbatim: same model, keep it; different model, downgrade per the rules.
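A sketch of the stamping idea is below. The block shapes, the producedBy field, and the downgrade rule (drop thinking blocks produced by a different model) are assumptions; the article says only that foreign blocks are downgraded "per the rules."

```typescript
// Hypothetical content-block shapes; real vendor formats differ.
type Block =
  | { kind: "text"; text: string }
  | { kind: "thinking"; text: string; signature?: string };

// `producedBy` is the stamp: which model generated this message.
type StampedMessage = { role: "user" | "assistant"; producedBy?: string; blocks: Block[] };

// Prepare history for `targetModel`. One possible downgrade rule (an assumption):
// drop thinking blocks that another model produced, since their signatures will not verify.
function transformForModel(history: StampedMessage[], targetModel: string): StampedMessage[] {
  return history.map((msg) => {
    if (msg.role !== "assistant" || msg.producedBy === targetModel) return msg;
    return { ...msg, blocks: msg.blocks.filter((b) => b.kind !== "thinking") };
  });
}
```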

The other is the prompt cache. Across turns of one session the prefix is highly repetitive, and caching it saves meaningful cost and latency. The implementation passes the session ID as the cache key to vendors that support it, handling each vendor's key-length limits along the way (truncate or hash if it's too long, say).
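Deriving a vendor-safe cache key from a session ID might look like this. The length limits are assumptions, and the 32-bit FNV-1a hash is used only to keep the example dependency-free; a production system would more likely reach for a cryptographic hash.

```typescript
// 32-bit FNV-1a over UTF-16 code units, used here only to keep the example dependency-free.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// `maxLength` is the per-vendor key-length limit (the values used below are assumptions).
function cacheKeyFor(sessionId: string, maxLength: number): string {
  if (sessionId.length <= maxLength) return sessionId;
  const digest = fnv1a(sessionId); // always 8 hex characters
  if (maxLength <= digest.length) return digest.slice(0, maxLength);
  // Keep a readable prefix and append a hash of the full id so truncated keys stay distinct.
  return `${sessionId.slice(0, maxLength - digest.length - 1)}-${digest}`;
}
```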

These are all grunt work — but it's precisely this layer of grunt work that lets the runner above pretend "there's only one kind of model."

Memory: two mechanisms, each for its own job

"Memory" in Orkas is actually two parallel mechanisms solving two completely different problems. One is a retrieval-based knowledge base, for the bulk material you "go look up when needed." The other is cross-session memory, for the small set of key facts you "should always have in mind." Many products mash these two together; keeping them separate makes things much clearer.

Knowledge base: hybrid retrieval

The first mechanism targets content that's large but only occasionally relevant — the user's documents, past notes, domain knowledge. This is a local knowledge base with vector retrieval, in two backends: a lightweight pure-memory version (for tests and ephemeral use), and a version persisted to a local database (for production, with full-text indexing and vectors).

Data comes in along this path:

documents → chunk on line boundaries (with overlap) → dual indexing
                                          ├─ full-text index (keywords, no embedding cost)
                                          └─ vector index (if an embedding model is configured)

Chunks are cut on line boundaries with a little overlap between them, to avoid slicing a complete piece of meaning down the middle. Retrieval is hybrid: a vector pass (semantically close) and a keyword pass (literal hits), with the two result sets merged via RRF (Reciprocal Rank Fusion):

score = Σ  1 / (k + rank_i)

Here rank_i is the result's position in pass i and k is a smoothing constant. The higher a result ranks within a pass, the more it contributes. Summing across both passes rewards semantic relevance without losing exact literal matches. The vector and keyword weights are tunable and default to favoring semantics. After merging, results are deduplicated by "(document, start line)," keeping only the best one per location; anything below a score threshold is then cut, and the top-K are returned.
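The fusion step can be sketched as follows. The constant k = 60 is a commonly used default, not a value the article states, and the 0.6 / 0.4 weights in the usage note are likewise assumptions. Keying the map by location merges both passes' contributions for the same chunk, which is one simple way to get the deduplication described above.

```typescript
// One retrieval hit; `docId` plus `startLine` identifies a chunk location.
type Hit = { docId: string; startLine: number; text: string };
type Scored = Hit & { score: number };

const RRF_K = 60; // commonly used constant; the article does not state Orkas's value

// Weighted Reciprocal Rank Fusion over ranked lists (best hit first in each list).
function fuse(
  passes: Array<{ hits: Hit[]; weight: number }>,
  topK: number,
  minScore = 0,
): Scored[] {
  const byLocation = new Map<string, Scored>();
  for (const { hits, weight } of passes) {
    hits.forEach((hit, index) => {
      const key = `${hit.docId}:${hit.startLine}`; // dedup key: (document, start line)
      const contribution = weight / (RRF_K + index + 1); // ranks are 1-based
      const existing = byLocation.get(key);
      if (existing) existing.score += contribution;
      else byLocation.set(key, { ...hit, score: contribution });
    });
  }
  return Array.from(byLocation.values())
    .filter((s) => s.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Calling fuse with a vector pass weighted 0.6 and a keyword pass weighted 0.4 lets a chunk that appears in both lists outrank one that tops only a single list.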

Why not rely on vectors alone? Because vector retrieval often face-plants on proper nouns, code symbols, and exact literal strings — queries that aren't semantically special but where the literal matters a lot — while keyword-only can't catch "same meaning, different phrasing." Running both is a very practical trade-off between retrieval quality and cost.

Cross-session memory: keeping the user in mind

The knowledge base solves "too much material to hold." But there's another class of thing — tiny in volume, yet it must stay in mind at all times: who this user is, what they prefer, what was agreed last time. These shouldn't depend on retrieval to "get lucky and recall" — they should be present every single turn.

For this, Orkas builds a separate layer of cross-session memory, split by content into two parts:

User profile: stable facts about the person — role, preferences, communication style, tech stack.
Fact notes: durable facts about the work — decisions, milestones, project conventions.

Both are small, each with a hard cap of a few thousand characters, which forces them to keep only what's genuinely useful long-term. They don't go through retrieval; instead they're frozen directly into the system prompt at the start of every turn — meaning the agent simply "knows" these things, without having to remember to go look them up. That's exactly the opposite stance from the knowledge base: the knowledge base is "fetch only when needed, gone after," cross-session memory is "always present, always visible."

Writes go through a dedicated memory tool that the model calls when it judges, mid-conversation, that "this is worth remembering long-term," supporting add, substring-replace, and delete. What to save and what not to is spelled out clearly in the tool's description: user corrections and preferences are top priority; durable decisions and conventions get saved; while the current task's transient state, one-off debug info, and anything easily re-discoverable do not — memory is for "durable facts about the user and the project," not for "where I got to this time."

There's an easy-to-overlook but quite important detail: a security scan runs before every write. This content enters the system prompt verbatim and persists across sessions for a long time — effectively a durable injection surface. So every memory about to be written to disk is first scanned for suspicious patterns — classic prompt-injection phrasing ("ignore all previous instructions" and the like), commands trying to exfiltrate keys, invisible unicode characters hidden in the text — and a match is rejected outright. With deduplication and over-limit trimming on top, this memory layer stays useful without becoming a liability.
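A pre-write scan along these lines could be as small as the sketch below. The patterns are illustrative only, and scanMemoryWrite is a hypothetical name; a real scanner would carry a longer, maintained list.

```typescript
// Illustrative patterns only; a real scanner would carry a longer, maintained list.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i, // classic prompt-injection phrasing
  /(curl|wget)\s+\S+.*(api[_-]?key|token|secret)/i, // commands that look like key exfiltration
  /[\u200B-\u200F\u202A-\u202E\u2060\uFEFF]/, // zero-width and bidi control characters
];

type ScanResult = { ok: true } | { ok: false; reason: string };

// Run before every memory write; a single match rejects the entry outright.
function scanMemoryWrite(entry: string): ScanResult {
  for (const pattern of SUSPICIOUS_PATTERNS) {
    if (pattern.test(entry)) {
      return { ok: false, reason: `matched ${pattern.source}` };
    }
  }
  return { ok: true };
}
```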

Together, the two mechanisms cover both ends — "huge but occasional" and "small but constant": the knowledge base handles the former, cross-session memory the latter. Add to that the agent's understanding of itself (the subject of the next article), and an Orkas agent walks in carrying three kinds of memory at once — about the material, about the user, and about itself.

Sessions: built to crash, built to heal

A session manages the message history. The basic version is just an in-memory array of messages with history trimming and compaction. But anything running on a user's machine has to assume it can be killed at any moment — the user quits the app, the system reboots, a watchdog timeout takes the process out. So production uses a persistent session, written to a local JSONL file, one message per line.

There are two write strategies: appending a new message uses an atomic append; anything that rewrites the whole file (compaction, clearing) uses "write a temp file + atomic rename." That way, even if power cuts out mid-write, you never leave a half-written record behind.

The most interesting piece is healing orphaned tool calls. Back to that pairing invariant: the model makes a tool call, the harness executes it, the result gets written back — interrupt any of those three steps and you leave an orphan on disk, "a call with no result." Load that session next time and send it as-is to the model, and the API will either reject it or hang.

The healing logic runs every time a session is loaded from disk, and it's idempotent:

Scan all assistant messages and collect the tool-call IDs they made.
Look forward for the matching tool results.
For any call missing a matching result, synthesize one marked "interrupted."
Along the way, align the result order to the call declaration order, and drop any orphan results that have no matching call.

After this pass, the session is guaranteed to be in a state that satisfies the API's pairing requirement and is safe to send. The mechanism looks unremarkable, but it's the safety net that keeps a user's conversation from locking up permanently because of one crash.
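The healing pass can be sketched as a single function over the loaded messages. The Stored shapes, the toolCallIds field, and the "[interrupted]" marker text are assumptions made for illustration.

```typescript
// Hypothetical persisted-message shapes, one per JSONL line.
type Stored =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCallIds: string[] }
  | { role: "tool"; toolCallId: string; content: string };

// Rebuild the history so every tool call is directly followed by exactly one result,
// in declaration order. Missing results are synthesized; results with no call are dropped.
function healSession(messages: Stored[]): Stored[] {
  const healed: Stored[] = [];
  let i = 0;
  while (i < messages.length) {
    const msg = messages[i];
    i++;
    // A tool result reached here has no preceding call: it is an orphan, so drop it.
    if (msg === undefined || msg.role === "tool") continue;
    healed.push(msg);
    if (msg.role !== "assistant" || msg.toolCallIds.length === 0) continue;

    // Gather the results that directly follow this assistant message.
    const results = new Map<string, Stored>();
    while (i < messages.length) {
      const next = messages[i];
      if (next === undefined || next.role !== "tool") break;
      results.set(next.toolCallId, next);
      i++;
    }
    // Emit one result per declared call, synthesizing any that are missing.
    for (const id of msg.toolCallIds) {
      healed.push(results.get(id) ?? { role: "tool", toolCallId: id, content: "[interrupted]" });
    }
  }
  return healed;
}
```

Running the function on its own output changes nothing, which is what makes it safe to run on every load.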

A few decisions that mattered in hindsight

Stringing this all together, a few decisions look especially valuable after the fact.

Generators as the primary interface. Streaming and non-streaming share one implementation, intermediate state surfaces naturally, and the UI can paint as much detail as it wants. This saved a whole class of inconsistency bugs that "implement non-streaming first, bolt on streaming later" would have created.

Compact at 60%, not when full. It leaves headroom for compaction itself (which also costs a model call) and avoids scrambling at the last moment.

The pairing invariant runs through everything. From the compaction cut point, to writing to disk, to load-time healing, every place that touches the session holds the same rule. With one rule, no spot has to invent its own patch logic.

Grunt work concentrated in the Provider layer. All the cross-vendor awkwardness — thinking blocks, cache keys, capability differences — gets digested in this one layer, in exchange for a clean runner above. Add a new model vendor someday and the change barely spills out.

Wrapping up

Orkas's harness has no stunning algorithm. Its value is in taking "make an agent run reliably in a real environment" and splitting it into a set of modules with clean boundaries, each owning one slice: the runner owns the loop and retries, tools own capability, the Provider layer owns flattening many models, memory owns retrieval and the durable facts about the user, the session owns persistence and healing. None of them is complex on its own; only together do they hold up something people use every day.

If there's anything to take away: make the run loop a streaming generator and intermediate state gets much easier to handle; once a core invariant is set (like "tool calls must be paired"), hold it consistently across compaction, disk writes, and loading — don't let any corner be the exception; concentrate cross-vendor grunt work in one layer and keep it out of business logic; and — most plainly of all — assume your process will be killed at the worst possible moment, and write the healing for that moment ahead of time.

The next article gets into a more interesting part of Orkas: how this agent learns from its own use, distills experience into reusable skills, and slowly makes itself more useful.

Qualquer pessoa que tenha enviado um produto para agente conhece a sensação: fazer uma demonstração funcionar é rápido, mas transformá-la em algo em que o usuário confia todos os dias — algo que funciona o dia todo em sua própria máquina sem cair — é difícil. E a parte difícil não é conectar o modelo. É toda a camada que fica ao redor do modelo.

Essa camada tem alguns nomes; o que usarei é o Agent Harness. Ele fica entre o "grande modelo" e os "recursos do produto" e é o verdadeiro tempo de execução: transforma uma única solicitação do usuário em rodada após rodada de conversa com o modelo, encadeando chamadas de ferramenta entre elas, alimentando os resultados, compactando o contexto antes que ele transborde, tentando novamente através de soluços de rede e recuperando a conversa mesmo após o processo travar. O modelo pensa; o arnês faz com que esse pensamento se transforme em uma sequência confiável de ações.

Orkas é um aplicativo de agente de desktop executado na própria máquina do usuário e seu funcionamento fica inteiramente no cliente. Este artigo explica como essa camada é construída: como ela é dividida em camadas, como é o loop de execução, como as ferramentas e os modelos são abstraídos e como a memória e as sessões são tratadas. Os detalhes do código foram eliminados e generalizados, mas a estrutura de engenharia é real.

As camadas

Coloque um produto de agente na horizontal e você obterá aproximadamente estas camadas, empilhadas de baixo para cima:

┌───────── ────────────────────────────────┐ │ Recursos do produto (chat/habilidades/conectores/sincronização) │ ├──────────────────── ─────────────────────┤ │ Agente Harness (loop de execução / ferramentas / sessão) │ ├──────────────────── ─────────────────────┤ │ Abstração de provedor (unificar muitos fornecedores LLM) │ ├──────────────────── ─────────────────────┤ │ Infraestrutura (tipos/erros/registro/configuração) │ └─────────────────────────── ──────────────┘

Há uma escolha importante aqui: toda a inferência do modelo acontece no cliente. O aplicativo de desktop não é um thin client – ele segura o chicote e chama o modelo diretamente. O servidor lida apenas com contas, sincronização de vários dispositivos e cobrança; ele não executa o agente. Essa decisão moldou quase tudo no downstream: as sessões chegam ao disco local, as ferramentas operam diretamente no diretório de trabalho do usuário e os dados confidenciais nunca saem da máquina.

O chicote em si se divide em algumas partes: o loop de execução (executor), a sessão, as ferramentas, a camada do Provedor e a memória. Vamos analisá-los um de cada vez.

O loop de execução: um gerador de streaming

O coração do arnês é o corredor. Em uma frase, o que ele faz é: falar com o modelo repetidamente até que ele diga "Terminei".

Ele é implementado como um gerador assíncrono e essa escolha é importante. A execução de um único agente é muito mais do que "enviar uma solicitação e aguardar um resultado". Muita coisa acontece nesse meio tempo - o modelo está emitindo tokens, quer chamar uma ferramenta, a ferramenta foi finalizada, o contexto demorou o suficiente para acionar a compactação, a rede falhou e estamos tentando novamente. Com retornos de chamada ou promessas simples, esses estados intermediários são difíceis de serem apresentados de forma clara ao chamador. Como geradores, todos eles se tornam um fluxo de eventos gerados por yield:

tipo AgentRunEvent =
  | { tipo: "texto_delta"; text: string } // modelo emitindo tokens
  | { tipo: "tool_start"; nome: sequência; input: desconhecido } // uma ferramenta começa a ser executada
  | { tipo: "tool_end"; nome: sequência; resultado: string } // uma ferramenta finalizada
  | { tipo: "compactação"; tokensAntes: número; tokensAfter: number } // contexto compactado
  | { tipo: "tentar novamente"; tentativa: número; razão: string } // erro, tentando novamente
  | { tipo: "concluído"; resultado: AgentRunResult } // terminal

A IU se inscreve nesse fluxo de eventos e pinta a saída do modelo e a execução da ferramenta em tempo real. O ponto de entrada sem streaming é internamente apenas "consumir o stream até o fim, fazer o feito" — ambos os pontos de entrada compartilham uma implementação, portanto não há um segundo caminho de código para ficar fora de sincronia.

O que acontece dentro de uma curva

Desenrolado, um turno fica mais ou menos assim:

Envie a mensagem do usuário (possivelmente com imagens) para o histórico da sessão.
Monte o prompt do sistema, injetando as ferramentas atualmente disponíveis, o índice de habilidades e assim por diante.
Analise a string do modelo e resolva-a para um provedor concreto e um ID de modelo.
Converta todas as ferramentas em definições que o modelo entenda e envie-as junto com o histórico.
Consuma o fluxo de resposta do modelo, produzindo-ing token de texto por token enquanto coleta quaisquer chamadas de ferramenta feitas pelo modelo.
Quando o fluxo terminar, observe o motivo de parada do modelo:

Se for tool_use, o modelo quer chamar uma ferramenta - execute as ferramentas, depois volte para a etapa 5 e pergunte ao modelo novamente.
Caso contrário, a virada acabou - monte o resultado, rendimento feito e retorne.

Há uma invariante que você deve manter aqui: cada chamada de ferramenta que o modelo faz deve ser imediatamente seguida no histórico por um resultado de ferramenta correspondente. A API do modelo impõe esse emparelhamento com força – interrompa-o e a próxima solicitação apresentará um erro ou simplesmente travará. Voltaremos a isso quando falarmos sobre a autocura da sessão.

Como uma chamada de ferramenta é encaminhada de volta

O modelo não executa ferramentas por si só; diz apenas "Gostaria de chamar read_file com esses argumentos." Assim que o executor captar essa intenção:

for (chamada const de toolUseBlocks) {
  rendimento {tipo: "tool_start", nome: call.name, entrada: call.input };

  ferramenta const = this.tools.get(call.name);
  const ctx = {workingDir, sinal, estado: { sandboxEnv } };
  resultado const = aguardar ferramenta.execute(call.input, ctx);

  // acrescenta o resultado à sessão como uma mensagem de resultado da ferramenta
  session.addToolResult(call.id, resultado);

  rendimento {tipo: "tool_end", nome: call.name, resultado: resultado.content };
}

As ferramentas são executadas sequencialmente, os resultados são gravados no histórico na ordem em que o modelo os declarou e, em seguida, o modelo é questionado novamente com esses resultados em mãos. Vendo os resultados, o modelo pode chamar outra ferramenta ou apenas dar a resposta final. Esse ciclo "perguntar → ligar → atender → perguntar novamente" é exatamente o que permite que um agente conclua tarefas de várias etapas.

Um detalhe merece destaque: algumas ferramentas retornam imagens — capturas de tela, imagens geradas. Mas muitos modelos não aceitam imagens no canal ferramenta-resultado. Orkas lida com isso dividindo a imagem em uma mensagem de usuário separada colocada após o resultado da ferramenta — o modelo primeiro lê "a ferramenta retornou este texto" e, no turno seguinte, vê a imagem correspondente. Um pequeno compromisso que contorna as diferenças de capacidade entre os provedores.

O que fazer quando o contexto está prestes a transbordar

A tarefa mais longa atingida com mais frequência é a janela de contexto. Orkas não espera até que esteja cheio — ele define uma marca d’água de 60%: após cada rodada de ferramentas, ele estima quanto da janela os tokens atuais ocupam e, quando essa marca passa de 60%, ele aciona proativamente a compactação.

A própria compactação pede ao modelo para resumir a conversa anterior e, em seguida, substitui as mensagens antigas por esse resumo, mantendo apenas o final mais recente. Parece simples, mas há uma armadilha: após a troca, a cauda retida não deve começar com um "resultado de ferramenta órfã" - não pode haver um "resultado sem chamada correspondente" ou você quebrou o invariante de emparelhamento novamente. Portanto, a lógica de compactação garante que o corte fique em um limite limpo.

There's a more interesting choice worth unpacking here: why the coarse approach of "summarize the whole block at 60%" rather than something finer-grained, such as scoring each message and trimming by importance, doing structured extraction on tool outputs, or maintaining a layered memory tree? Those approaches look great in papers, but we deliberately didn't go down that path, for three reasons.

First, caching. The model's prompt cache hits by prefix: as long as the prefix of the history stays unchanged, that span hits the cache, saving money and latency. Fine-grained compaction constantly rewrites the middle of the history, which means repeatedly breaking the cached prefix, and every edit forces a large re-prefill. The "leave it alone, then compact once at the watermark" strategy keeps the prefix stable on the vast majority of turns, with only the compaction itself invalidating it once. That is far friendlier to the cache.

Second, complexity. The invariant we keep hammering on, "every tool call must be paired", gets easier to break in some corner the more finely you trim the history. A coarse summary only has to protect one clean cut point, so there are an order of magnitude fewer places to get it wrong. One fewer class of edge case is one fewer class of production incident.

Third, riding the dividend of better models. Context windows have grown steadily over the past few years, and models keep getting better at handling long context. Investing effort today in an elaborate compaction algorithm is essentially fighting a shrinking problem: you may well finish tuning it just as the next generation doubles its window, and your complexity becomes pure liability. Handing the summary to the model itself, by contrast, gets better automatically as the model improves: the better it is at picking out what matters, the higher the summary quality, and we don't change a line. Complexity the model can carry for you is complexity you shouldn't carry yourself.

Token estimation hides an easy-to-miss problem: Chinese. Estimate Chinese using English intuitions (roughly one token per few characters) and you'll badly underestimate. Orkas weights CJK characters separately in its estimate; otherwise the watermark for an all-Chinese conversation would be consistently off and compaction wouldn't trigger when it should.
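A rough sketch of a CJK-aware estimate and the watermark check might look like this. The weights below (about four non-CJK characters per token, about one token per CJK character) are assumptions for illustration; real tokenizers vary by model, and `estimateTokens` and `shouldCompact` are hypothetical names rather than Orkas's API.

```typescript
// Assumed weights, for illustration only.
const CHARS_PER_TOKEN_NON_CJK = 4;
const TOKENS_PER_CJK_CHAR = 1;

// Hiragana, Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

function estimateTokens(text: string): number {
  const cjkCount = (text.match(CJK_CHAR) ?? []).length;
  const otherCount = text.length - cjkCount;
  return Math.ceil(cjkCount * TOKENS_PER_CJK_CHAR + otherCount / CHARS_PER_TOKEN_NON_CJK);
}

// True once estimated usage crosses the watermark (60% in the article).
function shouldCompact(texts: string[], contextWindow: number, watermark = 0.6): boolean {
  const used = texts.reduce((sum, t) => sum + estimateTokens(t), 0);
  return used / contextWindow > watermark;
}
```

With these weights, eight Latin characters count as two tokens while eight Chinese characters count as eight, which is the gap that would otherwise keep an all-Chinese conversation under the watermark for too long.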

Errors and retries

Running on the user's machine and depending on an external model API, errors are the norm, not the exception. The runner sorts them into a few classes and handles each differently:

Retryable: rate limits, timeouts, dropped connections, 5xx. Exponential backoff with jitter, capped at 30 seconds; if it's a rate limit and the server sent retry-after, honor it.
Non-retryable: things like authentication failures. No number of retries will help, so surface the error immediately.
Special: context overflow. Try compacting first, retry once afterward, and only error out if it still fails.
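The three classes and the delay calculation can be sketched as below. Everything here is illustrative: the `ApiError` shape, the error codes, the one-second base delay, and the choice of "full jitter" are assumptions, since the article only states exponential backoff with jitter, a 30-second cap, and honoring retry-after.

```typescript
type ErrorClass = "retryable" | "fatal" | "context_overflow";

// Hypothetical error shape; real provider SDKs expose this differently.
interface ApiError {
  status?: number;       // HTTP status, if any
  code?: string;         // e.g. "timeout", "connection_reset", "context_overflow"
  retryAfterMs?: number; // parsed from a retry-after header, if present
}

function classify(err: ApiError): ErrorClass {
  if (err.code === "context_overflow") return "context_overflow";
  if (err.status === 429 || (err.status !== undefined && err.status >= 500)) return "retryable";
  if (err.code === "timeout" || err.code === "connection_reset") return "retryable";
  return "fatal"; // auth failures and anything unrecognized
}

const MAX_DELAY_MS = 30_000;

// Exponential backoff with full jitter, capped at 30 seconds. A retry-after
// sent with a rate limit takes precedence over the computed delay.
function retryDelayMs(
  attempt: number,
  err: ApiError,
  random: () => number = Math.random,
  baseMs = 1_000,
): number {
  if (err.status === 429 && err.retryAfterMs !== undefined) return err.retryAfterMs;
  const ceiling = Math.min(MAX_DELAY_MS, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Passing `random` as a parameter keeps the function deterministic in tests; whether a server-sent retry-after should also be capped is a policy decision this sketch leaves open.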

There's one more class: "the tool itself failed." This doesn't sink the whole turn. A tool failure is itself information for the model, which, seeing "that command errored", can perfectly well try a different approach. The harness distinguishes these transient tool errors from real failures: it neither interrupts the flow nor loses them, and they show up in later statistics. (That data later feeds the self-evolution mechanism, which is the subject of the next article.)

The external cancellation signal (AbortSignal) is checked at every key point. When the user clicks "stop", the current turn stops immediately and no new retry is started.

Tool abstraction: simple enough to extend

The tool interface is deliberately thin:

interface AgentTool {
  readonly name: string;
  readonly description: string;          // shown to the model
  readonly inputSchema: Record<string, unknown>;  // JSON Schema constraining inputs
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

A tool is just "a name + a description for the model + an input schema + an execute function". The built-ins (read file, write file, run a shell command, web search and fetch) all implement this interface. The desktop layer stacks a batch of locally flavored tools on top (knowledge-base search, image generation, calling external connectors), but the interface is the same.

The upside of a thin interface is that where a tool comes from is irrelevant to the runner: built-in, user-defined, or loaded from a skill, they are all the same kind of thing, registered in a Map<string, AgentTool> and converted into model-readable definitions every turn.
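To make that concrete, here is a small sketch of registering a tool and deriving the definitions sent to the model. `ToolContext` and `ToolResult` are reduced to the bare minimum because the article does not spell out their fields, and `echoTool`, `registry`, and `toModelDefinitions` are hypothetical names used only for illustration.

```typescript
// Minimal stand-ins; the real types carry more than this.
interface ToolContext { workingDir: string }
interface ToolResult { content: string }

interface AgentTool {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: Record<string, unknown>;
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

// A toy tool, for illustration only.
const echoTool: AgentTool = {
  name: "echo",
  description: "Returns the given text unchanged.",
  inputSchema: {
    type: "object",
    properties: { text: { type: "string" } },
    required: ["text"],
  },
  async execute(input) {
    return { content: String(input["text"] ?? "") };
  },
};

// Built-in, user-defined, and skill-loaded tools would all land in one map.
const registry = new Map<string, AgentTool>();
registry.set(echoTool.name, echoTool);

// What the model sees each turn: everything except the execute function.
function toModelDefinitions(tools: Map<string, AgentTool>) {
  return Array.from(tools.values()).map(({ name, description, inputSchema }) => ({
    name,
    description,
    inputSchema,
  }));
}
```

Because the runner only looks tools up by name, adding a new source of tools means adding entries to the map and nothing else.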

Side-effecting tools such as shell commands go through an isolated executor: timeouts, output length limits, a command blocklist, and environment variables passed in separately rather than by mutating the process's global environment.

The Provider layer: flattening many models into one interface

Users' model preferences are all over the map, and a product can't be tied to a single vendor. Beneath the harness, Orkas lays down a Provider abstraction that unifies models from different vendors behind one interface:

interface LLMProvider {
  readonly id: string;
  complete(params: CompletionParams): Promise<CompletionResult>;
  stream(params: CompletionParams): AsyncIterable<StreamEvent>;
  validateAuth(): Promise<boolean>;
}

The runner above only ever talks to this interface; it has no idea which vendor is behind it. A registry handles routing by model string: an explicit provider/model form is split directly, while a bare model name is assigned to a provider by prefix. Auth (an API key or OAuth token) is also managed here, and an expired OAuth token is refreshed automatically.
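The routing rule is simple enough to sketch. The prefix table below is purely illustrative (the article does not list the real mapping), and `resolveModel` and `ResolvedModel` are hypothetical names.

```typescript
interface ResolvedModel { providerId: string; modelId: string }

// Illustrative prefix table; the real mapping is not given in the article.
const PREFIX_TO_PROVIDER: Array<[string, string]> = [
  ["gpt-", "openai"],
  ["claude-", "anthropic"],
  ["gemini-", "google"],
];

function resolveModel(model: string): ResolvedModel {
  const slash = model.indexOf("/");
  if (slash > 0) {
    // Explicit "provider/model": split on the first slash only, so model
    // IDs that themselves contain slashes survive intact.
    return { providerId: model.slice(0, slash), modelId: model.slice(slash + 1) };
  }
  for (const [prefix, providerId] of PREFIX_TO_PROVIDER) {
    if (model.startsWith(prefix)) return { providerId, modelId: model };
  }
  throw new Error(`No provider registered for model "${model}"`);
}
```

Failing loudly on an unknown bare name, instead of guessing a default provider, is a choice made for this sketch; a real registry might fall back to a user-configured default.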

In flattening many models, the real headache isn't text completion but the corners where vendors' semantics disagree. Two examples that bit us.

One is preserving thinking blocks across vendors. Reasoning models emit a span of "thinking" content; some vendors encrypt it and require you to replay it verbatim, others represent it with a different set of fields. If a user switches from vendor A to vendor B mid-conversation, the signature on that thinking span in the history no longer matches. The fix is to stamp every message in the history with "which model produced this", so the transformation layer can decide whether to keep it verbatim: same model, keep it; different model, downgrade it according to rules.

The other is prompt caching. Across the turns of a session the prefix is highly repetitive, and caching it saves significant cost and latency. The implementation passes the session ID as the cache key to the vendors that support it, handling each vendor's key-length limits along the way (truncating or hashing if it's too long, for example).

All of this is grunt work, but it is precisely this layer of grunt work that lets the runner above pretend that "there is only one kind of model".

Memory: two mechanisms, each with its own job

"Memory" in Orkas is actually two parallel mechanisms that solve two completely different problems. One is a retrieval-based knowledge base, for bulk material you "look up when needed". The other is cross-session memory, for the small set of key facts you "should always keep in mind". Many products blend the two; keeping them separate makes things much clearer.

Knowledge base: hybrid retrieval

The first mechanism targets content that is large but only occasionally relevant: user documents, past notes, domain knowledge. This is a local knowledge base with vector retrieval, in two backends: a lightweight pure in-memory version (for tests and ephemeral use) and a version persisted to a local database (for production, with full-text indexing and vectors).

Data comes in along this path:

documents → line-boundary chunks (with overlap) → dual indexing
                                                  ├─ full-text index (keywords, no embedding cost)
                                                  └─ vector index (if an embedding model is configured)

Chunks are cut on line boundaries with a small overlap between them, to avoid splitting a complete unit of meaning down the middle. Retrieval is hybrid: one vector pass (semantically close) and one keyword pass (literal hits), with the two result sets merged via RRF (Reciprocal Rank Fusion):

score = Σ 1 / (k + rank_i)

The higher a result ranks in a pass, the more it contributes; summed across the two passes, you respect semantic relevance without losing exact literal matches. The vector and keyword weights are tunable and default to favoring semantics. After merging, results are de-duplicated by "(document, start line)", keeping only the best per location; anything below a threshold is then cut, and the top K are returned.
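A weighted RRF merge fits in a short function. In this sketch `k = 60` is the constant commonly used with RRF, while the `Hit` shape, the option names, and the defaults are assumptions for illustration. Contributions for the same (document, start line) are summed under one key, which de-duplicates locations as a side effect; this is a simplification of the "keep the best per location" step described above.

```typescript
interface Hit { docId: string; startLine: number }

// One key per location: used both for fusing and for de-duplication.
const keyOf = (h: Hit): string => `${h.docId}:${h.startLine}`;

function rrfFuse(
  lists: Array<{ hits: Hit[]; weight: number }>,
  opts: { k?: number; minScore?: number; topK?: number } = {},
): Array<Hit & { score: number }> {
  const { k = 60, minScore = 0, topK = 10 } = opts;
  const fused = new Map<string, Hit & { score: number }>();
  for (const { hits, weight } of lists) {
    hits.forEach((hit, index) => {
      const rank = index + 1; // ranks are 1-based
      const entry = fused.get(keyOf(hit)) ?? { ...hit, score: 0 };
      entry.score += weight / (k + rank);
      fused.set(keyOf(hit), entry);
    });
  }
  return Array.from(fused.values())
    .filter((h) => h.score >= minScore) // drop anything below the threshold
    .sort((a, b) => b.score - a.score)  // best first
    .slice(0, topK);
}
```

A result that appears in both passes collects two contributions, so it outranks results that only one pass found, even if neither pass ranked it first.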

Why not rely on vectors alone? Because vector retrieval often stumbles on proper nouns, code symbols, and exact literal strings (queries that aren't semantically special but where the literal matters a lot), while keywords alone can't catch "same meaning, different phrasing". Running both is a very practical trade-off between retrieval quality and cost.

Cross-session memory: keeping the user in mind

The knowledge base solves "too much material to hold". But there's another kind of thing, small in volume yet something that must always be in mind: who this user is, what they prefer, what was agreed last time. These shouldn't depend on retrieval "getting lucky with recall"; they should be present on every turn.

For this, Orkas builds a separate cross-session memory layer, split by content into two parts:

User profile: stable facts about the person, such as role, preferences, communication style, and tech stack.
Fact notes: durable facts about the work, such as decisions, milestones, and project conventions.

Both are small, each with a hard cap of a few thousand characters, which forces them to keep only what is genuinely useful long-term. They don't go through retrieval; instead they are frozen directly into the system prompt at the start of every turn, which means the agent simply "knows" these things without having to remember to look them up. That is exactly the opposite stance from the knowledge base: the knowledge base is "fetched only when needed, then gone", while cross-session memory is "always present, always visible".

Writes go through a dedicated memory tool that the model calls when it judges, mid-conversation, that "this is worth remembering long-term"; the tool supports add, substring replace, and delete. What to save and what not to save is spelled out in the tool description: user corrections and preferences are top priority; durable decisions and conventions get saved; the transient state of the current task, one-off debugging information, and anything easily rediscovered do not. Memory is for "durable facts about the user and the project", not for "where I've gotten to right now".

There's an easy-to-miss but very important detail: a safety scan runs before every write. This content enters the system prompt verbatim and persists across sessions for a long time, which makes it effectively a durable injection surface. So every memory about to be written to disk is first scanned for suspicious patterns, such as classic prompt-injection phrasing ("ignore all previous instructions" and the like), commands that try to exfiltrate keys, and invisible unicode characters hidden in the text, and a match is rejected outright. Together with de-duplication and automatic trimming when over the cap, this keeps the memory layer useful without letting it become a risk.
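A pre-write scan of this kind could be sketched as a small list of patterns. The patterns below are deliberately narrow examples and not Orkas's real rule set; `SUSPICIOUS_PATTERNS`, `ScanResult`, and `scanMemoryWrite` are hypothetical names.

```typescript
// Illustrative patterns only; a production list would be broader and tuned.
const SUSPICIOUS_PATTERNS: Array<{ reason: string; pattern: RegExp }> = [
  {
    reason: "prompt injection phrasing",
    pattern: /ignore (all )?(previous|prior|above) instructions/i,
  },
  {
    reason: "possible secret exfiltration",
    pattern: /(curl|wget)\b.*(api[_-]?key|token|secret)/i,
  },
  {
    // Zero-width characters, bidi controls, and the byte-order mark.
    reason: "invisible unicode",
    pattern: /[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufeff]/,
  },
];

type ScanResult = { ok: true } | { ok: false; reason: string };

// Returns the first matching reason, or ok when nothing suspicious is found.
function scanMemoryWrite(content: string): ScanResult {
  for (const { reason, pattern } of SUSPICIOUS_PATTERNS) {
    if (pattern.test(content)) return { ok: false, reason };
  }
  return { ok: true };
}
```

Rejecting the write outright, rather than trying to sanitize it, keeps the check simple: the model can always rephrase and call the memory tool again.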

Together, the two mechanisms cover both ends, "huge but occasional" and "small but constant": the knowledge base handles the former and cross-session memory the latter. Add the agent's understanding of itself (the subject of the next article), and an Orkas agent walks in carrying three kinds of memory at once: about the material, about the user, and about itself.

Sessions: built to crash, built to heal

A session manages the message history. The basic version is just an in-memory array of messages with history trimming and compaction. But anything running on a user's machine must assume it can be killed at any moment: the user quits the app, the system reboots, a watchdog timeout kills the process. So production uses a persistent session, written to a local JSONL file, one message per line.

There are two write strategies: appending a new message uses an atomic append; anything that rewrites the whole file (compaction, clearing) uses "write a temp file + atomic rename". That way, even if power is cut mid-write, you never leave half a corrupted record behind.

The most interesting part is healing orphan tool calls. Back to the pairing invariant: the model makes a tool call, the harness executes it, the result is written back. Interrupt between any of those three steps and you leave an orphan on disk, "a call with no result". Load that session next time and send it as-is to the model, and the API will either reject it or hang.

The healing logic runs every time a session is loaded from disk, and it is idempotent:

Scan all assistant messages and collect the tool-call IDs they issued.
Look ahead for the matching tool results.
For any call with no matching result, synthesize a result marked "interrupted".
Along the way, align the order of results with the order in which the calls were declared, and drop any orphan results that have no matching call.
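The four steps above can be sketched as a single pass over the history. This is an illustration under assumptions, not the shipped implementation: the `ChatMessage` shape, the `healSession` name, and the "[interrupted]" marker text are all hypothetical.

```typescript
// Hypothetical, simplified message shape for illustration only.
type ChatMessage =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCallIds: string[] }
  | { role: "tool_result"; toolCallId: string; text: string };

function healSession(history: ChatMessage[]): ChatMessage[] {
  const healed: ChatMessage[] = [];
  let i = 0;
  while (i < history.length) {
    const msg = history[i];
    i += 1;
    if (msg.role === "tool_result") continue; // orphan result: no call claimed it
    healed.push(msg);
    if (msg.role !== "assistant" || msg.toolCallIds.length === 0) continue;

    // Collect the results that directly follow this assistant message.
    const declared = new Set(msg.toolCallIds);
    const found = new Map<string, ChatMessage>();
    for (; i < history.length; i += 1) {
      const next = history[i];
      if (next.role !== "tool_result") break;
      // Results matching no declared call (or duplicates) are dropped.
      if (declared.has(next.toolCallId) && !found.has(next.toolCallId)) {
        found.set(next.toolCallId, next);
      }
    }
    // Re-emit in declaration order, synthesizing a result where one is missing.
    for (const id of msg.toolCallIds) {
      healed.push(found.get(id) ?? { role: "tool_result", toolCallId: id, text: "[interrupted]" });
    }
  }
  return healed;
}
```

Running the function on its own output changes nothing, which is the idempotence property the load path relies on.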

After this pass, the session is guaranteed to be in a state that satisfies the API's pairing requirement and is safe to send. The mechanism looks unremarkable, but it is the safety net behind "a user's conversation doesn't get permanently stuck just because of one crash".

A few decisions that mattered in hindsight

Pulling all of this together, a few decisions look especially valuable after the fact.

Generators as the primary interface. Streaming and non-streaming share one implementation, intermediate state surfaces naturally, and the UI can paint as much detail as it wants. This avoided a whole class of inconsistency bugs that "implement non-streaming first and bolt on streaming later" would have created.

Compact at 60%, not when full. This leaves headroom for the compaction itself (which also costs a model call) and avoids a last-minute scramble.

The pairing invariant runs through everything. From the compaction cut point to writing to disk to healing at load time, every place that touches the session upholds the same rule. With one rule, no spot needs to invent its own patch-up logic.

Grunt work concentrated in the Provider layer. All the cross-vendor awkwardness (thinking blocks, cache keys, capability differences) is digested in this layer, in exchange for a clean runner above. Add a new model vendor someday and the change barely spreads.

Wrapping up

Orkas's harness contains no dazzling algorithm. Its value lies in taking "make an agent run reliably in a real environment" and splitting it into a set of modules with clear boundaries, each owning one slice: the runner owns the loop and retries, tools own capability, the Provider layer owns flattening many models, memory owns retrieval, and the session owns persistence and healing. None of them is complex on its own; only together do they hold up something people use every day.

If there's anything to take away: make the run loop a streaming generator and intermediate state becomes much easier to surface; once a core invariant is set (such as "tool calls must be paired"), uphold it consistently across compaction, disk writes, and loading, and let no corner be the exception; concentrate cross-vendor grunt work in one layer and keep it out of business logic; and, plainest of all, assume your process will be killed at the worst possible moment and write the healing for that moment in advance.

The next article covers a more interesting part of Orkas: how this agent learns from its own use, turns experience into reusable skills, and slowly becomes more useful.

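Anyone who has shipped an agent product knows the feeling: getting a demo working is fast, but turning it into something a user trusts every day, something that runs all day long on their own machine without falling over, is hard. And the hard part isn't wiring up the model. It's the whole layer that sits around the model.

That layer goes by a few names; the one used here is the Agent Harness. It sits between the "big model" and the "product features", and it's the real runtime: it turns a single user request into round after round of conversation with the model, threading tool calls in between, feeding results back, compacting context before it overflows, retrying through network hiccups, and recovering the conversation even after the process crashes. The model does the thinking; the harness makes that thinking land as a reliable sequence of actions.

Orkas is a desktop agent app that runs on the user's own machine, and its harness lives entirely on the client. This article walks through how that layer is built: how it's split into layers, what the run loop looks like, how tools and models are abstracted, and how memory and sessions are handled. Code details have been scrubbed and generalized, but the engineering structure is real.

The layers first

Spread an agent product out and it is roughly these layers, stacked from the bottom up: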

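┌──────────────────────────────────────────────────────────┐
│  Product features  (chat / skills / connectors / sync)   │
├──────────────────────────────────────────────────────────┤
│  Agent Harness  (run loop / tools / session)             │
├──────────────────────────────────────────────────────────┤
│  Provider abstraction  (unifies many LLM vendors)        │
├──────────────────────────────────────────────────────────┤
│  Infrastructure  (types / errors / logging / config)     │
└──────────────────────────────────────────────────────────┘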

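One far-reaching trade-off lives here: all model inference calls originate on the client. The desktop app is not a thin client; it holds the harness itself and calls the model directly. The server only handles accounts, multi-device sync, and billing, and does not run the agent. This decision shapes almost everything that follows: sessions are persisted to the local disk, tools operate directly on the user's working directory, and sensitive data doesn't leave the machine.

Inside, the harness splits into several pieces: the run loop (runner), the session, the tools, the Provider layer, and memory. The sections below take them one at a time.

The run loop: a streaming generator

The heart of the harness is the runner. In one sentence: it talks to the model over and over until the model says "I'm done."

It is implemented as an async generator, and that choice matters. An agent run is far from "send a request, wait for the result". A lot happens in between: the model is emitting text, it wants to call a tool, the tool finished, the context got long and triggered compaction, the network failed and a retry is under way. With callbacks or Promises, it's hard to surface these intermediate states cleanly to the caller. With a generator, they all become a stream of yielded events: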

type AgentRunEvent =
  | { type: "text_delta"; text: string }              // the model is streaming text
  | { type: "tool_start"; name: string; input: unknown } // a tool is starting to run
  | { type: "tool_end"; name: string; result: string }   // a tool finished
  | { type: "compaction"; tokensBefore: number; tokensAfter: number } // context compaction was triggered
  | { type: "retry"; attempt: number; reason: string }   // an error occurred and a retry is under way
  | { type: "done"; result: AgentRunResult }             // terminal state

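The UI layer subscribes to this event stream and can paint the model's output and the tool execution on screen in real time. The non-streaming entry point simply consumes the stream internally and keeps only the final done event, so both entry points share one implementation and there are no two copies to drift apart.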
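The relationship between the two entry points can be shown with a toy runner. This is only a sketch: `runStream` stands in for the real run loop and just echoes its input, the event type is trimmed to two variants, and `AgentRunResult` is reduced to a single assumed field.

```typescript
interface AgentRunResult { text: string }

type AgentRunEvent =
  | { type: "text_delta"; text: string }
  | { type: "done"; result: AgentRunResult };

// Stand-in for the real runner: emits a few deltas, then the terminal event.
async function* runStream(prompt: string): AsyncGenerator<AgentRunEvent> {
  let text = "";
  for (const word of ["echo:", prompt]) {
    text += (text ? " " : "") + word;
    yield { type: "text_delta", text: word };
  }
  yield { type: "done", result: { text } };
}

// Non-streaming entry point: drain the same stream and keep only `done`.
async function run(prompt: string): Promise<AgentRunResult> {
  for await (const event of runStream(prompt)) {
    if (event.type === "done") return event.result;
  }
  throw new Error("stream ended without a done event");
}
```

A UI would iterate `runStream` directly and render each event as it arrives, while a script that only needs the answer would call `run`.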

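What happens inside a turn

Unrolled, one turn goes roughly like this:

1. Push the user message (possibly with images) into the session history.
2. Assemble the system prompt, injecting the currently available tools, the skill index, and so on.
3. Parse the model string and resolve it to a concrete Provider and model ID.
4. Convert all tools into definitions the model understands and send them along with the history.
5. Consume the stream the model returns, yielding text as it arrives while collecting the tool calls the model issues.
6. When the stream ends, look at the model's stop reason:

If it is tool_use, the model wants to call tools: enter the tool-execution stage, then go back to step 5 and ask the model again.
Otherwise the turn is over: assemble the result, yield done, and return.

There is one invariant here that must hold: for every tool call the model issues, a matching tool result must follow right after it in the history. Model APIs enforce this pairing strictly, and without it the next request either errors out or simply hangs. We'll come back to this when we get to session self-healing.

How tool calls get routed back

The model doesn't execute tools itself; it only says "I want to call read_file with these arguments." When the runner receives that intent: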

for (const call of toolUseBlocks) {
  yield { type: "tool_start", name: call.name, input: call.input };

  const tool = this.tools.get(call.name);
  const ctx = { workingDir, signal, state: { sandboxEnv } };
  const result = await tool.execute(call.input, ctx);

  // append the result to the session as a tool-result message
  session.addToolResult(call.id, result);

  yield { type: "tool_end", name: call.name, result: result.content };
}

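Tools run sequentially, results are written back to the history in the order the model declared them, and then the model is asked again with those results in hand. Having seen the results, the model may call another tool or simply give the final answer. This "ask → call → answer → ask again" loop is what lets an agent complete multi-step tasks.

One detail deserves a callout: some tools return images, such as screenshots or generated images, but many models don't accept images in the tool-result channel. Orkas handles this by splitting the image into a separate user message placed right after the tool result: the model first reads "the tool returned this text" and then, on the next turn, sees the corresponding image. It's a small compromise that sidesteps capability differences between providers.

What to do when the context is about to overflow

The wall that long tasks hit most often is the context window. Orkas doesn't wait until it's full. It sets a 60% watermark: after each round of tools, it estimates how much of the window the current tokens occupy, and once that passes 60% it proactively triggers compaction.

Compaction itself asks the model to summarize the earlier conversation, then replaces the old messages with that summary, keeping only the most recent few turns at the tail. It sounds simple, but there's a trap: after the swap, the retained tail must not start with an "orphan tool result", that is, a result with no matching call, or the pairing invariant is broken again. So the compaction logic makes sure the cut lands on a clean boundary.

There's a more interesting choice worth unpacking here: why the coarse approach of "summarize the whole block at 60%" rather than something finer-grained, such as scoring each message and trimming by importance, doing structured extraction on tool outputs, or maintaining a layered memory tree? Those approaches look great in papers, but we deliberately didn't go down that path, for three reasons.

First, caching. The model's prompt cache hits by prefix: as long as the prefix of the history stays unchanged, that span hits the cache, saving money and latency. Fine-grained compaction constantly rewrites the middle of the history, which means repeatedly breaking the cached prefix, and every edit forces a large re-prefill. The "leave it alone, then compact once at the watermark" strategy keeps the prefix stable on the vast majority of turns, with only the compaction itself invalidating it once. That is far friendlier to the cache.

Second, complexity. The invariant we keep hammering on, "every tool call must be paired", gets easier to break in some corner the more finely you trim the history. A coarse summary only has to protect one clean cut point, so there are an order of magnitude fewer places to get it wrong. One fewer class of edge case is one fewer class of production incident.

Third, riding the dividend of better models. Context windows have grown steadily over the past couple of years, and models keep getting better at handling long context. Investing effort today in an elaborate compaction algorithm is essentially fighting a shrinking problem: you may well finish tuning it just as the next generation doubles its window, and your complexity becomes pure liability. Handing the summary to the model itself, by contrast, gets better automatically as the model improves: the better it is at picking out what matters, the higher the summary quality, and we don't change a line. Complexity the model can carry for you is complexity you shouldn't carry yourself.

Token estimation hides an easy-to-miss problem: Chinese. Estimate Chinese using English intuitions (roughly one token per few characters) and you'll badly underestimate. Orkas weights CJK characters separately in its estimate; otherwise the watermark for an all-Chinese conversation would be consistently off and compaction wouldn't trigger when it should.

Errors and retries

Running on the user's machine and depending on an external model API, errors are the norm, not the exception. The runner sorts them into a few classes and handles each differently:

Retryable: rate limits, timeouts, dropped connections, 5xx. Exponential backoff with jitter, capped at 30 seconds; if it's a rate limit and the server sent retry-after, honor it.
Non-retryable: things like authentication failures. No number of retries will help, so surface the error immediately.
Special: context overflow. Try compacting first, retry once afterward, and only error out if it still fails.

There's one more class: "the tool itself failed." This doesn't sink the whole turn. A tool failure is itself information for the model, which, seeing "that command errored", can perfectly well try a different approach. The harness distinguishes these transient tool errors from real failures: it neither interrupts the flow nor loses them, and they show up in later statistics. (That data later feeds the self-evolution mechanism, which is the subject of the next article.)

The external cancellation signal (AbortSignal) is checked at every key point. When the user clicks "stop", the current turn stops immediately and no new retry is started.

Tool abstraction: simple enough to extend

The tool interface is deliberately thin: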

interface AgentTool {
  readonly name: string;
  readonly description: string;          // shown to the model
  readonly inputSchema: Record<string, unknown>;  // JSON Schema constraining inputs
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

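A tool is just "a name + a description for the model + an input schema + an execute function". The built-ins (read file, write file, run a shell command, web search and fetch) all implement this interface. The desktop layer stacks a batch of locally flavored tools on top (knowledge-base search, image generation, calling external connectors), but the interface is the same.

The upside of a thin interface is that where a tool comes from is irrelevant to the runner: built-in, user-defined, or loaded from a skill, they are all the same kind of thing once they reach the runner, registered in a single Map<string, AgentTool> and converted into model-readable definitions every turn.

Side-effecting tools such as shell commands go through an isolated executor: timeouts, output length limits, a command blocklist, and environment variables passed in separately rather than by mutating the process's global environment. The latter would leak into a pile of child processes, and in a multi-process architecture like Electron that can easily break startup.

The Provider layer: flattening many models into one interface

Users' model preferences are all over the map, and a product can't be tied to a single vendor. Beneath the harness, Orkas lays down a Provider abstraction that unifies models from different vendors behind one interface: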

interface LLMProvider {
  readonly id: string;
  complete(params: CompletionParams): Promise<CompletionResult>;
  stream(params: CompletionParams): AsyncIterable<StreamEvent>;
  validateAuth(): Promise<boolean>;
}

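The runner above only ever talks to this interface; it has no idea which vendor is behind it. A registry handles routing by model string: an explicit provider/model form is split directly, while a bare model name is assigned to a provider by prefix. Auth (an API key or OAuth token) is also managed here, and an expired OAuth token is refreshed automatically.

In flattening many models, the real headache isn't text completion but the corners where vendors' semantics disagree. Two examples that bit us.

One is preserving thinking blocks across vendors. Reasoning models emit a span of "thinking" content; some vendors encrypt it and require you to replay it verbatim, others represent it with a different set of fields. If a user switches from vendor A to vendor B mid-conversation, the signature on that thinking span in the history no longer matches. The fix is to stamp every message in the history with "which model produced this", so the transformation layer can decide whether to keep it verbatim: same model, keep it; different model, downgrade it according to rules.

The other is prompt caching. Across the turns of a session the prefix is highly repetitive, and caching it saves significant cost and latency. The implementation passes the session ID as the cache key to the vendors that support it, handling each vendor's key-length limits along the way (truncating or hashing if it's too long, for example).

All of this is grunt work, but it is precisely this layer of grunt work that lets the runner above pretend that "there is only one kind of model".

Memory: two mechanisms, each with its own job

"Memory" in Orkas is actually two parallel mechanisms that solve two completely different problems. One is a retrieval-based knowledge base, for bulk material you "look up when needed". The other is cross-session memory, for the small set of key facts you "should always keep in mind". Many products blend the two; keeping them separate makes things much clearer.

Knowledge base: hybrid retrieval

The first mechanism targets content that is large but only occasionally relevant: user documents, past notes, domain knowledge. This is a local knowledge base with vector retrieval, in two backends: a lightweight pure in-memory version (for tests and ephemeral use) and a version persisted to a local database (for production, with full-text indexing and vectors).

Data comes in along this path:

documents → line-boundary chunks (with overlap) → dual indexing
                                                  ├─ full-text index (keywords, no embedding cost)
                                                  └─ vector index (if an embedding model is configured)

Chunks are cut on line boundaries with a small overlap between them, to avoid splitting a complete unit of meaning down the middle. Retrieval is hybrid: one vector pass (semantically close) and one keyword pass (literal hits), with the two result sets merged via RRF (Reciprocal Rank Fusion):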

score = Σ  1 / (k + rank_i)

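The higher a result ranks in a pass, the more it contributes; summed across the two passes, you respect semantic relevance without losing exact literal matches. The vector and keyword weights are tunable and default to favoring semantics. After merging, results are de-duplicated by "document + start line", keeping only the best per location; anything below a threshold is then cut, and the top K are returned.

Why not rely on vectors alone? Because vector retrieval often stumbles on proper nouns, code symbols, and exact literal strings (queries that aren't semantically special but where the literal matters a lot), while keywords alone can't catch "same meaning, different phrasing". Running both is a very practical trade-off between retrieval quality and cost.

Cross-session memory: keeping the user in mind

The knowledge base solves "too much material to hold". But there's another kind of thing, small in volume yet something that must always be in mind: who this user is, what they prefer, what was agreed last time. These shouldn't depend on retrieval "getting lucky with recall"; they should be present on every turn.

For this, Orkas builds a separate cross-session memory layer, split by content into two parts:

User profile: stable facts about the person, such as role, preferences, communication style, and tech stack.
Fact notes: durable facts about the work, such as decisions, milestones, and project conventions.

Both are small, each with a hard cap of a few thousand characters, which forces them to keep only what is genuinely useful long-term. They don't go through retrieval; instead they are frozen directly into the system prompt at the start of every turn, which means the agent simply "knows" these things without having to remember to look them up. That is exactly the opposite stance from the knowledge base: the knowledge base is "fetched only when needed, then gone", while cross-session memory is "always present, always visible".

Writes go through a dedicated memory tool that the model calls when it judges, mid-conversation, that "this is worth remembering long-term"; the tool supports add, substring replace, and delete. What to save and what not to save is spelled out in the tool description: user corrections and preferences are top priority; durable decisions and conventions get saved; the transient state of the current task, one-off debugging information, and anything easily rediscovered do not. Memory is for "durable facts about the user and the project", not for "where I've gotten to right now".

There's an easy-to-miss but very important detail: a safety scan runs before every write. This content enters the system prompt verbatim and persists across sessions for a long time, which makes it effectively a durable injection surface. So every memory about to be written to disk is first scanned for suspicious patterns, such as classic prompt-injection phrasing ("ignore all previous instructions" and the like), commands that try to steal keys, and invisible unicode characters hidden in the text, and a match is rejected outright. Together with de-duplication and automatic trimming when over the cap, this keeps the memory layer useful without letting it become a risk.

Together, the two mechanisms cover both ends, "huge but occasional" and "small but constant": the knowledge base handles the former and cross-session memory the latter. Add the agent's understanding of itself (the subject of the next article), and an Orkas agent walks in carrying three kinds of memory at once: about the material, about the user, and about itself.

Sessions: built to crash, built to heal

A session manages the message history. The basic version is just an in-memory array of messages with history trimming and compaction. But anything running on a user's machine must assume it can be killed at any moment: the user quits the app, the system reboots, a watchdog timeout kills the process. So production uses a persistent session, written to a local JSONL file, one message per line.

There are two write strategies: appending a new message uses an atomic append; anything that rewrites the whole file (compaction, clearing) uses "write a temp file + atomic rename". That way, even if power is cut mid-write, you never leave half a corrupted record behind.

The most interesting part is healing orphan tool calls. Back to the pairing invariant: the model makes a tool call, the harness executes it, the result is written back. Interrupt between any of those three steps and you leave an orphan on disk, "a call with no result". Load that session next time and send it as-is to the model, and the API will either reject it or hang.

The healing logic runs every time a session is loaded from disk, and it is idempotent:

Scan all assistant messages and collect the tool-call IDs they issued.
Look ahead for the matching tool results.
For any call with no matching result, add a synthesized result whose content is marked "interrupted".
Along the way, align the order of results with the order in which the calls were declared, and drop any orphan results that have no matching call.

After this pass, the session is guaranteed to be in a state that satisfies the API's pairing requirement and is safe to send. The mechanism looks unremarkable, but it is the safety net behind "a user's conversation doesn't get permanently stuck just because of one crash".

A few decisions that mattered in hindsight

Pulling all of this together, a few decisions look especially valuable after the fact.

Generators as the primary interface. Streaming and non-streaming share one implementation, intermediate state surfaces naturally, and the UI can paint as much detail as it wants. Compared with "implement non-streaming first and add a separate streaming path later", this avoided a whole class of inconsistency bugs.

Compact at 60%, not when full. This leaves headroom for the compaction itself (which also costs a model call) and avoids a last-minute scramble.

The pairing invariant runs through everything. From the compaction cut point to writing to disk to healing at load time, every place that modifies the session upholds the same rule. With one rule, no spot needs to invent its own patch-up logic.

Grunt work concentrated in the Provider layer. All the cross-vendor awkwardness (thinking blocks, cache keys, capability differences) is digested in this layer, in exchange for a clean runner above. Add a new model vendor someday and the change barely spreads.

Wrapping up

Orkas's harness contains no dazzling algorithm. Its value lies in taking "make an agent run reliably in a real environment" and splitting it into a set of modules with clear boundaries, each owning one slice: the runner owns the loop and retries, tools own capability, the Provider layer owns flattening many models, memory owns retrieval, and the session owns persistence and healing. None of them is complex on its own; only together do they hold up something people can use every day.

If there's anything to take away: make the run loop a streaming generator and intermediate state becomes much easier to surface; once a core invariant is set (such as "tool calls must be paired"), uphold it consistently across compaction, disk writes, and loading, and let no corner be the exception; keep cross-vendor grunt work as concentrated as possible, because once it seeps into business logic it is hard to clean up. And the plainest lesson of all: assume your process will be killed at the worst possible moment, and write the healing in advance.

The next article continues with a more interesting part of Orkas: how this agent learns from its own use, distills experience into reusable skills, and slowly makes itself more useful.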

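Anyone who has shipped an agent product knows the feeling right away. Getting a demo running is fast. But turning it into something users trust every day, something that keeps running all day on their own machine without falling over, is hard. The hard part isn't connecting to the model. It's the whole layer around the model.

Here that layer is called the Agent Harness. The harness is the runtime that sits between the "big model" and the "product features". It turns a single user request into multiple rounds of conversation with the model, threads tool calls in between, feeds the results back, compacts the context before it overflows, retries through network hiccups, and recovers the conversation even if the process dies. The model thinks; the harness turns that thinking into a reliable sequence of actions.

Orkas is a desktop agent app that runs on the user's own machine, and its harness lives entirely on the client side. This article explains how that layer is divided, how the run loop works, how tools and models are abstracted, and how memory and sessions are handled.

Layer structure

Laid out flat, an agent product consists of roughly the following layers.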

┌──────────────────────────────────────────────────────────┐
│  Product features  (chat / skills / connectors / sync)   │
├──────────────────────────────────────────────────────────┤
│  Agent Harness  (run loop / tools / session)             │
├──────────────────────────────────────────────────────────┤
│  Provider abstraction  (unifies many LLM vendors)        │
├──────────────────────────────────────────────────────────┤
│  Infrastructure  (types / errors / logging / config)     │
└──────────────────────────────────────────────────────────┘

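There is a major design decision here: model inference calls are made directly from the client. The desktop app is not a thin client; it holds the harness itself and calls the model provider directly. The server handles accounts, multi-device sync, and billing, but does not run the agent. As a result of this decision, sessions live on the local disk, tools work directly on the user's working directory, and sensitive data is less likely to leave the machine.

The harness is divided into the runner, session, tools, Provider layer, and memory. We'll go through them in order.

Run loop: streaming generator

The center of the harness is the runner. In one sentence, it talks to the model repeatedly until the model says "I'm done".

The implementation is an async generator, and this choice is important. An agent run is not just "send a request and wait for the result". Along the way the model emits tokens, calls a tool, the tool finishes, the context grows long and compaction happens, and a network error triggers a retry. With a generator, all of these can be surfaced to the UI as a stream of yielded events.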

type AgentRunEvent =
  | { type: "text_delta"; text: string }
  | { type: "tool_start"; name: string; input: unknown }
  | { type: "tool_end"; name: string; result: string }
  | { type: "compaction"; tokensBefore: number; tokensAfter: number }
  | { type: "retry"; attempt: number; reason: string }
  | { type: "done"; result: AgentRunResult }

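The UI subscribes to this stream and renders model output and tool execution in real time. The non-streaming entry point internally just consumes the stream to the end and extracts done, so the two code paths never drift apart.

What happens within one turn

Unrolled, a single turn goes roughly like this.

Put the user message, plus any images, into the session history.
Assemble the system prompt, injecting the available tools, the skill index, and so on.
Parse the model string and resolve it to a concrete Provider and model ID.
Convert the tools into definitions the model understands and send them together with the history.
Consume the model response stream, emitting text tokens while collecting tool calls.
When the stream ends, look at the stop reason.

If it is tool_use, execute the tools, put the results into the history, and ask the model again.
Otherwise the turn is over: assemble the result and return done.

There is an invariant to uphold here. Every tool call the model makes must be followed immediately in the history by its matching tool result. If this pairing breaks, the next model request either errors out or hangs. We'll return to this in the section on session self-healing.

Tool call routing

The model does not execute tools itself. It only says "I want to call read_file with these arguments". When the runner picks up that intent, it looks up the tool, passes it a context carrying the working directory and the cancellation signal, and appends the result to the session as a tool-result message.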

for (const call of toolUseBlocks) {
  yield { type: "tool_start", name: call.name, input: call.input };

  const tool = this.tools.get(call.name);
  const ctx = { workingDir, signal, state: { sandboxEnv } };
  const result = await tool.execute(call.input, ctx);

  session.addToolResult(call.id, result);

  yield { type: "tool_end", name: call.name, result: result.content };
}

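Tools run in sequence, and results go back into the history in the order the model declared them. Having read those results, the model can call the next tool or return the final answer. This "ask → call → return the result → ask again" loop is the foundation of an agent that carries out multi-step tasks.

Tools that return images need a little extra care. Many models cannot accept images in the tool-result channel. Orkas places the image in a separate user message after the tool result. The model first reads the tool's text result and then sees the corresponding image in the next message. It's a small compromise that absorbs the differences between providers.

When the context is about to overflow

The wall that long tasks hit most often is the context window. Orkas doesn't wait until it's full. After each tool round it estimates token usage, and once that exceeds 60% of the window it compacts proactively.

Compaction has the model summarize the earlier part of the conversation, replaces the old messages with that summary, and keeps only the recent tail. The tail, however, must not start with an orphan tool result. A result left without its matching call would break the pairing invariant described earlier. So the cut point is aligned to a clean boundary.

Why not use finer-grained importance scoring or a memory tree? There are three reasons. The first is the prompt cache. With many providers, the cache works better the less the prefix changes. Fine-grained compaction rewrites the middle of the history again and again and breaks the cache. Leaving the history untouched until 60% and then compacting once, in bulk, is friendlier to the cache.

The second is complexity. The pairing of tool call and tool result must be upheld without exception. The more finely you trim, the more likely you are to break this invariant somewhere. With a coarse summary, protecting a single clean cut is enough.

The third is model progress. Context windows grow year by year, and long-context handling keeps improving. Rather than loading the complexity onto an elaborate compaction algorithm, leaving the summary to the model lets you benefit directly from the model's progress.

CJK also matters for token estimation. Estimating Chinese or Japanese with English intuitions is far off. Orkas treats CJK characters separately so that the watermark stays close to reality.

Errors and retries

Because the harness runs on the user's machine and depends on an external model API, errors are everyday events, not exceptions. The runner divides errors into a few classes.

Retryable: rate limits, timeouts, connection drops, 5xx. Use exponential backoff with jitter, and honor retry-after when present.
Non-retryable: auth failures and the like, which retrying won't fix, are surfaced as errors right away.
Special: context overflow. Try compaction first, then retry once.

Failures of the tool itself are handled separately. A tool failure is information for the model: seeing "this command failed", it can try another approach. Rather than treating it as a failure of the whole turn and stopping, the harness keeps it in the flow and also reflects it in later stats.

The AbortSignal is checked at each key point. When the user presses stop, the current turn stops immediately and no new retry is started.

Tool abstraction: thin enough to extend

The tool interface is intentionally thin.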

interface AgentTool {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: Record<string, unknown>;
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

A tool is a name, a description for the model, an input schema, and an execute function. Built-ins such as read file, write file, shell command, and web search all take this shape. The desktop layer adds knowledge-base search, image generation, external connectors, and more on top, but to the runner they all look the same.

The benefit of a thin interface is that the runner does not need to care where a tool came from. Built-in, user-defined, or loaded from a skill, every tool is registered in a Map<string, AgentTool> and converted into model-readable definitions on every turn.
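A registry along those lines might look like the sketch below. `ToolContext`, `ToolResult`, `ToolDefinition`, and the `ToolRegistry` class are hypothetical shapes invented for this example; only the `AgentTool` interface and the `Map<string, AgentTool>` come from the article. The `run` method also shows one way a thrown tool error can be turned into a result the model can read.

```typescript
interface ToolContext { cwd: string }                      // hypothetical
interface ToolResult { text: string; isError?: boolean }   // hypothetical

interface AgentTool {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: Record<string, unknown>;
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

// Hypothetical model-facing definition; each provider has its own wire format.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

class ToolRegistry {
  private readonly tools = new Map<string, AgentTool>();

  register(tool: AgentTool): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  // Rebuilt on every turn, so newly registered tools show up immediately.
  definitions(): ToolDefinition[] {
    return Array.from(this.tools.values(), (t) => ({
      name: t.name,
      description: t.description,
      parameters: t.inputSchema,
    }));
  }

  // Failures become results instead of exceptions, so the model can react.
  async run(name: string, input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult> {
    const tool = this.tools.get(name);
    if (!tool) return { text: `unknown tool: ${name}`, isError: true };
    try {
      return await tool.execute(input, ctx);
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      return { text: `tool failed: ${message}`, isError: true };
    }
  }
}
```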

Tools with side effects, such as shell commands, go through an isolated executor. It has a timeout, an output length limit, a command blocklist, and its own environment, and it does not mutate the process's global environment. In a multi-process architecture like Electron, this isolation also helps startup stability.

Provider layer: many models behind one interface

Users' model preferences vary widely, and the product cannot be tied to a single vendor. Orkas places a Provider abstraction beneath the harness and brings each vendor under a common interface.

interface LLMProvider {
  readonly id: string;
  complete(params: CompletionParams): Promise<CompletionResult>;
  stream(params: CompletionParams): AsyncIterable<StreamEvent>;
  validateAuth(): Promise<boolean>;
}

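The runner sees only this interface. If the model string has the form provider/model, it is split explicitly; if it is a bare model name, the provider is inferred from cues such as its prefix. API keys and OAuth tokens are also handled in this layer, and expired tokens are refreshed automatically.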
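The model-string rule can be illustrated with a small resolver. The prefix table, the provider IDs, and the fallback name are assumptions made for this sketch; the article does not list the actual inference rules.

```typescript
interface ModelRef {
  provider: string;
  model: string;
}

// Hypothetical prefix hints, checked in order.
const PREFIX_HINTS: Array<[string, string]> = [
  ["gpt-", "openai"],
  ["claude-", "anthropic"],
  ["gemini-", "google"],
];

function resolveModel(spec: string, fallbackProvider = "default"): ModelRef {
  // Explicit form: everything before the first slash is the provider.
  const slash = spec.indexOf("/");
  if (slash > 0) {
    return { provider: spec.slice(0, slash), model: spec.slice(slash + 1) };
  }
  // Bare model name: infer the provider from a known prefix.
  const hit = PREFIX_HINTS.find(([prefix]) => spec.startsWith(prefix));
  return { provider: hit ? hit[1] : fallbackProvider, model: spec };
}
```

Splitting on the first slash only means a model ID that itself contains slashes survives intact in the `model` field.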

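The hard part is not text completion itself but the semantic differences between vendors. Reasoning models' thinking blocks follow different encryption and resend rules depending on the vendor. Length limits on prompt cache keys differ too. Collecting these unglamorous differences in the Provider layer lets the runner above it behave as if there were only one kind of model.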

Memory: two mechanisms for two purposes

Orkas does not have a single memory. It separates a retrieval-based knowledge base, which searches a large body of material when needed, from a cross-session memory, which is small but always present.

Knowledge base: hybrid retrieval

The knowledge base handles information that is large but not needed every time, such as user documents, past notes, and domain knowledge. Documents are split into chunks at line boundaries and fed into two indexes: a keyword index and a vector index.

documents → chunking at line boundaries (with overlap) → dual indexing
                                                         ├─ full-text index
                                                         └─ vector index

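Retrieval is hybrid. It runs a vector pass, which captures semantic similarity, and a keyword pass, which is strong on literal hits, and merges the two with RRF (reciprocal rank fusion). Proper nouns and code symbols can be missed by vectors alone, while keywords alone struggle to catch paraphrases. Combining both is the practical compromise.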
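Reciprocal rank fusion itself is short enough to show in full. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1. The constant k = 60 is a commonly used default, assumed here rather than taken from Orkas.

```typescript
// Merge several ranked lists of document IDs into one ranking.
// Only ranks are used, so scores from different retrievers never need
// to be put on a common scale.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage: "a" and "c" appear in both passes, so they outrank "b" and "d".
const vectorHits = ["a", "b", "c"];
const keywordHits = ["c", "a", "d"];
const merged = reciprocalRankFusion([vectorHits, keywordHits]);
// merged is ["a", "c", "b", "d"]
```

A chunk that both passes agree on rises to the top even if neither pass ranked it first, which is the behavior hybrid retrieval is after.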

Cross-session memory

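By contrast, small pieces of information with long-lasting value, such as the user's preferences, role, and project decisions, are not left to retrieval. They are kept as a small memory that goes into the system prompt on every turn. It is split into a user profile and fact notes, is capped in length, and keeps only what will be useful for a long time.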

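Writes go through a dedicated memory tool. Its tool description states explicitly what may and may not be remembered. In addition, because memory ends up in future system prompts, entries are scanned before saving for strings that invite prompt injection or secret exfiltration, invisible Unicode, and similar content. This keeps a convenient memory from becoming a persistent attack surface.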
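A pre-save scan could start from something like the sketch below. The phrase patterns, the character ranges, and the length cap are illustrative assumptions only; a production scanner would need a broader, maintained rule set, and none of these values come from Orkas.

```typescript
// Zero-width characters, bidirectional controls, word joiner and
// invisible operators, and the byte order mark.
const INVISIBLE_CHARS = /[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufeff]/;

// Illustrative patterns for injection and exfiltration attempts.
const SUSPICIOUS_PHRASES: RegExp[] = [
  /ignore (all |any )?(previous|prior) instructions/i,
  /(send|upload|post) .*(api key|password|token|secret)/i,
];

const MAX_ENTRY_CHARS = 500; // hypothetical cap per memory entry

interface ScanResult {
  ok: boolean;
  reason?: string;
}

function scanMemoryEntry(text: string): ScanResult {
  if (text.length > MAX_ENTRY_CHARS) return { ok: false, reason: "too long" };
  if (INVISIBLE_CHARS.test(text)) return { ok: false, reason: "invisible unicode" };
  for (const pattern of SUSPICIOUS_PHRASES) {
    if (pattern.test(text)) return { ok: false, reason: "suspicious instruction" };
  }
  return { ok: true };
}
```

Rejecting every zero-width character is deliberately strict: it also blocks zero-width joiners used in some emoji sequences, a trade-off that seems acceptable for short factual notes.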

Session: recoverable even after a crash

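The session manages the message history. An in-memory array is enough during development, but in production on a user's machine you have to assume the process can die at any moment. Production sessions are therefore stored in a local JSONL file, one message per line.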

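New messages are written with an atomic append. Whole-file rewrites such as compaction or clear write a temporary file first and then atomically rename it into place. Even if power is lost midway, this makes it unlikely that a broken, half-written record is left behind.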
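On Node.js, the append and write-then-rename pattern can be sketched with the standard `fs` module. The function names and the temp-file naming are assumptions for this example. Note that a plain append is not strictly atomic at the filesystem level, so the loader here skips a torn final line instead of failing.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Append one message as a single JSONL line, written in one call.
function appendMessage(file: string, message: unknown): void {
  fs.appendFileSync(file, JSON.stringify(message) + "\n", "utf8");
}

// Rewrite the whole session: write a sibling temp file, flush it to disk,
// then rename it over the original. Rename within one directory is atomic
// on POSIX filesystems, so readers see the old file or the new one.
function rewriteSession(file: string, messages: unknown[]): void {
  const tmp = `${file}.${process.pid}.tmp`;
  const body = messages.map((m) => JSON.stringify(m) + "\n").join("");
  const fd = fs.openSync(tmp, "w");
  try {
    fs.writeSync(fd, body);
    fs.fsyncSync(fd);
  } finally {
    fs.closeSync(fd);
  }
  fs.renameSync(tmp, file);
}

// Load the session, skipping blank lines and any line that fails to parse.
function loadSession(file: string): unknown[] {
  if (!fs.existsSync(file)) return [];
  const messages: unknown[] = [];
  for (const line of fs.readFileSync(file, "utf8").split("\n")) {
    if (!line.trim()) continue;
    try {
      messages.push(JSON.parse(line));
    } catch {
      // A crash mid-append can leave a truncated last line; ignore it.
    }
  }
  return messages;
}

// Usage: append two messages, compact them into one summary, append again.
const sessionDir = fs.mkdtempSync(path.join(os.tmpdir(), "session-"));
const sessionFile = path.join(sessionDir, "session.jsonl");
appendMessage(sessionFile, { role: "user", text: "hi" });
appendMessage(sessionFile, { role: "assistant", text: "hello" });
rewriteSession(sessionFile, [{ role: "user", text: "summary of the earlier conversation" }]);
appendMessage(sessionFile, { role: "user", text: "next question" });
```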

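The most important piece is orphan tool call self-healing. The model emits a tool call, the harness executes it, and the tool result is written. If the process dies anywhere in that sequence, the disk is left in a state where the call exists but the result does not. If that history is sent to the model as-is next time, the API rejects it or stalls.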

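Orkas runs heal on every session load. It collects the tool call IDs from assistant messages, looks for the matching results, and fills in any missing one with a synthetic result saying the call was interrupted. It also reorders results to match the order in which the calls were declared and discards orphan results that have no matching call. This returns the session to a safe state that satisfies the model API's pairing requirement.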
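The heal pass described above can be sketched as a single function. The message shapes and the wording of the synthetic result are assumptions for this example; the three behaviors (fill missing results, reorder to declaration order, drop orphans) follow the description in the text.

```typescript
interface ToolCall {
  id: string;
  name: string;
}

type ToolResultMessage = { role: "tool"; toolCallId: string; text: string };

type SessionMessage =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls?: ToolCall[] }
  | ToolResultMessage;

function healSession(messages: SessionMessage[]): SessionMessage[] {
  // Index every recorded result by call ID (first one wins on duplicates).
  const results = new Map<string, ToolResultMessage>();
  for (const m of messages) {
    if (m.role === "tool" && !results.has(m.toolCallId)) results.set(m.toolCallId, m);
  }

  const healed: SessionMessage[] = [];
  for (const m of messages) {
    // Results are re-emitted next to their call below; a result whose call
    // never appears is an orphan and is dropped here.
    if (m.role === "tool") continue;
    healed.push(m);
    if (m.role === "assistant" && m.toolCalls) {
      for (const call of m.toolCalls) {
        const result: ToolResultMessage = results.get(call.id) ?? {
          role: "tool",
          toolCallId: call.id,
          text: "[interrupted: no result was recorded for this call]",
        };
        healed.push(result); // emitted in declaration order
      }
    }
  }
  return healed;
}
```

The function is idempotent: running it on an already healed session returns an equivalent list, which makes it safe to call on every load.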

Decisions that mattered

Make the primary interface a generator. Streaming and non-streaming share the same implementation, and intermediate state flows naturally to the UI.

Start compaction at 60%. Rather than waiting until the last moment, this leaves the headroom that compaction itself needs.

Protect the pairing invariant everywhere. Compaction, persistence, and load-time self-healing all follow the same rule.

Confine vendor differences to the Provider layer. Gathering thinking blocks, cache keys, and capability differences there keeps the runner clean.

Summary

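There is no flashy magic in Orkas's harness. Its value lies in having drawn, plainly and carefully, the boundaries needed to run an agent reliably in real-world conditions. The runner owns the loop and retries, tools own capabilities, the Provider layer absorbs multiple models, memory covers retrieval and personal context, and the session handles persistence and self-healing. Each piece is small, but together they make an agent runtime you can use every day.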

The next article covers an even more interesting layer of Orkas: how the agent learns from its own usage history, turns experience into reusable skills, and gradually becomes easier to use.