Anyone who has shipped an agent product knows the feeling: getting a demo working is fast, but turning it into something a user trusts every day — something that runs all day long on their own machine without falling over — is hard. And the hard part isn't wiring up the model. It's the whole layer that sits around the model.
That layer goes by a few names; the one I'll use is the Agent Harness. It sits between the "big model" and the "product features," and it's the real runtime: it turns a single user request into round after round of conversation with the model, threading tool calls in between, feeding results back, compacting context before it overflows, retrying through network hiccups, and recovering the conversation even after the process crashes. The model does the thinking; the harness makes that thinking land as a reliable sequence of actions.
Orkas is a desktop agent app that runs on the user's own machine, and its harness lives entirely on the client. This article walks through how that layer is built: how it's split into layers, what the run loop looks like, how tools and models are abstracted, and how memory and sessions are handled. Code details have been scrubbed and generalized, but the engineering structure is real.
The layers
Lay an agent product out flat and you get roughly these layers, drawn here from the product surface at the top down to the infrastructure at the bottom:
┌──────────────────────────────────────────────────────┐
│ Product features (chat / skills / connectors / sync) │
├──────────────────────────────────────────────────────┤
│ Agent Harness (run loop / tools / session)           │
├──────────────────────────────────────────────────────┤
│ Provider abstraction (unify many LLM vendors)        │
├──────────────────────────────────────────────────────┤
│ Infrastructure (types / errors / logging / config)   │
└──────────────────────────────────────────────────────┘
There's one consequential choice baked in here: the agent loop runs entirely on the client. The desktop app is not a thin client: it holds the harness itself and calls the model API directly. The server only handles accounts, multi-device sync, and billing; it doesn't run the agent at all. This decision shaped almost everything downstream: sessions land on the local disk, tools operate directly on the user's working directory, and local files and session history never pass through the product's own server.
The harness itself breaks into a few pieces: the run loop (runner), the session, the tools, the Provider layer, and memory. Let's take them one at a time.
The run loop: a streaming generator
The heart of the harness is the runner. In one sentence, what it does is: talk to the model over and over until the model says "I'm done."
It's implemented as an async generator, and that choice matters. A single agent run is far more than "send a request, wait for a result." A lot happens in between — the model is emitting tokens, it wants to call a tool, the tool finished, the context got long enough to trigger compaction, the network failed and we're retrying. With callbacks or plain Promises, those intermediate states are hard to surface cleanly to the caller. As a generator, they all become a stream of yield-ed events:
type AgentRunEvent =
  | { type: "text_delta"; text: string }                               // model emitting tokens
  | { type: "tool_start"; name: string; input: unknown }               // a tool starts executing
  | { type: "tool_end"; name: string; result: string }                 // a tool finished
  | { type: "compaction"; tokensBefore: number; tokensAfter: number }  // context compacted
  | { type: "retry"; attempt: number; reason: string }                 // error, retrying
  | { type: "done"; result: AgentRunResult }                           // terminal

The UI subscribes to this event stream and paints the model's output and the tool execution in real time. The non-streaming entry point is internally just "consume the stream to the end, take the final done." Both entry points share one implementation, so there's no second code path to drift out of sync.
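To make the "one implementation, two entry points" idea concrete, here is a minimal sketch. The event union is trimmed to two variants, and `runStream`, `run`, and the echo behavior are hypothetical stand-ins, not Orkas's actual code.

```typescript
// Trimmed-down, hypothetical versions of the event and result types.
type AgentRunResult = { text: string };
type AgentRunEvent =
  | { type: "text_delta"; text: string }
  | { type: "done"; result: AgentRunResult };

// Stand-in for the streaming runner: emits two deltas, then the terminal event.
async function* runStream(prompt: string): AsyncGenerator<AgentRunEvent> {
  let text = "";
  for (const piece of ["echo: ", prompt]) {
    text += piece;
    yield { type: "text_delta", text: piece };
  }
  yield { type: "done", result: { text } };
}

// Non-streaming entry point: drain the same stream and keep the final result.
async function run(prompt: string): Promise<AgentRunResult> {
  for await (const event of runStream(prompt)) {
    if (event.type === "done") return event.result;
  }
  throw new Error("stream ended without a done event");
}
```

A UI would iterate `runStream` directly and render each event; a batch caller would just `await run(...)`. Either way the same generator body executes.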
What happens inside a turn
Unrolled, one turn looks roughly like this:
1. Push the user message (possibly with images) into the session history.
2. Assemble the system prompt, injecting the currently available tools, skill index, and so on.
3. Parse the model string and resolve it to a concrete Provider and model ID.
4. Convert all tools into definitions the model understands, and send them out together with the history.
5. Consume the model's response stream, yield-ing text token by token while collecting any tool calls the model makes.
6. When the stream ends, look at the model's stop reason:
   - If it's tool_use, the model wants to call a tool: run the tools, then go back to step 5 and ask the model again.
   - Otherwise the turn is over: assemble the result, yield done, and return.
There's one invariant you must hold here: every tool call the model makes must be immediately followed in the history by a matching tool result. The model API enforces this pairing hard — break it and the next request will either error out or just hang. We'll come back to this when we talk about session self-healing.
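The loop and the pairing invariant can be sketched together. This is a simplified, non-streaming illustration under assumptions: the `Message` shape, the `Model` and `ToolRunner` function types, and the `maxRounds` guard are all hypothetical and not taken from Orkas.

```typescript
type ToolCall = { id: string; name: string };
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls: ToolCall[] }
  | { role: "tool"; callId: string; content: string };

type ModelReply = { text: string; toolCalls: ToolCall[]; stopReason: "tool_use" | "end_turn" };
type Model = (history: Message[]) => Promise<ModelReply>;
type ToolRunner = (name: string) => Promise<string>;

async function runTurn(
  history: Message[],
  userText: string,
  model: Model,
  runTool: ToolRunner,
  maxRounds = 10, // assumed safety cap, not described in the article
): Promise<string> {
  history.push({ role: "user", text: userText });
  for (let round = 0; round < maxRounds; round++) {
    const reply = await model(history);
    history.push({ role: "assistant", text: reply.text, toolCalls: reply.toolCalls });
    if (reply.stopReason !== "tool_use") return reply.text;
    // Pairing invariant: every call gets its result before the next request goes out.
    for (const call of reply.toolCalls) {
      history.push({ role: "tool", callId: call.id, content: await runTool(call.name) });
    }
  }
  throw new Error("too many tool rounds");
}
```

Because the results are appended inside the same loop iteration that recorded the calls, the history handed to the next `model(history)` call always has each call followed by its result.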
How a tool call gets routed back
The model doesn't execute tools itself; it only says "I'd like to call read_file with these arguments." Once the runner picks up that intent:
for (const call of toolUseBlocks) {
  yield { type: "tool_start", name: call.name, input: call.input };
  const tool = this.tools.get(call.name);
  const ctx = { workingDir, signal, state: { sandboxEnv } };
  const result = await tool.execute(call.input, ctx);
  // append the result to the session as a tool-result message
  session.addToolResult(call.id, result);
  yield { type: "tool_end", name: call.name, result: result.content };
}

Tools run sequentially, results are written back to the history in the order the model declared them, and then the model is asked again with those results in hand. Seeing the results, the model might call another tool or just give its final answer. This "ask → call → answer → ask again" loop is exactly what lets an agent complete multi-step tasks.
One detail deserves its own mention: some tools return images — screenshots, generated pictures. But many models don't accept images in the tool-result channel. Orkas handles this by splitting the image into a separate user message placed after the tool result — the model first reads "the tool returned this text," then on the very next turn sees the corresponding image. A small compromise that routes around capability differences between Providers.
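A minimal sketch of that split, assuming a hypothetical `ToolOutput` shape with an optional base64 image and a simplified history message type:

```typescript
// Hypothetical shapes for illustration only.
type ToolOutput = { text: string; imageBase64?: string };
type HistoryMessage =
  | { role: "tool"; callId: string; content: string }
  | { role: "user"; content: string; imageBase64: string };

// The text goes into the tool-result channel; the image, if any,
// follows as a separate user message.
function toHistoryMessages(callId: string, out: ToolOutput): HistoryMessage[] {
  const messages: HistoryMessage[] = [{ role: "tool", callId, content: out.text }];
  if (out.imageBase64 !== undefined) {
    messages.push({
      role: "user",
      content: `Image returned by tool call ${callId}`,
      imageBase64: out.imageBase64,
    });
  }
  return messages;
}
```

The tool result still comes first, so the call/result pairing is untouched; the image only rides along behind it.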
What to do when context is about to overflow
The wall long tasks hit most often is the context window. Orkas doesn't wait until it's full — it sets a 60% watermark: after each round of tools, it estimates how much of the window the current tokens occupy, and once that passes 60% it proactively triggers compaction.
Compaction itself asks the model to summarize the earlier conversation, then replaces the old messages with that summary, keeping only the most recent tail. Sounds simple, but there's a trap: after the swap, the retained tail must not start with an "orphan tool result" — there can't be a "result with no matching call," or you've broken the pairing invariant again. So the compaction logic makes sure the cut lands on a clean boundary.
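The watermark check and the clean-boundary rule might look like the sketch below. The message shape, the function names, and the choice to move the cut forward (rather than backward) past orphaned results are assumptions for illustration.

```typescript
type Msg = { role: "user" | "assistant" | "tool"; content: string };

const COMPACTION_WATERMARK = 0.6; // the 60% threshold from the text

function shouldCompact(estimatedTokens: number, contextWindow: number): boolean {
  return estimatedTokens / contextWindow > COMPACTION_WATERMARK;
}

// Returns the index where the retained tail starts. If the naive cut would
// open the tail with tool results whose calls are being summarized away,
// the cut moves forward until the tail starts on a non-tool message.
function findCleanCut(history: Msg[], keepLast: number): number {
  let cut = Math.max(0, history.length - keepLast);
  while (cut < history.length && history[cut].role === "tool") cut++;
  return cut;
}
```

Everything before the returned index is handed to the model for summarization; everything from it onward is kept verbatim.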
There's a more interesting choice worth unpacking here: why the coarse "summarize the whole block at 60%" approach, rather than something more fine-grained — scoring each message and trimming by importance, structured extraction from tool outputs, maintaining a layered memory tree? Those approaches look great in papers, but we deliberately didn't go that way, for three reasons.
First, caching. The model's prompt cache hits by prefix: as long as the history's prefix is unchanged, that span hits the cache, saving both money and latency. Fine-grained compaction constantly rewrites the middle of the history, which means repeatedly shattering the cached prefix — every edit forces a large re-prefill. The "leave it alone, then compact once at the watermark" strategy keeps the prefix stable across the vast majority of turns, with only that single compaction invalidating it. Far friendlier to the cache.
Second, complexity. That "every tool call must be paired" invariant we keep hammering on — the more finely you trim history, the more likely you break it in some corner. A coarse summary only has to protect one clean cut point; there are an order of magnitude fewer places to get it wrong. One fewer class of edge case is one fewer class of production incident.
Third, riding the dividend of better models. Context windows have grown steadily over the past couple of years, and models handle long context better and better. Pouring effort today into an elaborate compaction algorithm is essentially fighting a problem that's shrinking — odds are you finish tuning it just as the next generation doubles its window, and your complexity becomes pure liability. Conversely, handing the summarization to the model itself gets automatically better as the model improves: the better it is at picking out what matters, the higher the summary quality, and we don't change a line. Complexity the model can carry for you is complexity you shouldn't carry yourself.
Token estimation hides one easy-to-miss issue: Chinese. Estimate Chinese using English intuitions (roughly one token per few characters) and you'll badly undercount. Orkas weights CJK characters separately in its estimate; otherwise the watermark for an all-Chinese conversation reads wrong, and compaction won't fire when it should.
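As a rough illustration of weighting CJK separately, here is a sketch. The specific weights (about four non-CJK characters per token, about one token per CJK character) and the Unicode ranges checked are assumptions; real tokenizers vary, and the article does not state Orkas's actual numbers.

```typescript
// Heuristic estimate only; the weights below are assumed, not measured.
const CHARS_PER_TOKEN_OTHER = 4;
const TOKENS_PER_CJK_CHAR = 1;

function isCjk(codePoint: number): boolean {
  return (
    (codePoint >= 0x4e00 && codePoint <= 0x9fff) || // CJK Unified Ideographs
    (codePoint >= 0x3040 && codePoint <= 0x30ff) || // Hiragana and Katakana
    (codePoint >= 0xac00 && codePoint <= 0xd7af)    // Hangul syllables
  );
}

function estimateTokens(text: string): number {
  let cjk = 0;
  let other = 0;
  for (const ch of text) {
    if (isCjk(ch.codePointAt(0) ?? 0)) cjk++;
    else other++;
  }
  return Math.ceil(cjk * TOKENS_PER_CJK_CHAR + other / CHARS_PER_TOKEN_OTHER);
}
```

With these weights, twelve Latin characters estimate to 3 tokens while twelve Chinese characters estimate to 12, which is the gap that would otherwise make the watermark read low.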
Errors and retries
For software that runs on the user's machine and depends on an external model API, errors are the norm, not the exception. The runner sorts them into a few classes and treats each differently:
- Retryable: rate limits, timeouts, dropped connections, 5xx. Exponential backoff with jitter, capped at 30 seconds; if it's a rate limit and the server sent retry-after, honor it.
- Non-retryable: things like auth failures. No number of retries will help, so error out immediately.
- Special: context overflow. Try compaction first, retry once after, and only error out if that still fails.
There's one more class: "the tool itself failed." This doesn't sink the whole turn — a tool failure is itself information for the model, which, seeing "that command errored," can perfectly well try a different approach. The harness distinguishes these transient tool errors from real faults: it neither interrupts the flow nor loses them — they show up in after-the-fact stats. (That data later feeds the self-evolution mechanism, which is the subject of the next article.)
The external cancellation signal (AbortSignal) is checked at every key point. The user hits "stop," and the current turn halts immediately — no new retries get kicked off.
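The classification and the backoff delay can be sketched as two small pure functions. The 1-second base delay, the "full jitter" variant, the `overflow` flag, and the function names are assumptions; only the 30-second cap and the retry-after rule come from the description above.

```typescript
type ErrorClass = "retryable" | "fatal" | "context_overflow";

// `overflow` stands in for however a vendor signals a too-long prompt.
function classifyError(status: number | undefined, overflow = false): ErrorClass {
  if (overflow) return "context_overflow";            // compact, then retry once
  if (status === undefined) return "retryable";       // timeout or dropped connection
  if (status === 429 || status >= 500) return "retryable";
  return "fatal";                                     // e.g. 401/403: retrying cannot help
}

const BASE_DELAY_MS = 1000;  // assumed base
const MAX_DELAY_MS = 30000;  // the 30-second cap from the text

function backoffDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
): number {
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000; // server hint wins
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt));
  return Math.floor(random() * ceiling); // full jitter: uniform in [0, ceiling)
}
```

Injecting `random` keeps the function deterministic under test. A real loop would also check the AbortSignal before sleeping and again before each new attempt.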
Tool abstraction: simple enough to extend
The tool interface is deliberately thin:
interface AgentTool {
  readonly name: string;
  readonly description: string;                   // shown to the model
  readonly inputSchema: Record<string, unknown>;  // JSON Schema to constrain inputs
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

A tool is just "a name + a description for the model + an input schema + an execute function." The built-ins (read file, write file, run a shell command, web search and fetch) all implement this interface. The desktop layer stacks a batch of locally-flavored tools on top (knowledge-base search, image generation, calling external connectors), but the interface is the same one.
The payoff of a thin interface is that where a tool comes from is irrelevant to the runner: built-in, user-defined, or loaded from a skill — they're all the same kind of thing, registered into one Map<string, AgentTool> and converted into model-readable definitions each turn.
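A sketch of that registry follows. `ToolContext` and `ToolResult` are reduced to placeholder shapes, and the `ToolDefinition` wire format, the duplicate-name check, and the `echo` tool are hypothetical; each vendor has its own field names for tool definitions.

```typescript
interface ToolContext { workingDir: string }   // placeholder shape
interface ToolResult { content: string }       // placeholder shape
interface AgentTool {
  readonly name: string;
  readonly description: string;
  readonly inputSchema: Record<string, unknown>;
  execute(input: Record<string, unknown>, ctx: ToolContext): Promise<ToolResult>;
}

// Hypothetical model-facing definition.
interface ToolDefinition { name: string; description: string; input_schema: Record<string, unknown> }

class ToolRegistry {
  private readonly tools = new Map<string, AgentTool>();

  register(tool: AgentTool): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  get(name: string): AgentTool | undefined {
    return this.tools.get(name);
  }

  // Called each turn to produce what gets sent alongside the history.
  toDefinitions(): ToolDefinition[] {
    return Array.from(this.tools.values()).map((t) => ({
      name: t.name,
      description: t.description,
      input_schema: t.inputSchema,
    }));
  }
}

// A trivial tool; built-in, user-defined, and skill-loaded tools would all look like this.
const echoTool: AgentTool = {
  name: "echo",
  description: "Returns its input text unchanged.",
  inputSchema: { type: "object", properties: { text: { type: "string" } }, required: ["text"] },
  async execute(input) {
    return { content: String(input["text"]) };
  },
};
```

The runner never needs to know where `echoTool` came from; it only sees the interface.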
Side-effecting tools like shell commands go through an isolated executor: timeouts, output-length limits, a command blocklist, and environment variables passed in separately rather than mutating the process's global environment — the latter would leak into a swarm of child processes and, in a multi-process architecture like Electron, can easily break startup.
The Provider layer: flattening many models into one interface
Users' model preferences are all over the map, and a product can't hard-bind to one vendor. Below the harness, Orkas lays down a Provider abstraction that unifies different vendors' models behind a single interface:
interface LLMProvider {
  readonly id: string;
  complete(params: CompletionParams): Promise<CompletionResult>;
  stream(params: CompletionParams): AsyncIterable<StreamEvent>;
  validateAuth(): Promise<boolean>;
}

The runner above only ever talks to this interface; it has no idea which vendor is behind it. A registry handles routing by model string: an explicit provider/model form is split directly; a bare model name is attributed by prefix. Auth (an API key or OAuth token) is managed here too, and an expired OAuth token is refreshed automatically.
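Routing by model string could be sketched like this. The prefix table and the vendor and model names in it are invented placeholders, and splitting on the first slash only is an assumption.

```typescript
type ResolvedModel = { providerId: string; modelId: string };

// Hypothetical prefix table: [model-name prefix, provider id].
const PREFIX_RULES: ReadonlyArray<readonly [string, string]> = [
  ["alpha-", "vendor-a"],
  ["beta-", "vendor-b"],
];

function resolveModel(model: string): ResolvedModel {
  // Explicit "provider/model" form: split on the first slash only,
  // so model IDs that themselves contain slashes stay intact.
  const slash = model.indexOf("/");
  if (slash > 0) {
    return { providerId: model.slice(0, slash), modelId: model.slice(slash + 1) };
  }
  // Bare model name: attribute by prefix.
  const hit = PREFIX_RULES.find(([prefix]) => model.startsWith(prefix));
  if (!hit) throw new Error(`cannot infer a provider for model "${model}"`);
  return { providerId: hit[1], modelId: model };
}
```

Failing loudly on an unknown bare name seems safer than guessing a default vendor, though that choice is also an assumption here.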
When you flatten many models into one interface, the real headache isn't text completion. It's the corners where vendors' semantics disagree. Here are two examples that bit us.
One is preserving thinking blocks across vendors. Reasoning models emit a span of "thinking" content; some vendors encrypt it and require you to echo it back verbatim, others represent it with a different set of fields. If a user switches from vendor A to vendor B mid-conversation, the signature on that thinking span in the history no longer matches. The fix is to stamp every message in the history with "which model produced this," so the transform layer can decide whether to keep it verbatim: same model, keep it; different model, downgrade per the rules.
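A sketch of the stamping idea follows. The block and message shapes are hypothetical, and the downgrade rule used here (drop thinking blocks produced by a different model) is an assumption, since the article does not spell out the actual rules.

```typescript
type Block =
  | { kind: "text"; text: string }
  | { kind: "thinking"; text: string; signature?: string };

// Every assistant message is stamped with the model that produced it.
type AssistantMessage = { producedBy: string; blocks: Block[] };

// Same model: keep the message verbatim, signature included.
// Different model: apply the assumed downgrade and drop thinking blocks.
function prepareForModel(history: AssistantMessage[], targetModel: string): AssistantMessage[] {
  return history.map((msg) =>
    msg.producedBy === targetModel
      ? msg
      : { ...msg, blocks: msg.blocks.filter((b) => b.kind !== "thinking") },
  );
}
```

The transform returns new message objects for downgraded entries, so the stored history keeps the original blocks in case the user switches back.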
The other is the prompt cache. Across turns of one session the prefix is highly repetitive, and caching it saves meaningful cost and latency. The implementation passes the session ID as the cache key to vendors that support it, handling each vendor's key-length limits along the way (truncate or hash if it's too long, say).
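One way to fit a session ID into a vendor's key-length limit is to keep a readable prefix and append a short digest. The sketch below uses a small FNV-1a-style hash over UTF-16 code units purely to stay dependency-free; a real implementation would more likely use a cryptographic digest, and the function names and the scheme itself are assumptions.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units, rendered as 8 hex chars.
function fnv1aHex(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Short IDs pass through; long ones become "<prefix><digest>" of exactly maxLength.
function cacheKeyFor(sessionId: string, maxLength: number): string {
  if (sessionId.length <= maxLength) return sessionId;
  const digest = fnv1aHex(sessionId);
  if (maxLength <= digest.length) return digest.slice(0, maxLength);
  return sessionId.slice(0, maxLength - digest.length) + digest;
}
```

The key is deterministic, so every turn of the same session maps to the same cache entry, which is the whole point of passing it.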
These are all grunt work — but it's precisely this layer of grunt work that lets the runner above pretend "there's only one kind of model."
Memory: two mechanisms, each for its own job
"Memory" in Orkas is actually two parallel mechanisms solving two completely different problems. One is a retrieval-based knowledge base, for the bulk material you "go look up when needed." The other is cross-session memory, for the small set of key facts you "should always have in mind." Many products mash these two together; keeping them separate makes things much clearer.
Knowledge base: hybrid retrieval
The first mechanism targets content that's large but only occasionally relevant — the user's documents, past notes, domain knowledge. This is a local knowledge base with vector retrieval, in two backends: a lightweight pure-memory version (for tests and ephemeral use), and a version persisted to a local database (for production, with full-text indexing and vectors).
Data comes in along this path:
documents → chunk on line boundaries (with overlap) → dual indexing
    ├─ full-text index (keywords, no embedding cost)
    └─ vector index (if an embedding model is configured)

Chunks are cut on line boundaries with a little overlap between them, to avoid slicing a complete piece of meaning down the middle. Retrieval is hybrid: a vector pass (semantically close) and a keyword pass (literal hits), with the two result sets merged via RRF (Reciprocal Rank Fusion):

score = Σ 1 / (k + rank_i)

Here rank_i is the result's position in pass i, and k is a constant that damps the influence of the very top ranks. The higher a result ranks within one pass, the more it contributes; summed across both passes, you both honor semantic relevance and don't lose exact literal matches. The vector and keyword weights are tunable, defaulting to favor semantics. After merging, results are deduplicated by "(document, start line)," keeping only the best one per location, then anything below a threshold is cut, returning the top-K.
Why not rely on vectors alone? Because vector retrieval often face-plants on proper nouns, code symbols, and exact literal strings — queries that aren't semantically special but where the literal matters a lot — while keyword-only can't catch "same meaning, different phrasing." Running both is a very practical trade-off between retrieval quality and cost.
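The fusion step can be sketched in a few lines. This is a weighted variant of RRF under assumptions: k = 60 is the constant commonly used with RRF, while the 0.6/0.4 weights, the option names, and the `Hit` shape are illustrative guesses. The score threshold is left out for brevity.

```typescript
type Hit = { docId: string; startLine: number };
type ScoredHit = Hit & { score: number };
type RrfOptions = { k: number; vectorWeight: number; keywordWeight: number; topK: number };

const DEFAULT_RRF: RrfOptions = { k: 60, vectorWeight: 0.6, keywordWeight: 0.4, topK: 5 };

function rrfFuse(vectorHits: Hit[], keywordHits: Hit[], opts: RrfOptions = DEFAULT_RRF): ScoredHit[] {
  const scores = new Map<string, ScoredHit>();
  const addPass = (hits: Hit[], weight: number): void => {
    hits.forEach((hit, index) => {
      const key = `${hit.docId}:${hit.startLine}`; // one entry per (document, start line)
      const entry = scores.get(key) ?? { ...hit, score: 0 };
      entry.score += weight / (opts.k + index + 1); // ranks are 1-based
      scores.set(key, entry);
    });
  };
  addPass(vectorHits, opts.vectorWeight);
  addPass(keywordHits, opts.keywordWeight);
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, opts.topK);
}
```

A chunk that appears in both passes accumulates two contributions, which is why a result ranked second by vectors and first by keywords can beat the vector pass's top hit.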
Cross-session memory: keeping the user in mind
The knowledge base solves "too much material to hold." But there's another class of thing — tiny in volume, yet it must stay in mind at all times: who this user is, what they prefer, what was agreed last time. These shouldn't depend on retrieval to "get lucky and recall" — they should be present every single turn.
For this, Orkas builds a separate layer of cross-session memory, split by content into two parts:
- User profile: stable facts about the person — role, preferences, communication style, tech stack.
- Fact notes: durable facts about the work — decisions, milestones, project conventions.
Both are small, each with a hard cap of a few thousand characters, which forces them to keep only what's genuinely useful long-term. They don't go through retrieval; instead they're frozen directly into the system prompt at the start of every turn — meaning the agent simply "knows" these things, without having to remember to go look them up. That's exactly the opposite stance from the knowledge base: the knowledge base is "fetch only when needed, gone after," cross-session memory is "always present, always visible."
Writes go through a dedicated memory tool that the model calls when it judges, mid-conversation, that "this is worth remembering long-term," supporting add, substring-replace, and delete. What to save and what not to is spelled out clearly in the tool's description: user corrections and preferences are top priority; durable decisions and conventions get saved; while the current task's transient state, one-off debug info, and anything easily re-discoverable do not — memory is for "durable facts about the user and the project," not for "where I got to this time."
There's an easy-to-overlook but quite important detail: a security scan runs before every write. This content enters the system prompt verbatim and persists across sessions for a long time — effectively a durable injection surface. So every memory about to be written to disk is first scanned for suspicious patterns — classic prompt-injection phrasing ("ignore all previous instructions" and the like), commands trying to exfiltrate keys, invisible unicode characters hidden in the text — and a match is rejected outright. With deduplication and over-limit trimming on top, this memory layer stays useful without becoming a liability.
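A toy version of the write path, to show where the scan sits. The patterns, the 4000-character cap, and the "drop oldest entries first" trimming policy are all assumptions; a production scanner would need a far broader pattern set than three regular expressions.

```typescript
const MAX_MEMORY_CHARS = 4000; // assumed cap ("a few thousand characters")

// Illustrative patterns only.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,          // classic injection phrasing
  /\b(curl|wget)\b.*(api[_-]?key|token|secret)/i,          // command that ships a secret somewhere
  /[\u200B-\u200F\u202A-\u202E\u2060\uFEFF]/,              // zero-width and bidi control characters
];

type WriteResult = { ok: true; entries: string[] } | { ok: false; reason: string };

function addMemory(entries: string[], candidate: string): WriteResult {
  // Scan the raw text before any normalization can strip hidden characters.
  if (SUSPICIOUS_PATTERNS.some((p) => p.test(candidate))) {
    return { ok: false, reason: "failed security scan" };
  }
  const text = candidate.trim();
  if (text.length === 0) return { ok: false, reason: "empty" };
  if (entries.includes(text)) return { ok: true, entries }; // deduplicate
  const next = [...entries, text];
  // Over-limit trimming: assumed policy is to drop the oldest entries first.
  while (next.length > 1 && next.join("\n").length > MAX_MEMORY_CHARS) next.shift();
  if (next.join("\n").length > MAX_MEMORY_CHARS) return { ok: false, reason: "entry exceeds the cap" };
  return { ok: true, entries: next };
}
```

Rejecting outright, rather than sanitizing, keeps the rule simple: nothing that trips the scan ever reaches the system prompt.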
Together, the two mechanisms cover both ends — "huge but occasional" and "small but constant": the knowledge base handles the former, cross-session memory the latter. Add to that the agent's understanding of itself (the subject of the next article), and an Orkas agent walks in carrying three kinds of memory at once — about the material, about the user, and about itself.
Sessions: built to crash, built to heal
A session manages the message history. The basic version is just an in-memory array of messages with history trimming and compaction. But anything running on a user's machine has to assume it can be killed at any moment — the user quits the app, the system reboots, a watchdog timeout takes the process out. So production uses a persistent session, written to a local JSONL file, one message per line.
There are two write strategies: appending a new message uses an atomic append; anything that rewrites the whole file (compaction, clearing) uses "write a temp file + atomic rename." That way, even if the power cuts out mid-write, you never leave a half-written, corrupted record behind.
The most interesting piece is healing orphaned tool calls. Back to that pairing invariant: the model makes a tool call, the harness executes it, the result gets written back — interrupt any of those three steps and you leave an orphan on disk, "a call with no result." Load that session next time and send it as-is to the model, and the API will either reject it or hang.
The healing logic runs every time a session is loaded from disk, and it's idempotent:
- Scan all assistant messages and collect the tool-call IDs they made.
- Look forward for the matching tool results.
- For any call missing a matching result, synthesize one marked "interrupted."
- Along the way, align the result order to the call declaration order, and drop any orphan results that have no matching call.
After this pass, the session is guaranteed to satisfy the API's pairing requirement and is safe to send. The mechanism looks unremarkable, but it's the safety net that keeps a user's conversation from locking up permanently because of a single crash.
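The healing pass described in the steps above can be sketched as a single function. The message shape, the "[interrupted]" marker text, and the assumption that tool results sit directly after the assistant message that made the calls are all illustrative choices, not Orkas's actual format.

```typescript
type SessionMessage =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCallIds: string[] }
  | { role: "tool"; callId: string; content: string };

function healSession(messages: SessionMessage[]): SessionMessage[] {
  const healed: SessionMessage[] = [];
  let i = 0;
  while (i < messages.length) {
    const msg = messages[i];
    i++;
    if (msg.role === "tool") continue; // orphan result with no call before it: drop
    healed.push(msg);
    if (msg.role !== "assistant" || msg.toolCallIds.length === 0) continue;

    // Gather the run of tool results that directly follows this assistant message.
    const results = new Map<string, string>();
    while (i < messages.length) {
      const next = messages[i];
      if (next.role !== "tool") break;
      if (msg.toolCallIds.includes(next.callId)) results.set(next.callId, next.content);
      i++;
    }
    // Re-emit in declaration order, synthesizing a result for any call left without one.
    for (const id of msg.toolCallIds) {
      healed.push({ role: "tool", callId: id, content: results.get(id) ?? "[interrupted]" });
    }
  }
  return healed;
}
```

Running the function on its own output changes nothing, which is what makes it safe to call on every load.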
A few decisions that mattered in hindsight
Stringing this all together, a few decisions look especially valuable after the fact.
Generators as the primary interface. Streaming and non-streaming share one implementation, intermediate state surfaces naturally, and the UI can paint as much detail as it wants. This saved a whole class of inconsistency bugs that "implement non-streaming first, bolt on streaming later" would have created.
Compact at 60%, not when full. It leaves headroom for compaction itself (which also costs a model call) and avoids scrambling at the last moment.
The pairing invariant runs through everything. From the compaction cut point, to writing to disk, to load-time healing, every place that touches the session holds the same rule. With one rule, no spot has to invent its own patch logic.
Grunt work concentrated in the Provider layer. All the cross-vendor awkwardness — thinking blocks, cache keys, capability differences — gets digested in this one layer, in exchange for a clean runner above. Add a new model vendor someday and the change barely spills out.
Wrapping up
Orkas's harness has no stunning algorithm. Its value is in taking "make an agent run reliably in a real environment" and splitting it into a set of modules with clean boundaries, each owning one slice: the runner owns the loop and retries, tools own capability, the Provider layer owns flattening many models, memory owns retrieval, the session owns persistence and healing. None of them is complex on its own; only together do they hold up something people use every day.
If there's anything to take away: make the run loop a streaming generator and intermediate state gets much easier to handle; once a core invariant is set (like "tool calls must be paired"), hold it consistently across compaction, disk writes, and loading — don't let any corner be the exception; concentrate cross-vendor grunt work in one layer and keep it out of business logic; and — most plainly of all — assume your process will be killed at the worst possible moment, and write the healing for that moment ahead of time.
The next article gets into a more interesting part of Orkas: how this agent learns from its own use, distills experience into reusable skills, and slowly makes itself more useful.