Most AI assistants are "use it and forget it." Correct a habit today and it repeats the same mistake tomorrow; teach it your team's particular workflow last week and this week it acts like it never heard of it. Every conversation starts from zero — and however smart the model is, it's still a smart person with amnesia.
Orkas is going after something else: letting the agent learn from its own daily use, distilling recurring experience so it can apply it next time on its own. Put plainly — it gets more useful the more you use it, and that "usefulness" grows toward you, your preferences, your domain, rather than something the model vendor preset for everyone.
This article unpacks how that mechanism is built. It's not as simple as "make the model remember the conversation" — behind it is a complete loop: observe itself → decide whether to reflect → actually reflect → write the conclusions into something reusable → use it again next time. We'll go through it one piece at a time.
The most important thing first: everything below — all the "observing," "recording," and "reflecting" — happens entirely on your own device. Run data, skills, the agent's understanding of itself — all of it lives locally as ordinary files. None of it is uploaded to Orkas's servers, and none of it is used for cross-user analysis or model training. "Self-evolution" means a program reading its own run records locally and improving itself locally — not collecting your data. This experience never leaves the machine, and it serves only you, on this one machine.
The loop, end to end
real use, over and over
│ recorded locally: which tools were called, errors or not, corrected or not
▼
signals accumulate
│ signals extracted from the conversation in place, all kept on-device
▼
decide whether to reflect
│ weighted scoring across signals, fires only past a threshold; network hiccups don't count
▼
reflect in the background
│ not every turn — periodically, picking whichever agents qualify
▼
distilled into two things
│ ① reusable "skills" ② an "understanding" of itself
▼
carried in automatically next turn
└──────────► back to the top, keep rolling

Every step in this loop has its subtleties. The places easiest to get wrong are exactly the two steps that feel simplest at a glance: when to reflect, and what to record once you do. Let's start from the front.
Step 1: observing itself at nearly zero cost
To learn from experience you first need "experience" to look at. At the end of each agent run, the program counts out — locally, on the spot — a few lightweight facts about the run: roughly how many tools were called this turn, whether anything errored, whether it was a transient error (like a network issue) or a real one, and whether the user corrected it on the spot. Just a handful of counts and flags, all computed on the machine, calling no model and sent nowhere.
This matters because it costs nothing in model spend. These are counted straight from the current turn's conversation record; there's no need to make an extra model call just to "analyze itself." If every turn required another model call for introspection, the cost and latency would be unbearable and the whole mechanism would never ship.
The "corrected or not" flag deserves a note. It is a purely heuristic local judgment, estimated by matching a few phrasings in the message on your device: Chinese like "不对" ("that's wrong"), "应该是" ("it should be"), or "重新" ("redo"), and English like "wrong," "actually," or "instead." It doesn't aim to be precise. It is only a signal, not a verdict, and the occasional false positive is fine, because it later gets weighted together with other signals; nothing is decided on it alone.
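To make the "handful of counts and flags" concrete, here is a minimal sketch of what such a local pass could look like. Everything in it is illustrative: the event shape, the `RunFacts` fields, the transient-error markers, and the function names are assumptions, and the phrase list simply reuses the examples mentioned above.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase list, reusing the examples named in the text.
CORRECTION_PATTERNS = [
    re.compile(r"不对|应该是|重新"),
    re.compile(r"\b(wrong|actually|instead)\b", re.IGNORECASE),
]

# Hypothetical markers for transient (environmental) errors.
TRANSIENT_MARKERS = ("timeout", "timed out", "connection reset", "rate limit")


@dataclass
class RunFacts:
    tool_calls: int = 0
    real_errors: int = 0
    transient_errors: int = 0
    user_corrected: bool = False


def looks_like_correction(text: str) -> bool:
    """Cheap phrase match: a signal, not a verdict; false positives are expected."""
    return any(p.search(text) for p in CORRECTION_PATTERNS)


def is_transient(error_text: str) -> bool:
    lowered = error_text.lower()
    return any(marker in lowered for marker in TRANSIENT_MARKERS)


def collect_run_facts(events: list[dict]) -> RunFacts:
    """Count facts straight from one turn's event list; no model call involved."""
    facts = RunFacts()
    for event in events:
        kind = event.get("type")
        if kind == "tool_call":
            facts.tool_calls += 1
            error = event.get("error")
            if error:
                if is_transient(error):
                    facts.transient_errors += 1
                else:
                    facts.real_errors += 1
        elif kind == "user_message" and looks_like_correction(event.get("text", "")):
            facts.user_corrected = True
    return facts
```

The whole pass is a single loop over data already in memory, which is why it can run at the end of every turn without adding model cost.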
Step 2: when is it actually worth reflecting
This is the part of the whole mechanism I think shows the most craft.
The naïve approach is "reflect once you've accumulated N occurrences." But that's crude: three network timeouts in a row and three user corrections in a row are obviously not the same thing and shouldn't be treated alike. Orkas uses weighted multi-signal scoring: each noteworthy phenomenon is a signal carrying a weight; sum the weights of the signals this turn triggered, and reflect only if the total clears a threshold (0.7 by default).
The main signals look roughly like this:
| Signal | Weight | Trigger condition |
|---|---|---|
| User correction | 0.9 | A user correction was detected this turn |
| Skill ineffective | 0.85 | A skill was loaded, yet the turn still errored |
| Recovered from error | 0.8 | Errored, but ultimately pulled it back |
| Hit a known weakness | 0.7 | The task hit a soft spot noted in the self-assessment |
| Task complexity | 0.5 | Tool-call count exceeded a certain number |
For example: a turn with both a user correction (0.9) and some complexity (0.5) sums to 1.4, well past 0.7, so it reflects; a turn that's only a bit complex (0.5) falls short and is let go. The weighting also reflects a judgment call — a direct user correction gets the highest weight, 0.9, because it's the highest signal-to-noise feedback there is: the user has plainly said you're wrong, so it's very likely worth recording.
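The arithmetic above fits in a few lines. This is a minimal sketch using the weights and the default threshold from the table; the signal keys and function names are made up for illustration, and treating a total exactly equal to the threshold as "clearing" it is an assumption.

```python
# Weights and default threshold as listed in the table above.
SIGNAL_WEIGHTS = {
    "user_correction": 0.9,
    "skill_ineffective": 0.85,
    "recovered_from_error": 0.8,
    "known_weakness": 0.7,
    "task_complexity": 0.5,
}
REFLECT_THRESHOLD = 0.7


def reflection_score(triggered: set[str]) -> float:
    """Sum the weights of the signals this turn triggered."""
    return sum(SIGNAL_WEIGHTS[name] for name in triggered)


def should_reflect(triggered: set[str], threshold: float = REFLECT_THRESHOLD) -> bool:
    # Assumption: a total equal to the threshold counts as clearing it.
    return reflection_score(triggered) >= threshold
```

With these numbers, `should_reflect({"user_correction", "task_complexity"})` is true and `should_reflect({"task_complexity"})` is false, matching the two cases in the example.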
That one critical exemption
In the whole scoring logic, there's one rule I consider the watershed for whether this mechanism "learns the right things": transient errors never count.
Network timeouts, dropped connections, rate limits — these are environmental problems, not deficiencies in the agent's own capability. Fail to exclude them and something bad happens: a tool errors because of one chance network hiccup, and the reflection mechanism records it as "this tool is unreliable, use it less" — or even mangles or deletes a perfectly good skill. From then on the agent has learned a wrong lesson, and that mistake follows it around.
So the "recovered from error," "skill ineffective," and "hit a known weakness" signals all explicitly keep pure transient errors out. The reflection prompt repeats the reminder too: network-class errors are environmental, don't record them as weaknesses, don't touch the related skills. What a self-improving system should fear most isn't learning slowly — it's learning in the wrong direction. This exemption is what guards against exactly that.
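One way to picture the exemption is as a guard inside the code that derives the error-related signals. The sketch below is hypothetical: the `TurnFacts` fields and the exact trigger conditions are assumptions, in particular the choice to suppress "hit a known weakness" whenever the turn's only errors were transient.

```python
from dataclasses import dataclass


@dataclass
class TurnFacts:
    real_errors: int          # non-transient errors this turn
    transient_errors: int     # timeouts, dropped connections, rate limits
    ended_ok: bool            # the turn ultimately succeeded
    skill_loaded: bool
    hit_known_weakness: bool


def error_signals(facts: TurnFacts) -> set[str]:
    """Derive error-related signals, ignoring purely transient failures."""
    signals: set[str] = set()
    had_real_error = facts.real_errors > 0
    only_transient = facts.transient_errors > 0 and not had_real_error

    # Transient errors never reach these checks: only real errors count.
    if facts.skill_loaded and had_real_error:
        signals.add("skill_ineffective")
    if had_real_error and facts.ended_ok:
        signals.add("recovered_from_error")
    # Assumption: a weakness hit is ignored if the turn only failed transiently.
    if facts.hit_known_weakness and not only_transient:
        signals.add("known_weakness")
    return signals
```

A turn with three timeouts and no real error yields an empty set here, so it contributes nothing to the reflection score no matter which skills were loaded.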
Step 3: reflection runs in the background, not in your face
An easy trap: the moment you detect "time to reflect," stop and reflect right there. That makes the agent feel like it stutters now and then, wandering off to "ponder life" — a bad experience.
Orkas moves reflection to the background, on a fixed cadence. The scheduling rules, roughly:
- Start a reflection cycle every so often (say, on the order of a dozen-plus hours).
- Enforce a minimum cooldown between two reflections for the same agent (a few hours), so it doesn't run too often.
- But if it hasn't reflected in too long (say, over a week), force one, so it doesn't drag on indefinitely.
- Cap the number of agents picked per cycle, so it doesn't spread too thin at once.
There's a small design I'm fond of called the dirty gate: when a cycle starts, first check whether this agent has anything new since its last reflection — any new signals, any updated conversation records. If there's no movement at all, skip it this time and don't waste a (model-costing) reflection. Simple, but it saves a lot in practice.
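The scheduling rules and the dirty gate combine into one selection pass per cycle. The sketch below is an illustration under stated assumptions, not Orkas's actual scheduler: the concrete durations, the per-cycle cap, the `AgentState` fields, and the longest-waiting-first ordering are all invented, and it assumes the dirty gate still applies to an overdue agent, which the rules above do not spell out.

```python
from dataclasses import dataclass

HOUR = 3600.0
COOLDOWN = 4 * HOUR           # assumed value for "a few hours"
FORCE_AFTER = 7 * 24 * HOUR   # assumed value for "over a week"
MAX_PER_CYCLE = 3             # assumed cap on agents per cycle


@dataclass
class AgentState:
    name: str
    last_reflected_at: float   # epoch seconds
    last_activity_at: float    # newest signal or conversation update
    threshold_cleared: bool    # some turn since last reflection cleared the score


def pick_agents(agents: list[AgentState], now: float) -> list[AgentState]:
    picked = []
    for agent in agents:
        since = now - agent.last_reflected_at
        if since < COOLDOWN:
            continue  # reflected too recently
        if agent.last_activity_at <= agent.last_reflected_at:
            continue  # dirty gate: nothing new, skip the costly reflection
        overdue = since > FORCE_AFTER
        if agent.threshold_cleared or overdue:
            picked.append(agent)
    # Assumption: agents that have waited longest go first.
    picked.sort(key=lambda a: a.last_reflected_at)
    return picked[:MAX_PER_CYCLE]
```

The dirty gate is the cheapest check in the loop, a single timestamp comparison, and it runs before anything that would cost a model call.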
Step 4: how reflection actually works
When it's really time to reflect, the flow is: first organize the recent activity into a "packet," then pair it with a carefully written prompt and hand it to the model to read and summarize.
The packet has a budget: take at most a handful of recent conversations, add a few classes of system events, interleave them chronologically, and cap the total under a token limit (say, a little over ten thousand tokens). The whole history is not shoveled in: it wouldn't fit, and the signal-to-noise ratio would be poor.
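A budgeted packet can be sketched as "newest first until the budget runs out, then back into chronological order." The code below is only an illustration: the `(timestamp, text)` item shape, the limits, and the four-characters-per-token estimate are assumptions standing in for whatever tokenizer and caps are really used.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_packet(conversations, events, max_conversations=5, token_budget=10_000):
    """Merge recent conversations and system events under a token budget.

    Both inputs are lists of (timestamp, text) tuples.
    """
    # Keep only the most recent few conversations.
    recent = sorted(conversations, key=lambda item: item[0])[-max_conversations:]
    # Interleave conversations and events chronologically.
    merged = sorted(recent + list(events), key=lambda item: item[0])

    kept, used = [], 0
    for timestamp, text in reversed(merged):  # walk newest first
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            break  # older material is dropped once the budget is spent
        kept.append((timestamp, text))
        used += cost
    kept.reverse()  # hand the model a chronological packet
    return kept
```

Spending the budget from the newest item backwards means that when something has to be dropped, it is the oldest activity.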
What really takes care is the prompt. It requires the model to produce not "descriptions" but executable imperatives. The difference looks small and matters enormously. Compare:
✗ "The agent's output is sometimes too verbose; be mindful."
✓ "When answering family-office questions, never exceed 5 bullet points."
✗ "The user seems to prefer concise output."
✓ "When answering in a family-office context, always give the bottom line first, then the reasoning."
The prompt explicitly steers the model toward "never / always / when-then" structures with concrete trigger conditions. The reason is practical: a note that says "be mindful of being concise" tells the agent nothing actionable next time it reads it, whereas "never exceed 5 bullet points" can be followed directly. For self-improvement to be useful, what gets distilled has to be an instruction that lands — not a correct platitude.
After reflecting, the model can do a few things: create or modify a skill, update its understanding of itself, or — if there's genuinely nothing worth recording this window — just say "nothing to save." Letting it do nothing is itself an important design choice: don't force a learning outcome, lest you accumulate a pile of useless noise.
Distilled into two things
The output of reflection lands in two places.
One is skills. Each skill is a Markdown document with metadata — a frontmatter block recording the name, description, creation and update times, how many times it's been patched, and when it was last used — followed by the actual steps or key points:
---
name: "Weekly Report Export"
description: "Compile this week's data into the standard weekly-report format"
createdAt: "2025-01-01T00:00:00Z"
updatedAt: "2025-01-08T00:00:00Z"
patchCount: 2
lastUsedAt: "2025-01-09T10:00:00Z"
---
## Steps
1. ...
2. ...

Storing skills as files is a pragmatic choice: a person can read them directly and edit them directly; they are not locked away in some opaque database.
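Because a skill is just text, reading one takes very little code. The sketch below parses only the flat `key: value` frontmatter shown in the example; it is not a general YAML parser, and `parse_skill` is a hypothetical helper rather than anything Orkas is known to expose.

```python
import json


def parse_skill(text: str):
    """Split a skill file into (metadata dict, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with a frontmatter block")
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        raise ValueError("frontmatter block is not closed") from None

    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, raw = line.partition(":")
        # The example uses only quoted strings and bare integers,
        # both of which are valid JSON scalars.
        meta[key.strip()] = json.loads(raw.strip())
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

For anything beyond this flat shape (nested keys, unquoted strings, lists), a real YAML library would be the safer choice.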
The other is an understanding of itself. This part is more like a memo the agent writes to itself, in two pieces: one notes "what I'm good at and where I tend to trip up," the other notes "the plays I've worked out for this user and this domain." Both have a length cap that forces them to stay concise: the goal is a truer memo, not a longer one. At the start of the next conversation, this content is injected into the system prompt, so the agent walks in with "an understanding of itself."
Skills aren't write-only
If you only ever create skills, you accumulate a junkyard over time. So skills have a full lifecycle.
Beyond creation, the more common operation is actually patching: changing a small span in an existing skill rather than tearing it down and rewriting. Each patch bumps a counter and refreshes the update time. This lets a skill grow gradually with experience, instead of being rewritten wholesale at every turn.
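A patch in this sense is a small, targeted replacement plus two bookkeeping updates. Here is a minimal sketch, assuming the metadata is held as a dict with the frontmatter keys from the example; `patch_skill` and its "must match exactly once" rule are illustrative choices, not a documented behavior.

```python
from datetime import datetime, timezone


def patch_skill(meta: dict, body: str, old: str, new: str, now=None):
    """Replace one span in a skill body, bump patchCount, refresh updatedAt.

    `now` is assumed to be a timezone-aware UTC datetime.
    """
    # Assumption: refuse ambiguous patches instead of guessing which span to change.
    if body.count(old) != 1:
        raise ValueError("patch target must match exactly once")
    now = now or datetime.now(timezone.utc)

    new_meta = dict(meta)  # leave the caller's metadata untouched
    new_meta["patchCount"] = meta.get("patchCount", 0) + 1
    new_meta["updatedAt"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return new_meta, body.replace(old, new, 1)
```

Requiring a unique match keeps a patch from silently rewriting a different step than the one reflection had in mind.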
There's a ceiling on count, too. The total number of skills is capped (say, 200); once full, adding a new one evicts an old one via LRU (least-recently-used) to make room. Eviction has a preference: kick out the ones never used since creation first — a skill that's never been read was probably never distilled right in the first place, and is better off making way.
Every time the agent reads a skill, its "last used time" refreshes. This timestamp both feeds LRU's eviction decision and lets the local mechanism tell which skills are genuinely in use and which are just taking up space.
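The eviction preference can be expressed as a two-tier choice: never-used skills first, then plain LRU. The sketch below assumes skills are dicts carrying the `createdAt` and `lastUsedAt` fields from the frontmatter, with `lastUsedAt` set to `None` until the first read; the cap of 200 comes from the text, while picking the oldest never-used skill first is an assumption.

```python
MAX_SKILLS = 200  # the example cap mentioned above


def pick_eviction(skills: list[dict]) -> dict:
    """Choose the skill to evict: never-used first, otherwise least recently used."""
    never_used = [s for s in skills if s["lastUsedAt"] is None]
    if never_used:
        # Assumption: among never-used skills, the oldest one goes first.
        return min(never_used, key=lambda s: s["createdAt"])
    # ISO-8601 UTC timestamps in one fixed format sort correctly as strings.
    return min(skills, key=lambda s: s["lastUsedAt"])


def add_skill(skills: list[dict], new_skill: dict, max_skills: int = MAX_SKILLS) -> list[dict]:
    """Return a new list with new_skill added, evicting to stay within the cap."""
    result = list(skills)
    while result and len(result) >= max_skills:
        result.remove(pick_eviction(result))
    result.append(new_skill)
    return result
```

Reading a skill would then simply set its `lastUsedAt` to the current time, which is the refresh described above.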
How you know whether a skill is actually useful
This is the step many "auto-learning" systems lazily skip: it learned something — but is it any good? Orkas turns this into a few metrics, locally. These metrics are computed for the on-machine evolution mechanism's own use — deciding which skill to revise or delete — and likewise never leave this machine.
The mechanism: at the start of each turn, the available skills appear in the system prompt's index, and each appearance is one "impression"; if the agent actually reads a skill that turn, that is one "invocation." From these counts, plus what happens after an invocation, come three metrics:
- Invocation rate = invocations / impressions. A skill that sits there day after day with no takers has a low invocation rate, meaning it's either useless or described so that no one can tell when to use it.
- Edit-after-hit rate = the share of times a skill was invoked but the user then edited the result by hand. High means what the skill produced isn't quite to the user's taste.
- Ineffective rate = the share of times a skill was invoked but the turn ended in a (non-transient) error. High suggests something may be wrong with the skill itself.
Here you see the shadow of that exemption again: when computing the ineffective rate, transient errors don't count, and neither do turns the user manually halted partway — you can't hang a black mark on a perfectly good skill over one network hiccup.
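The three rates reduce to counting over per-turn records for one skill. This is a sketch under assumptions: the `SkillTurn` record is invented, each record stands for one impression, and user-halted turns are dropped from both the numerator and the denominator of the ineffective rate, which is one reading of "don't count."

```python
from dataclasses import dataclass


@dataclass
class SkillTurn:
    invoked: bool               # the agent actually read the skill this turn
    user_edited: bool = False   # the user edited the result by hand afterwards
    real_error: bool = False    # the turn ended in a non-transient error
    user_halted: bool = False   # the user stopped the turn partway


def skill_metrics(turns: list[SkillTurn]) -> dict:
    """Compute the three rates for one skill; each turn is one impression."""
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    hits = [t for t in turns if t.invoked]
    # Halted turns are not judged at all; transient errors never set real_error.
    judged = [t for t in hits if not t.user_halted]
    return {
        "invocation_rate": ratio(len(hits), len(turns)),
        "edit_after_hit_rate": ratio(sum(t.user_edited for t in hits), len(hits)),
        "ineffective_rate": ratio(sum(t.real_error for t in judged), len(judged)),
    }
```

Returning 0.0 for a skill with no data is a simplification; a real implementation might prefer to report "not enough data" so that a brand-new skill is not mistaken for a perfect one.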
With these few numbers, skills go from "accumulating in a black box" to "something that can be evaluated and optimized." Which skill to revise or delete is no longer a gut call.
Closing the loop
Stringing the above together, one full cycle goes like this:
The agent works through real tasks, recording run data locally and marking signals in place as it goes. When the background reflection cycle comes due, it picks the agents that have new movement and have passed their cooldown, organizes each one's recent activity into a packet, and has the model review it against its current self-understanding — merging what should be merged, retiring what should be retired, distilling what should be distilled into new skills. The review's output becomes skills and self-understanding. On the next conversation, those skills go into the prompt index and the self-understanding goes into the system prompt, and the agent walks back in carrying what it learned last round. Then this round produces new metrics and signals, fed back to the start.
The loop rolls on, round after round. Not every round brings a dramatic leap, but the direction is one-way: toward understanding you better and repeating fewer of the same mistakes.
A few trade-offs worth naming
Looking back, a few decisions in this mechanism are key.
Introspection must be cheap. Observing itself uses zero-model-cost metrics; the genuinely expensive reflection is moved to the background, run infrequently, and gated by the dirty check first. Clamp down hard on the "expensive" part and the whole mechanism can actually run.
Better not to learn than to learn wrong. The transient-error exemption, allowing reflection to "save nothing," writing executable imperatives instead of vague descriptions — all point to the same judgment: for a self-improving system, learning in the wrong direction is far more dangerous than learning slowly.
What's learned must be visible, editable, and in your hands. Skills are plain-text files, self-understanding is a plain-text memo, skill effectiveness is checkable via metrics — and all these files sit on your own machine, not in the cloud. No black box anywhere; a human can open it and tweak it anytime.
Put brakes on learning. Count caps, LRU eviction, length limits — without these, "continuous learning" sooner or later becomes "continuous bloat." Forgetting, dropping, and pruning matter as much as remembering.
Wrapping up
Orkas's self-evolution is, at heart, adding a slow loop to the agent: the fast loop is the immediate response of each conversation; the slow loop is periodically looking back and distilling experience into something usable next time. The hard part isn't "making the model remember" — it's the easily-overlooked engineering judgments: how to tell which experiences are worth recording, how not to be thrown off by a chance failure, how to make what's learned genuinely executable, and how to prune it before it bloats.
Those judgments, taken together, turn "gets more useful the more you use it" from a marketing line into a mechanism that actually runs. An assistant that learns from you — and won't learn the wrong things — may be closer to what most people actually want than one that's merely smarter.