Every serious agent demo eventually hits the same wall.
You close the window.
The agent that looked brilliant five minutes ago wakes up with no idea what happened. It does not know which file it changed. It does not know which email was sent and which one was only drafted. It does not know whether you approved the last step or whether it merely proposed it. It remembers the conversation, maybe, if the platform kept enough chat history around.
But it does not remember the work.
That distinction matters.
A bigger context window does not fix this. It only gives the agent a larger pile of text to rummage through. Useful memory is not “more previous messages.” Useful memory is a work log.
What happened. When. By whom. With which file. After which approval. With which outcome. And where the proof lives.
Context is a pile
Context windows are useful. I am not pretending otherwise.
A large window lets an agent hold more of the current problem in its head. It can read more files. It can keep more of the conversation visible. It can avoid some of the dumb mistakes that happen when a model forgets the thing you said six turns ago.
Good.
But a context window is still a pile.
It mixes plans, guesses, commands, partial outputs, stale assumptions, old requirements, user corrections, and actual decisions into one long transcript. The model has to infer what matters every time. Sometimes it gets that right. Sometimes it confuses a suggestion with an action. Sometimes it treats a draft like a sent email. Sometimes it remembers the old plan more strongly than the fix you gave it ten minutes later.
That is fine for brainstorming.
It is not fine for operations.
If an agent is allowed to touch production code, send client communication, update a CV, apply for a contract, edit infrastructure, or publish content, “I think we discussed that earlier” is not good enough.
Memory is a ledger
A work log is boring on purpose.
It says:
- drafted, not sent
- sent, message id recorded
- attached this file, not that one
- follow up on Tuesday
- rate confirmed at 70 EUR/hour
- user approved before send
- reply preserved Gmail thread headers
- build passed before push
- deploy URL checked after publish
This is the stuff that makes an agent safe to use twice.
Not impressive. Useful.
The useful agent should be able to reconstruct its own work without guessing. It should not have to “feel” that something happened. It should have a record.
That is what separates an assistant from an operator.
An assistant can produce text. An operator can tell you what it did, what it did not do, what still needs approval, and what happens next.
The dangerous middle state
The most dangerous agent state is not failure.
Failure is visible. The command crashed. The test failed. The API returned 401. The browser did not load. Annoying, but at least the state is clear.
The dangerous state is almost done.
The email is drafted but not sent. The CV exists as PDF but not DOCX. The prospect folder was created but the follow-up date was not recorded. The code was patched but the test result was not saved. The plan says “done” but the repo was never pushed. The blog post exists locally but the image path points to a file that was never committed.
Humans lose work in this state too. Agents amplify it because they speak with confidence even when their state is incomplete.
A work log forces the system to distinguish between intention and execution.
There is a huge difference between these two lines:
Prepared reply to recruiter.
and:
Sent Gmail reply in thread 19e2ae0df83ed358, attached Sebastian_Schkudlara_CV_AI_Agent_Builder_Umicore_2026.docx, saved office record at prospects/groupmeerab-ai-agent-builder/communications/OUT_2026-05-15_word-format-cv.md, follow-up set to 2026-05-19.
The first one is vibes.
The second one is state.
The agent should be able to answer one question
“What exactly did you do?”
Not generally. Exactly.
A useful answer sounds like this:
I replied to the latest inbound message, preserved the existing Gmail thread, attached the Word-format CV, saved the sent record locally, updated the prospect status, and logged the follow-up date.
Even better, it includes the file paths and IDs.
The point is not bureaucracy. The point is that the next session can continue without a séance.
This becomes more important as agents start working across tools. A single task may touch Gmail, a local CV folder, a project CRM, a blog repo, a deployment pipeline, and a memory system. If the only record is “the chat probably knows,” you do not have automation. You have a fragile performance.
Local memory matters because memory needs inspection
I do not want my agent memory trapped inside a SaaS chat interface.
I want to grep it. Diff it. Back it up. Repair it. Audit it. Move it between tools. Use it from Codex, Claude, Gemini, OpenCode, or whatever terminal agent shows up next week.
That is why my own setup treats memory as infrastructure: journals, pages, search, local files, message IDs, paths, and operational state.
The agent does not remember because it sounds smart.
It remembers because the work left traces.
A local journal entry is not glamorous. Neither is a prospect meta.yaml, a sent-mail record, or a build log. But those are the parts that keep the next action grounded in reality.
The agent can search the Brain. It can inspect the repo. It can read the sent record. It can verify the attachment exists. It can compare the latest email with the previous one.
That is how you get continuity.
Not by praying the model has enough tokens left.
Work logs change how you design agents
Once you accept that memory is a ledger, agent design changes.
You stop asking, “How do I fit more context into the prompt?”
You start asking better questions:
- What should be recorded after every meaningful action?
- Which states need names?
- Which actions require explicit approval?
- Which file paths are the source of truth?
- Which outputs need verification before they count as done?
- What must the next session know without reading the whole chat?
That last question is the real test.
If tomorrow’s agent cannot continue the work without asking you to reconstruct yesterday’s decisions, yesterday’s agent did not finish the job.
The next agent benchmark is boring
The next useful agents will not win because they can swallow a million tokens.
They will win because they can keep a clean ledger.
They will know what changed. They will know what still needs approval. They will know the difference between drafted and sent. They will know which file was attached. They will know whether the build passed before the push.
That sounds less magical than an infinite context window.
Good.
Magic is not the goal. Reliability is.
The benchmark I care about is simple:
Can the agent come back tomorrow and tell me exactly where we left off?
If yes, it has memory.
If no, it has a bigger pile.
Sebastian Schkudlara