Opinionated Mental Models for Agentic Workflows
Tool-agnostic mental models for the agentic era. What to delegate, how to design feedback loops, and why human judgement scales LLM output.
Two years ago, right before Christmas Day, I wrote about using LLMs for software development to boost our shipping speed and capabilities. Fast forward to January 2026, and now it’s time for reflection (and hopefully crazier writing), especially about how much of what I wrote about AI-assisted software development has completely changed. Some of it, including my opinions, methods, and tools, is pretty irrelevant today. My opinion on how we integrate AI into our current “engineering” workflow has also changed a lot. The model capabilities and the AI tools available right now have changed a lot too, IN LESS THAN A YEAR.
As stated in my previous blog, I had been using cursor as my go-to IDE for AI-assisted programming last year. I also mentioned bolt, lovable, and v0 for prototyping web UIs. This year, I have completely changed how I write code and use AI compared to last year (oops, less than a year actually). Most of my workflow now lives in the terminal, even writing code itself (neovim btw).
I feel like what I wrote in previous years has aged like milk and some of it feels very wrong now, especially the AI-powered development part and, sigh.. the crypto AI agents part. I wrote about specific tools, specific prompts, and specific workflows or methods, and some of them are outdated (I won’t mention any, you may know which).
These agentic workflows, vibe coding, AI-assisted engineering, context engineering, or whatever it is called still feel early to me in terms of real, serious maturity, despite people starting to catch up to the capabilities of current SOTA models, while some argue it is complete bullshit. Some companies and enterprises have started to adapt to the current AI paradigm, giving their employees subscriptions to AI tools such as cursor, claude code, and, not the worthy one, copilot. Some people glaze too much over the latest models and tool capabilities (honorable mention: opus 4.5 with the claude code harness).
Based on my personal observation of the current state of AI-powered coding and workflows, we can separate the types of individuals who use AI into:
- Overhype crowds: basically overglazing and overhyping every time there is a new model release, like “this is AGI, it’s over, billions of programmers will be replaced in less than a year”. They do acknowledge the capabilities of AI, but it’s the kind of hyperbole that ends up as fearmongering to the masses and overstates things into what could be huge misinformation.
- Shiny new things: try every newest tool and model available, chronically online about the latest AI capabilities, keep switching tools over short periods. This is probably, somehow, myself unironically, as I changed tools a few times from cursor -> claude code -> codex -> back to claude code + opencode. This type can be divided again into the FOMO crowd, braindead following shiny new things, and those who use it to keep up with current LLM capabilities, stay true to themselves, and are able to filter what’s actually useful (higher signal, less noise).
- People who are only able to use specific tools due to company strictness. I have seen people (and friends of mine) who can only use copilot or internal tools because other AI tools are not allowed and are even blocked by the company’s internal network. Regarding the AI frontier labs, I don’t know whether the big direct competitors (OpenAI, Meta, Anthropic, and Google DeepMind) restrict employees to their own tools or allow competitors’ tools for work, but I assume they are only allowed to use their internal tools and their own models.
- Denial stage: people in denial about current LLM capabilities and what current SOTA models have become able to accomplish within less than a year, even anti-AI. Not really open about what’s going on. Some may still think that current LLMs generate absolutely bad code most of the time, hallucinate a lot, and can’t produce high-quality production code. Not embracing what’s really happening.
- Don’t know what is really going on.. at all. Worse than the denial stage. Maybe all they know is “hey, you wasted three gallons of water generating those prediction tokens”. They don’t know what LLMs have accomplished or what they are capable of, and are clueless about how to use an LLM other than asking it for help with simple assignments, whether school tasks or daily-life ones.
- Leveraging LLMs to the max: built their own workflow and system to enhance LLM capabilities as much as possible. This type can be divided further, because I have seen it not only for coding and programming but also for creative work (I’m not talking about the endless AI-slop video generators) and other domains (ex: SEO, marketing/GTM, finance, biz, etc.).
What I mean by “leveraging LLMs to the max” here is the ability to understand what an LLM is, its strengths and weaknesses, and then to pull as much of that strength and possibility out of current models as possible while reducing the weaknesses. We know that an LLM has limited context length, hence the term we use today, context engineering: the ability to create and manage context to get the best model output. This includes prompts, instructions, documents, external data, tool calls, and memory/history 1. It is hard to articulate this type clearly, but in short, these are people who know what they are doing.
Back to the topic: the tools are moving fast while the frontier labs keep competing with each other to release their best models as fast as possible, with all the unnecessary benchmarkmaxxing. The claude code of today may not be the same claude code a year from now. Imagine opus 5, gemini pro 4, and gpt-6. Codex, kilocode, cursor, whatever comes next, I believe the interface, how we interact with it (UX & DX), will change, and the capabilities will expand.
This piece is dedicated to what I think (and hope) won’t change, completely tool-agnostic: the underlying mental models and the thinking process for interacting with the future machine god.
- what to delegate, what not to delegate
- human feedback loop design
- context structure
- when you intervene vs when you just let it run --dangerously-skip-permissions
- Human judgement scales LLM output — the more domain expertise and taste you have, the better code the model produces
- Delegation is a skill — knowing what to delegate, how much context to transfer, and when to take back control
- Feedback loops as infrastructure — design tight, automated verification cycles (ex: ralph loop pattern) so the agent self-corrects
- Verification > Trust — type systems, tests, linters are your safety net; never trust agent output blindly
- Declarative systems win — reproducible, sandboxed environments (like nix) are ideal agentic playgrounds
2025 Wrapped, Enter the TUI-morphism

Let’s do a quick recap of 2025 and what has been happening, from my own perspective, in AI-assisted coding & programming.
It all started when claude code was released, around February 2025. Instead of building an AI-powered IDE or an IDE extension, the Anthropic team approached their new product differently: not a GUI app but a TUI-based one. You type claude in your terminal and it’s ready to assist with whatever you need through a terminal workflow. It can interact directly with your terminal environment via unix commands (ls, ll, cd, mkdir, grep, etc.). It may sound similar to aider, but it’s actually different, as the two tools take different approaches to the user/developer experience.
Then a bunch of TUI-based AI tools spawned. OpenAI made codex, another agentic coding tool similar to claude code, but written in rust (so it must be blazingly fast) and open-source, unlike claude code. Google and Alibaba made their own coding tools too: gemini-cli and qwen-code.
While those frontier labs made opinionated coding tools serving their own models, there are also open-source agentic tools similar to claude code that let you freely switch between models. Anomaly, the software company that made sst, developed their own open-source agentic tool called opencode this year (around May-June), so users can use and switch between all available models, whether through an existing claude plan, openrouter, or specific model providers for inference, while also being able to customize and configure the tool for their own workflow. Even better, since it’s an open-source project, we can contribute and open PRs directly to the repo. Charmbracelet, the “pretty CLI & TUI” software company, also made their own agentic coding tool similar to opencode. Both companies built their open-source tools on their own TUI frameworks: Charm used their existing bubbly, cutesy, glamorous golang TUI framework, bubble tea, while Anomaly developed their own TUI framework this year, opentui, written in a mix of typescript and zig.
CLI.. Why???
Maybe some of you wonder why most of these tools ship as CLI or TUI-based apps instead of high-RAM-usage electron GUI apps. I’ve seen some people even migrating back to cursor/vscode rather than using CLI-based apps like claude code. Personally, I love the CLI more, partly so I can larp as a performative hacker-looking guy when working in an office or a cafe, but mostly because of the flexibility, customization, and control I get compared to standalone well-designed apps. I can feel the contact between me and the computer more closely, as it strips away the abstraction layers disguised as “user-friendliness” when I want to interact with my machine. A CLI-based app is also (supposed to be) lighter on resources, way lower than most desktop and browser apps, especially electron wrappers and chrome forks.
Since CLI-based coding agents are editor-agnostic, you can plug them into whatever IDE or text editor you are using. You can open claude code inside cursor, vscode, or zed, as it can be spawned directly by typing “claude” in the terminal. You can also split your terminal into vertical or horizontal panes: one running claude code, the other running vim or emacs.
It feels nice that the LLM can interact with our terminal directly, using its tool calling against our UNIX system, things like grep and ls. Especially when our project directory has its dependencies installed, with tools like uv and bun, the LLM can interact and test things directly whenever it needs confirmation, or do some git operations.
The most obvious advantage of these coding agents being CLI-based rather than desktop apps is the ability to customize beyond your own computer. You can spin up a new VM, install claude code there, and customize your development environment further inside that VM. Then you can set up your own network/internal VPN, like tailscale, to connect your own devices (a mac and an iphone, for example) to the VM. Recently, I often SSH into my VM to spin up claude code and opencode as background workers I can leave coding tasks with. With the app being terminal-based, not only do we use fewer resources, we also strip away more interactivity abstraction layers for extra flexibility like this. You can also set up tmux or screen as a multiplexer to build your own async workflow, and enhance it further with a tiling window manager for workspace switching and window splitting between your terminal and other apps (on macOS I’m using aerospace, while on my desktop I’m using hyprland).
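To make the background-worker idea concrete, here is a minimal sketch of that workflow, assuming tailscale is already connected and claude code is installed on the VM; the hostname dev-vm and the session name agents are made-up placeholders.

```bash
# SSH into the VM over the tailnet (hostname "dev-vm" is hypothetical)
ssh dev-vm

# on the VM: start a detached tmux session as the background worker
tmux new-session -d -s agents -c ~/projects/my-app

# hand it a long-running task and walk away
tmux send-keys -t agents 'claude "work through TODO.md, run the tests after every change"' C-m

# later (from the laptop, or from a phone via termius): re-attach and check progress
tmux attach -t agents
```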
Changed The Way I Work
With these new agentic coding tools, I have changed the way I program and write code. Quick lore recap: previously, as a biomedical engineering undergraduate, I used vscode as my default IDE most of the time, plus jupyter notebooks to write and test code related to data & ML. Then in 2024, I switched my go-to editor from vscode to cursor, as I got addicted to cursor tab (the best tab-completion engine I’ve ever used so far) and it gave me a better experience for interacting with an LLM than copilot in vscode.
I have explored a lot of new tools, whether existing ones or tools released this year. Examples I discovered and used in 2025 that I would personally classify as game-changing: tailscale, nix, tmux, and the agentic tools I mentioned above (obviously).
Now, in 2026, I comfortably write most of my code in neovim. I also write my blog in neovim, with the localhost server of my astro website shown in my brave browser, split by aerospace configured through my nix dotfiles, on top of git worktrees separated into a main branch worktree, a feature implementation worktree, and a writing-draft worktree. I used to think that learning vim keymaps and configuring your own neovim for coding were productivity slop to justify your coding skill and performance, but it turns out I was wrong. I found it really useful; it makes my programming faster, including debugging, analyzing codebases, searching for specific files & code, and writing code. Using neovim also makes work that involves SSH-ing into servers easier, since I can just edit with vim on the server directly rather than nano, which I found annoying to begin with. Well, I don’t memorize all of the vim keymaps; what I use most are simple ones such as :wq, :q!, ggVG, /, :g, and :<line-number> (well, actually way more than that).
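For the curious, the worktree split I describe is plain git; here is a rough sketch (the repo, branch, and directory names are made up):

```bash
# the main checkout stays on the main branch
cd ~/code/blog

# separate working directories for a feature branch and a writing draft
git worktree add ../blog-feature feature/new-search             # existing branch
git worktree add ../blog-draft -b draft/agentic-mental-models   # new branch

# each directory is an independent checkout sharing the same .git history
git worktree list
```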
Somehow Nix is the Perfect DevOps Tool to Vibe With
This one is seriously a personal take; I don’t really care if you disagree with me, but I found nix to be the perfect programming language to vibecode with for managing packages, builds, and devops. The nature of nix, purely functional and declarative, feels perfect for prototyping, vibecoding, and agentic development. I have been exploring what nix and opus are capable of together, and it has been a really amazing experience. Not only do you write your own development environment as declarative configuration, there are also multiple ways to configure your agent’s environment. You can use nix to sandbox your claude, codex, or opencode with multiple design patterns while sticking to nix’s core philosophy.
As of today, I have been experimenting with building a customized nixos machine with the clan framework: a dedicated VM that lets an agent run wild inside a sandbox, with nix as one of the main pieces of code infrastructure here.
[embedded tweet from @GeoffreyHuntley]
I also wrote a blog post about learning nix that you can read here.
My Current Setup
As I changed the way I work, most of my CLI tools, especially the agentic ones, have moved to my own personal VM (derived from my baremetal server) as a workstation, which you can see on my about page. I have been thinking of building and setting up my own dev VM so I can customize my dev environment to the max, with a clear separation between my own mac and the actual code plumbing and testing environment. A tweet from @elliotarledge about having a separate rig as a workstation articulates well what I have been thinking of here:
[embedded tweet from @elliotarledge]
As the models and these agentic tools have become so good now, I’m also planning to build my own dev VM to be more agent-friendly, so claude can do some sysadmin work, commit and open PRs, open issues, do work in the background, etc. I’m experimenting with all the capabilities to fully maximize the “agent” as a worker machine, similar to a mining machine in factorio, for task delegation, maybe connected with cable-like nodes or a network for my soon-to-be-automated workflow. As someone who has played factorio and done some redstone engineering in minecraft, you know the end game is to automate your work as much as possible and maximize your efficiency.
The Foundational Shift
Let’s play a game of assumptions: we take away the joy of writing code manually, focus on the building part, and automate coding as much as possible. Coding is fully delegated to an agentic tool; we may call it “X”.
Now, assuming that writing code is fully automated and delegated to AI, how do we know we achieved what we wanted? We verify the output of the code written by the AI and its result.
Does it compile?
Is the output what we wanted to achieve?
Is the process from start to end right?
Does it return any bugs or errors? If yes, why? What’s the root cause?
Yes, we are not writing code anymore, but we still need to know WHAT IS GOING ON and VERIFY THE OUTPUT, and if possible, the entire process and the written code. We eliminate coding, yet we still have to understand the code itself. Understanding code means being able to read it. Therefore, the ability to read, understand, and differentiate the quality of code written by the model is important.
I call this human judgement: the ability to judge code written by the model. The more experienced (senior-level) you are and the better you write code as an engineer, the better we can assume your human judgement is. You know what’s going on and can quickly navigate to what went wrong in the code. As a consequence, we’ve made a paradox: the better you write code, the better you read it too.
So, is there a correlation (a symlink, if you will) between code-writing skill and code-reading skill?
Before I try to answer that simple question, a disclaimer: I was bad at writing code, and probably still am to this day, as I’m still learning how to write properly good code and best implementations, and I still can’t solve every easy leetcode challenge. But the more I learn and catch up on what I don’t know about programming, compute, networking, or whatever domain knowledge I need to get right, the better the code output written by the model gets. More knowledge of web development made my instructions, contexts, and prompts more clearly curated for the model. As a result, the code output is clearer and better than when I don’t really know what’s going on and can’t judge the best approach in a specific moment.

Here is a fully made-up, theoretical correlation graph between LLM output and the user “x” factor that I’m trying to model. The “x” factor includes the human (user) judgement, the human taste, the right contexts gathered by the user, the user prompt, and the user’s domain expertise.
Hence, my answer: more domain expertise, better ability to design the system and architecture, and better human judgement and taste -> better code written by the model (provided you can at least read code and know what’s going on). Maybe the model itself knows everything, but I believe the model output scales with the human’s intelligence, taste, and knowledge. The more intelligent the human and the deeper their domain expertise, the better the model output and the more it aligns with the human. I believe human judgement and taste matter most in the current situation and in the future (I don’t know until when). Okay, now back to the related term from before, once again: context engineering.
An important note: as future model capabilities and intelligence may grow extremely fast, my stupid hypothetical graph between the user “x” factor and LLM output quality might become irrelevant. Maybe in the future you can just prompt “pls make this app make no mistake” and it really does so without making any mistakes at all: production-grade, scalable, very secure, with the highest quality codebase you can imagine.
Delegation Skill
I believe that knowing what to delegate and what not to delegate might be the most important skill here. Delegation isn’t only about making the AI do stuff. It’s more about balance:
- Under-specified delegation → garbage output, wasted cycles
- Over-specified delegation → you might as well have done it yourself
- Wrong-task delegation → some things aren’t delegatable (yet)
Delegation derives from context engineering: knowing what to hand off and how much context to transfer to an LLM. Lastly, knowing when to take back control. Delegation here also means you do not delegate your thinking to the AI; you delegate the task and share the relevant knowledge, context, and instructions.

The Mental Models
The Sci-Fi Lens

If you have watched Iron Man (or any Marvel movie with Tony Stark in it) or the recent Superman movie with the character Mr. Terrific, you may be familiar with Jarvis, Tony Stark’s AI companion computer system, and the T-Spheres, the agentic AI spherical electronic accessories used by Mr. Terrific.
Both are similar: they are used and viewed as tools by the superheroes mentioned above, and the tools shown in the movies are similar to an LLM.
The t-spheres respond to Mr. Terrific’s mental and vocal commands. The spheres then do what he commands, such as self-propelled flight, creating holograms, manipulating other electronic devices, acting as bombs, etc. Jarvis gives Tony Stark information and whatever he needs on demand, based on his commands.
The keyword here is command, which we can model as the input to the tool itself, whether through text, voice, or programmatic means. We ask and instruct with natural language as a prompt to the model. The model then returns output based on the input tokens. Current LLMs are also able to act by calling functions and executing tools defined by the user or provided by the model itself.
To make it clearer, here is the idea as pseudocode tool definitions for Mr. Terrific’s t-spheres:
[
  {
    "name": "information_query",
    "description": "Query databases and provide information (JARVIS-like)",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "Information request"
        },
        "sources": {
          "type": "array",
          "items": {"type": "string"},
          "description": "Databases to query: [internet, criminal_db, satellite, local_sensors]"
        },
        "urgency": {
          "type": "string",
          "enum": ["immediate", "thorough"],
          "description": "Speed vs depth tradeoff"
        }
      },
      "required": ["query"]
    }
  },
  {
    "name": "hack_system",
    "description": "Infiltrate and manipulate electronic systems",
    "parameters": {
      "type": "object",
      "properties": {
        "target_system": {
          "type": "string",
          "description": "System to hack (door, camera, network, vehicle)"
        },
        "action": {
          "type": "string",
          "enum": ["bypass", "disable", "take_control", "extract_data", "inject_payload"],
          "description": "Hacking action to perform"
        },
        "stealth_mode": {
          "type": "boolean",
          "description": "Whether to hide intrusion traces"
        }
      },
      "required": ["target_system", "action"]
    }
  }
]
Above, we defined the information_query and hack_system tools for the t-sphere. We can then inject these tool definitions into the t-sphere’s system prompt, as long as the “language model” inside it supports tool calls. Once again, this is purely fiction; we are just looking at what LLMs are really capable of through the eyes of sci-fi and their endless potential in the future.
Why are we using Iron Man and Mr. Terrific as examples here?
What both have in common is natural language as the primary interface. Both also use Jarvis and the t-spheres as tools rather than standalone all-powerful agents. The humans, Tony Stark and Michael Holt, make the decisions; the tools then execute, assist, or inform based on the human’s instruction/command.
Current open-source SOTA LLMs such as GLM 4.7, MiniMax 2.1, and Kimi K2 can be run locally, as long as you have big racks of compute to run them. Assume that in the future, open-source models will be much smaller, with the same or maybe slightly better performance than, say, GLM 4.7, runnable with less than 48GB of VRAM or maybe even on your phone, while context windows keep expanding and the voice interfaces for interacting with LLMs get better. There are wilder things you could do and achieve then. Even with current models, you can connect to your favorite agentic tools and vibecode from your phone, using your server/desktop as the medium, via termius and tailscale.
I believe it’s only a matter of time until there is decent IoT-like hardware with an LLM running locally inside it, able to execute actions by blending the LLM’s responses with function calling programmed into the device.
We are really close to having jarvis at home (I bet it’s less than a decade away). A prime example comes from an x oomfie of mine that I found quite creative: he leveraged his domain expertise in devops and used claude code with opus as his 24/7 on-call engineer/sysadmin as an experiment. You can read his substack here.
The Gaming Lens

Previously, I mentioned factorio and redstone engineering in minecraft. This has been my longer-term mental model for computer-related work since even before the LLM era: aim for less work, and focus on output quality and quantity instead.
Screw hustle culture, the 9-5, and working late into the night the way we used to; we should automate the boring parts, the repetitive parts, or the parts we don’t like as much as we can, because what we measure at the end of the day is the result of the work, the output. Why would we want to work longer if we can get the same quality of output in less time? Time is a more important currency than our work; we should prioritize our free time over unnecessary work that doesn’t really matter. Focus on the work that matters and on the output itself, then on the process to achieve it. I wrote about automating yourself, which influenced my thinking here.
Why are we using factorio and minecraft (redstone engineering) as examples here?
Let’s say the “coding” part of programming gets automated: fully generated by an LLM with a strong verification process. We can assume our task becomes more about managing that process, like a factorio factory. We design the automated software development system so the code generated by the LLM is high quality, and the verification and testing process is reliable enough to ensure production-grade code, catching as many bugs as possible (especially crucial ones) before deploying to prod and serving the program/app to end users.
[embedded tweet from @sergeykarayev]
In factorio, you first land on the alien planet as a human. You mine all the resources you need by hand with a pickaxe to build what you need, such as mining drills, reactors, steam engines, etc. Then, once you have built the foundational power production and resource-extracting machines, you start building the circuit network and the transport lines that connect each factory building and machine to scale your factory. The basic concept can be viewed as starting by writing a small amount of code, or just vibecoding the basic function, then gradually scaling it to make it more powerful, feature-rich, and able to serve more traffic.
Previously, we wrote code by hand, manually, making sure the code was good enough (high quality) and reliable, well-tested and thoroughly debugged to minimize errors and ensure stability. Then we found resources and tools that help automate the process of writing code: baremetal servers, GPUs, LLM APIs, etc. Now we use the existing tools and resources, or build our own, to make building software as easy as possible within our system, across coding, testing, feature design, and deployment.
The process of what we do in factorio is very similar to system design and architectural choices. We have to design and think systematically; if our factory’s architectural design is bad, there are consequences waiting for us: hard to maintain, really ugly and disorganized, hard to fix (aka debugging), etc. Thus a cascading effect is created.
In the early game of factorio, everything works easily. No need to think about the transport system; a spaghetti line is fine as long as it works. No consequences yet (for now).
Mid game, you want to scale your factory but suddenly hit limitations and bottlenecks. Adding more iron production breaks the copper flow, belts get tangled, and it becomes hard to trace the transport lines you designed. In the end, once your factorio empire is big enough (aka late game), you want material flow to be predictable, train networks well-built for logistics, and circuit networks giving you clear programmatic control and conditional logic. Hence, factorio is about designing systems that are easy to maintain, control, and scale without you intervening too much.
Similar to factorio, there are plenty of programmatic mechanics you can leverage in minecraft using redstone as the main ingredient: automatic water-flow farming and a harvesting cronjob for your crops; a base defense system that fights off zombies and skeletons without your intervention; automatic inventory sorting via programmed redstone connected to your chests; secret piston doors triggered by levers wired with redstone for transport efficiency (usually for long-distance travel such as cave exploring). The list goes on. All you need is to know how redstone works and to blend engineering with creativity.
We can map these video game examples onto our software engineering workflow with the help of current SOTA models: fully maximize the potential of LLMs without losing your creativity and thinking process. There are actually more video games I could list here, like terraria, but I think factorio and minecraft are enough, and fitting examples to share my idea of the mental models to work with in the present and future of AI (no idea what kind of new innovation will show up, tbh).
Fundamentals are More Important than Ever
Do you think an LLM being able to write all of the code means you can develop fully-functional, production-grade, complex software? Are we vibing without knowing what git, concurrency, time complexity (O(n), O(log n), etc.), caching, arrays, and basic data structures are? (okay, about time complexity: it’s not that important if we don’t need to think about performance, or if the only user is ourselves)
Are we able to make a secure web app by ourselves if we don’t even know how to store secrets such as API keys in .env? What if we git add . && git commit -m "lezgo" without adding .env, with our precious API keys in it, to .gitignore?
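In case the .env example sounds abstract, here is a minimal sketch of the habit I mean; the key name is a placeholder, the commands are standard git:

```bash
# keep secrets out of the repo before the first commit
echo 'API_KEY=replace-me' > .env
echo '.env' >> .gitignore

# verify git will actually ignore it
git check-ignore -v .env   # should print the matching .gitignore rule
git status --short         # .env must not show up as an untracked file

git add . && git commit -m "lezgo"   # now this no longer ships your keys
```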
Here is one of the lamest, most boring jokes, one I actually hate by now because I have seen it so many times, but it’s the only joke I can think of for this section:
[embedded tweet from @fuckcomputer]
Yup, that’s right: knowing the fundamentals of programming, being literally computer literate, knowing how to use git for version control, how APIs work, and fundamentally how computers and the internet work. Knowing what HTTP is, understanding and being familiar with data structures and algorithms, one of the core courses in computer science.
My point here is: yes, you can vibecode all you want, but please learn the fundamentals. Use the LLM to augment your knowledge and fill the knowledge gaps too.
You don’t want to ruin your viral consumer app by leaking users’ data, including users’ selfies and photo IDs, like Tea, a dating advice app for women that went viral on TikTok 2.
We still need to learn and not follow the LLM blindly, even though current models are already good enough for yolo mode. Remember what I said in the previous section: the more knowledgeable we are, the better the LLM output we can get. Even if it’s a completely made-up hypothesis of mine, I believe that learning (and strengthening) the fundamentals and accumulating knowledge will still benefit us in the long run, even as model intelligence keeps improving, especially for our thought process and how we use LLMs as tools later on.
Here are books that keep appearing in discussions about programming fundamentals:
- Structure and Interpretation of Computer Programs
- Computer Systems: A Programmer’s Perspective
- The Art of Computer Programming
- Designing Data-Intensive Applications
There is also the recommended reading list for developers from the co-founder of stackoverflow himself, Jeff Atwood, which you can check here.
I won’t limit “fundamentals” in this context to computer science and software engineering literacy. It also goes deep into whatever you are working with, such as the react framework, golang, or any framework or programming language you use. I’ll use my past (honestly, ongoing) skill-issue experience with react as an example:
> Around 2024-2025, I was working part-time at an edu startup, on authentication (especially user auth schema validation) and a new user onboarding flow and its modal. I hit a specific issue with react (I forget what it was), but what I remember is repeatedly asking cursor (claude 3.5 sonnet(?)) about the problem and its solution. Claude provided 4 solutions to the problem I encountered, and I gave up and decided to ask a friend who understands react instead. My friend proposed the solution I was looking for; I implemented it (with the help of cursor too) and it finally worked.
This is one example of how important fundamentals are, not limited to “programming” only, but covering the scope of whatever you’re working on. Of course, even now I run into issues where I have to search manually through stackoverflow, read the official docs, and ask my friends for help.
Practical Primitives
After redefining the thinking process and mental models for approaching and working with these token-prediction slot machines, let’s now talk about practical methods for working with these models that will hopefully still be relevant in the future, as we are in a transition era where AI disrupts the current tech industry (or maybe it’s a bubble, honestly idk man).
Context Engineering

Context engineering is broader than prompt engineering. Context engineering means we hand the model curated, relevant context, including prompts, specific instructions, specific approaches, and relevant data, within the model’s limited context length, since we haven’t solved the memory problem in current models. Using MCP might overload the context window and consume additional tokens 3. We want to hand the model context while keeping the flexibility to try everything 1.
MCPs, agent skills, AGENTS.md, and tool calling are some of the ways to feed an LLM a proper, curated amount of context: not too much, just enough to pull as much out of the model as possible by feeding it precisely the right data, approach, and instructions. Agent skills? Just markdown and scripts under the hood. RAG? Feeding the right amount of external data so an LLM’s output is more accurate, hallucinates less, and aligns with what we want.
There are multiple practical ways I can think of to do context engineering, starting from the simplest forms. For example:
- A well-made CLAUDE.md or AGENTS.md as the agent entrypoint (global and project scope); a minimal sketch follows this list
- Explicit constraints (ex: always use bun, not npm; don’t use type=Any, always define types)
- Dump the right, relevant amount of docs into docs/ then mention @docs/<relevant-docs> per feature implementation
- Break big complex tasks down into small subset tasks (modularization), no context overload. You may build the repo from a minimal/bare skeleton that works (kind of like an MVP), then scale it from there.
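As promised in the first item, here is a minimal, hypothetical project-scope CLAUDE.md, written from the shell just to keep the example self-contained; the constraints are the ones from the list above, not some official template:

```bash
# hypothetical minimal CLAUDE.md for a bun + typescript project
cat > CLAUDE.md <<'EOF'
# Project context
- Runtime: bun, never npm. Install with `bun install`, run tests with `bun test`.
- TypeScript strict mode: never use `any`, always define explicit types.
- Per-feature docs live in docs/; read the relevant @docs/<feature>.md before touching that feature.
- Before declaring a task done, run the typecheck and test scripts and report their output.
EOF
```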
You don’t really need claude skills, subagents, plugins, another MCP, a claude code/opencode wrapper, or any extra tool most of the time; a vanilla setup of claude code is fine. Adjust based on what you need, which is, most of the time, the context the agent needs, not gluing together and blindly trying every tool just because some people told you to:
> wow, this <devtool> is very amazing, it’s gonna be gamechanging bro <insert pointing soyjak meme here>
> - random ai influencer
Also, here is the claude code setup from the person who initiated claude code at Anthropic; it’s surprisingly vanilla (to me):
[embedded tweet from @bcherny]
The point of context engineering is to provide the highest possible signal-to-noise ratio to the LLM. Noise makes the context bloated and irrelevant (context poisoning), as the “memory” of an LLM is still not fully solved. RAG, search tools, spec/TDD, etc., all lead back to the context & memory of an agent.
Feedback Loops as Infrastructure
An iteration. A feedback loop for an agent, to make sure the final output meets our requirements. The LLM generated wrong logic? A wrong datatype? The unit tests and typechecks in our verification process catch this first, then return the verification result to the LLM to resolve the issue. We can model this like the learning iteration of building a model: training -> evaluation -> result -> feedback -> training again.
Here is a hypothetical anatomy of a feedback loop, our foundation for using it as the infrastructure of our agentic workflow:

- Action: what is being attempted? A specification of your software requirements, a feature spec, a PRD, or your detailed prompt/instructions for the LLM as a markdown file can all be classified as the action here.
- Outcome: literally the result of the action we specified, i.e. the code written by the LLM from our instructions and what it produces.
- Signal: what does the outcome tell us about the action we took? This is where we learn whether the code generated by the LLM is correct or not: a clear error message, our verification layer catching the bug/error, or, if our “signal” part isn’t good enough, at least knowing it didn’t work once we executed the code.
- Adjustment: a targeted fix based on the signal. The LLM can tell what went wrong from the signal output, such as a suggested fix from the verification layer, a typecheck error, error logs from the backend, code execution logs, or test results. If the adjustment part is designed badly, the LLM will randomly guess at what went wrong when the code errors out, never properly finding the root cause or making a targeted fix.
Assume I want to ship software as fast as possible while the code generated by the LLM stays correct and high quality enough: not some extremely sloppy, bloated piece of code with hard-coded secrets stored directly in it. To tackle this, we want to design the feedback loop with:
- instant feedback
- tight loop
- strict coding standards/styles
- automated testing(?)
The variables we can control to achieve those goals are loop speed and signal clarity.
Instead of: <le input spec and prompt.md here> -> agent writes code -> we review the code -> "claude, this is wrong, pls fix blabla" -> agent rewrites -> we review it again the next day -> <until the code works perfectly fine>;
we could instead go with something like this:
<le input spec/prompt.md> -> agent writes code -> tests run instantly through our verification layer -> "expected X, got Y instead at line 67" -> agent fixes immediately from the output -> tests pass
Clear and precise signals, a faster workflow, and targeted fixes: we don’t need to re-prompt the agent; let it consume the code execution logs/results directly as its memory for the next action iteration.
Comparing the two loops above, we can clearly see that the second loop is not only faster but also more automated: we remove ourselves from the loop compared to the first one. We are the bottleneck and the slowest part of the feedback loop (unironically lol). With the second loop design, we push in as many strict verification layers as possible, with clear signals from the verification layer itself fed to the agent for its next action iteration. We make the judgement call only at the start, in the first iteration, not in every cycle.
Recently, there’s a simple yet surprisingly effective loop design I can offer as an example here. Let me introduce the ralph loop, coined by Geoffrey Huntley last year; it went viral this year.
It’s literally just a bash loop 4,
while :; do cat PROMPT.md | claude --dangerously-skip-permissions ; done
The idea is that claude, or any coding agent, executes the same prompt over and over in a continuous loop until it meets the completion criteria. There was an official claude plugin for the ralph loop; since it was very buggy, I think it was removed from the official plugins by Anthropic staff, but you can still check the plugin’s deepwiki here.

Incidentally, this ralph loop design pattern is one of the great examples of the feedback loop as infrastructure. Each iteration sees the results of all previous iterations: it reads the current state of the file/code changes, sees what was done before, and builds incrementally on previous work, aka self-referential context accumulation.
Iteration 1: Initial implementation attempt → creates files
Iteration 2: Reviews own work → identifies issues → fixes
Iteration 3: Further refinements → handles edge cases
Iteration N: Detects completion criteria → terminates
| default approach | ralph loop approach |
|---|---|
| human reviews each step | loop IS the review mechanism |
| feedback requires intervention | feedback is automatic via filesystem state |
| progress requires prompting | progress is continuous until completion |
| infrastructure supports feedback | infrastructure is feedback |
I made a ralph loop script (well, actually generated by opus 4.5 xd) for my nixos VM with the clan framework, as a module here:
modules/ralph-script/
├── PROMPT.md.example
├── default.nix
└── ralph.sh
the ralph loop bash script (extended):
#!/usr/bin/env bash
set -euo pipefail

PROMPT_FILE="PROMPT.md"
MAX_ITERATIONS=0 # 0 = unlimited
DELAY_SECONDS=0
COMPLETION_PROMISE=""

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Print CLI usage (used by --help and unknown-option handling)
usage() {
  echo "Usage: $0 [--prompt FILE] [--max N] [--delay SECONDS] [--promise TEXT]"
}

# Parse command line arguments
while [[ $# -gt 0 ]]; do
  case $1 in
    --prompt)
      PROMPT_FILE="$2"
      shift 2
      ;;
    --max)
      MAX_ITERATIONS="$2"
      shift 2
      ;;
    --delay)
      DELAY_SECONDS="$2"
      shift 2
      ;;
    --promise)
      COMPLETION_PROMISE="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      echo "Unknown option: $1"
      usage
      exit 1
      ;;
  esac
done

# Allow environment variables to override if args not provided
PROMPT_FILE="${PROMPT_FILE:-PROMPT.md}"
MAX_ITERATIONS="${MAX_ITERATIONS:-0}"
DELAY_SECONDS="${DELAY_SECONDS:-0}"
COMPLETION_PROMISE="${COMPLETION_PROMISE:-}"

# Validate prompt file exists
if [[ ! -f "$PROMPT_FILE" ]]; then
  echo -e "${RED}Error: Prompt file '$PROMPT_FILE' not found${NC}"
  echo "Create a PROMPT.md file with your task, or specify one with --prompt"
  exit 1
fi

# Check if claude is available
if ! command -v claude &> /dev/null; then
  echo -e "${RED}Error: 'claude' command not found${NC}"
  echo "Please install Claude Code CLI first"
  exit 1
fi

# Trap CTRL+C for graceful shutdown
trap 'echo -e "\n${YELLOW}=== Ralph Loop interrupted ===${NC}"; exit 0' INT

# Print startup info
echo -e "${BLUE}=== Ralph Loop Starting ===${NC}"
echo -e "Prompt file: ${GREEN}$PROMPT_FILE${NC}"
echo -e "Max iterations: ${GREEN}${MAX_ITERATIONS:-unlimited}${NC}"
[[ $DELAY_SECONDS -gt 0 ]] && echo -e "Delay between iterations: ${GREEN}${DELAY_SECONDS}s${NC}"
[[ -n "$COMPLETION_PROMISE" ]] && echo -e "Completion promise: ${GREEN}$COMPLETION_PROMISE${NC}"
echo -e "${BLUE}===========================${NC}"
echo ""

# Main loop
iteration=0
while true; do
  iteration=$((iteration + 1)) # increment without tripping set -e

  echo -e "${BLUE}=== Ralph Iteration $iteration ===${NC}"
  echo -e "${YELLOW}$(date '+%Y-%m-%d %H:%M:%S')${NC}"
  echo ""

  # Run claude with the prompt
  # Using --dangerously-skip-permissions for fully autonomous operation
  # Using --print to get output for completion detection
  if [[ -n "$COMPLETION_PROMISE" ]]; then
    # Capture output for promise detection
    output=$(cat "$PROMPT_FILE" | claude --dangerously-skip-permissions --print 2>&1) || true
    echo "$output"

    # Check for completion promise
    if echo "$output" | grep -qF "$COMPLETION_PROMISE"; then
      echo ""
      echo -e "${GREEN}=== Completion promise detected! ===${NC}"
      echo -e "Total iterations: $iteration"
      exit 0
    fi
  else
    # No promise detection, just run
    cat "$PROMPT_FILE" | claude --dangerously-skip-permissions || true
  fi

  echo ""
  echo -e "${BLUE}=== Iteration $iteration complete ===${NC}"

  # Check max iterations
  if [[ $MAX_ITERATIONS -gt 0 ]] && [[ $iteration -ge $MAX_ITERATIONS ]]; then
    echo ""
    echo -e "${YELLOW}=== Max iterations ($MAX_ITERATIONS) reached ===${NC}"
    exit 0
  fi

  # Optional delay
  if [[ $DELAY_SECONDS -gt 0 ]]; then
    echo "Waiting ${DELAY_SECONDS}s before next iteration..."
    sleep "$DELAY_SECONDS"
  fi

  echo ""
done
Key points:
- --promise TEXT: completion detection via a magic string (e.g., COMPLETE)
- --max N: safety limit on iterations
- --delay N: rate limiting between iterations (not really needed, optional flag)
When you write a prompt, whether directly in the interface or as a markdown file, we only need two things:
- Task: what you want done
- Completion criteria: when done, output FINISH or DONE
An example:
## Your Task
Build a REST API for a todo application.
## Requirements
1. CRUD operations for todos (create, read, update, delete)
2. Input validation for all endpoints
3. Proper error handling
4. Unit tests with >80% coverage
5. API documentation in README
## Completion Criteria
When ALL of the following are true, output `<promise>COMPLETE</promise>`:
- All CRUD endpoints are implemented and working
- Input validation is in place
- All tests pass
- Coverage is >80%
- README has API documentation
You can also add extras such as:
## Iteration Strategy
Each iteration you should:
1. Check the current state of the project (run tests, check for errors)
2. Identify what needs to be done next
3. Implement one focused improvement
4. Verify your changes work
## Important
- Work incrementally - don't try to do everything at once
- Run tests frequently to catch issues early
- If stuck, re-read the requirements and try a different approach
- Output `<promise>COMPLETE</promise>` ONLY when ALL criteria are met
From the simple claude wrapper bash script we made, we can invoke the ralph loop with something like ./ralph.sh --promise "DONE" --max 50, where the completion signal is the LLM outputting <promise>DONE</promise> to end the iteration and declare the coding process finished, with 50 iterations as the limit.
There are also examples from awesomeclaude.ai based on the official claude plugin that has since been deleted:
# feature implementation
/ralph-loop:ralph-loop "Implement [FEATURE_NAME].
Requirements:
- [Requirement 1]
- [Requirement 2]
- [Requirement 3]
Success criteria:
- All requirements implemented
- Tests passing with >80% coverage
- No linter errors
- Documentation updated
Output <promise>COMPLETE</promise> when done." --max-iterations 30 --completion-promise "COMPLETE"
# TDD development
/ralph-loop:ralph-loop "Implement [FEATURE] using TDD.
Process:
1. Write failing test for next requirement
2. Implement minimal code to pass
3. Run tests
4. If failing, fix and retry
5. Refactor if needed
6. Repeat for all requirements
Requirements: [LIST]
Output <promise>DONE</promise> when all tests green." --max-iterations 50 --completion-promise "DONE"
There are more examples you can check out directly on awesomeclaude.ai as usage references.
As the ralph loop is the current meta design pattern for vibecoding, maybe a better design pattern will emerge in the future as model capabilities and intelligence improve. Nevertheless, my point stays intact: the ralph loop is an example of the feedback loop as a foundational design pattern for agentic workflows.
Now, back to the feedback loop itself. We want our code verification to be fast and strict, so it’s more predictable and straightforward to see whether our code works or not. What we can do:
- Type checkers and linters as a verification layer that you can put in your pre-commit or pre-push hook for instant feedback/signal (ex: nix fmt and nix flake check); see the sketch after this list.
- Fast infrastructure: as code generation becomes much faster than before, we also need a CI/CD pipeline that doesn’t take too long to run. We need fast builds that fail with clear errors, within seconds if possible.
- Small, scoped tasks: smaller actions lead to faster outcomes, thus quicker adjustments -> faster loop cycles. No one-shotting everything in a single session (modularization is the key point here).
- Tests as the feedback mechanism for a clear signal.
- Make failure cheap: build the sandbox, set up good version control (aka gitops), easy rollback. Run claude yolo mode in a sandbox (containerized).
- Use strict languages. I’ll write more about this in the next section.
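As referenced in the first item, here is a minimal sketch of a pre-commit hook acting as that verification layer; the specific commands (nix fmt, nix flake check, bun test) are assumptions, swap in whatever your project actually uses:

```bash
#!/usr/bin/env bash
# save as .git/hooks/pre-commit and make it executable (chmod +x)
# every commit, the agent's included, has to pass formatting, flake checks, and tests
set -euo pipefail

nix fmt                # format the tree with the flake's formatter
git diff --exit-code   # fail the commit if formatting changed anything
nix flake check        # run the checks declared in flake.nix
bun test               # behavioral signal from the test suite (swap for your runner)

echo "pre-commit verification passed"
```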
Verification > Trust
You’re absolutely right
That’s what it says. Trust no one, neither the agent nor yourself: verify the code, test the code, as I wrote in the previous section.

You have heard this one often. That’s right. You’re indeed absolutely right. You can say the most wrong statement anyone has ever heard and claude will still glaze you with “you’re absolutely right”. They (agents) are confident while frequently wrong. The correct way to tackle this is to verify by default.
Verification layers:
- Type systems (catch structural errors)
- Tests (catch behavioral errors/logical errors)
- Linters/formatters (catch style errors)
- Second opinion of agent (catch logic errors if tests are not enough, usually for complex logic)
- Ourselves (judgement, the true verification layer)
Verification layers help us save our review time for the code that actually needs proper judgement; we don’t need to waste time reviewing all of the code manually. There is one more verification layer that might matter in the future: language choice.
Language Choice as Verification Strategy
This section is highly opinionated; I might be very wrong here. It’s purely a hypothesis, especially regarding how future LLMs will be trained. Take it with a grain of salt..
An LLM is non-deterministic; it’s a slot machine that predicts the next tokens based on the input. It’s a prediction, so the output is probabilistic. Okay, now imagine we are using OOP, which varies in determinism: deep inheritance and mutable state everywhere. It might compound the chaotic effects.
So my hypothesis here is that the combination matters:
- non-deterministic generator + deterministic paradigm = chaos contained
- non-deterministic generator + non-deterministic paradigm = chaos compounds
| Paradigm | Determinism | Why |
|---|---|---|
| Pure FP + strict types | High | f(input) → output, always same. No side effects. Compiler enforces correctness. |
| Strict static typing | High | Multi-paradigm but compiler catches errors before runtime. Ownership, borrowing, strong type inference. (Rust, Go) |
| FP + dynamic types | Medium-high | Pure functions, but type errors surface at runtime |
| Imperative + strict types | Medium | Predictable control flow, but mutable state adds complexity |
| OOP + strict types | Medium-low | Inheritance, polymorphism, state — multiple ways to be subtly wrong |
| OOP + dynamic types | Low | State + inheritance + runtime typing = many degrees of freedom for bugs |
Alright, here we go, let’s start another programming language discourse: FP vs OOP, rust vs python, etc. You can disagree with me, I don’t really mind, as my current stance is framework/proglang-agnostic and I lean toward software correctness; as long as a programming method leans toward more correctness while being simple, pragmatic, and effective, I will switch, learn, and try to adapt to it.

Since an LLM returns non-deterministic output, why would you have it write in something non-deterministic again, rather than in a deterministic, declarative-leaning language? This is where nix fixes this.
At the very least, generating functional-programming-styled code is more predictable than generating OOP code, as we eliminate potential bugs by reducing the side effects that can occur (even though hidden bugs are still possible later on).
This is why I’m betting on learning and leaning more toward the declarative, functional programming way, and focusing on languages with fewer abstraction layers, more of a performance focus, and maybe less resource intensity (less bloat) too. I already abstract my way of coding through natural language to a probabilistic machine; why would I then instruct that probabilistic machine to write in a dynamic, unpredictable, more abstracted programming language?

We used to write in programming languages that are more approachable and more abstracted, combined with plenty of optimized, ready-to-use libraries and frameworks: python and javascript (also php with laravel). These languages were usually used by startups and new companies to develop and ship their products as fast as possible. But do current SOTA LLMs write high-quality C/C++ or rust?
First, take a look at the top programming languages on github from github octoverse 2024 5 and github octoverse 2025 6:

Most github codebases are dominated by python and typescript/javascript, so I assume LLMs are really great at writing dynamically typed languages (python and javascript), since there is more codebase variety, more data to train on, and more code examples to learn from, even though those codebases include both badly written code and best-practice implementations of the “correct” way of writing code. I also assume these AI frontier giants curate their coding datasets, especially anthropic, to make their LLMs good at coding.
There is also the survey conducted by stackoverflow, the quora of programming and one of the biggest programming dataset providers on the internet 7. I also believe (or maybe it’s quite obvious) that stackoverflow’s dataset is higher quality and more curated than github codebases, because the community treats the most-voted and verified answer as the de facto right answer to the related programming question.

From that, we can conclude that the biggest source of training data as of last year is javascript, followed by python and typescript.
Despite that, some say opus 4.5 is already extremely smart and good at rust.
[embedded tweets from @dioxuslabs, @cachecrab, @zeddotdev, and @ibab]
The tweets I embedded here include ones from dioxus, a popular app framework, and from zed itself (an IDE written in rust), plus one mentioning a founding engineer of helixdb, a popular database written in rust, and one from the co-founder of xAI himself, people I believe were pretty good at writing rust long before LLMs (I mean, they have credibility here).
Since you can generate code (at least a boilerplate) in languages with an initially steep learning curve, like rust, nix, and haskell, with the help of an LLM, I don’t think that in the future we need to default to python or javascript just to speed up development; we can reap these languages’ performance benefits instead, as Igor said in the tweet above.
> TypeScript grew by over 1 million contributors in 2025 (+66% YoY), driven by frameworks that scaffold projects in TypeScript by default and by AI-assisted development that benefits from stricter type systems.
Last year’s GitHub Octoverse report supports my thesis that strict verification matters, especially when code generation itself is non-deterministic, as type-safe javascript mogged python’s growth last year. We can see that despite javascript historically being dynamic, the trend is toward strictness.
And libraries like effect.ts are pushing the typescript ecosystem toward more strictness, adopting heavy FP patterns: immutability by default, composability, pure functions, typed errors, and explicit side-effect management.
> While Effect makes usage of Functional Programming principles and patterns internally, you can be proficient in Effect by simply using it as a smart Promise and forget that there is even a thing called Functional Programming
My thesis is that languages with strict, fast feedback loops will disproportionately win the agentic era (present to future).
Declarative Systems as Agentic Playgrounds
We want to sandbox our agent environments so agents can experiment and operate safely inside them. The ideal sandbox has explicit state, pure isolation, easy rollback, clear errors, and easy reproducibility, so you can run claude --dangerously-skip-permissions without overthinking what might happen once it does something stupid.
To build a declarative, sandboxed configuration for the agent, we can use something like docker, terraform, or kubernetes (for multi-node environments). However, guess what?
Nix fixes this
| Tool | Declarativeness | The catch |
|---|---|---|
| nix | pure | config → system is a pure function. same input = same output, always. |
| terraform | high | declarative HCL, but has a state file. state can drift from reality. |
| kubernetes | intent-based | you declare desired state, controllers reconcile. but debugging “why isn’t it converging?” is opaque. |
| docker compose | structural | declares service topology, but underlying images may not be reproducible. |
| dockerfile | mostly imperative | RUN apt-get update — order matters, results vary over time. layers are cached, not content-addressed. |
Ideally (and personally), to let an agent run as wild as possible in a dedicated environment, I want the "environment" to be versatile and flexible: we can run it under systemd, containerize it with docker/podman, or set it up on an orchestration system (like kubernetes), while it stays fully declarative and remains "infrastructure as code" for easy reproducibility and machine management. This is where nix shines for me.
With nix I can configure things declaratively and creatively, using several frameworks and tools, combined with other programming languages. Nix isn't only a functional programming language; it's also a purely functional package manager and system configurator. The core model of nix is:
nix configuration -> nix build -> /nix/store/<hash>-result
Same input, same output. The build output always lands in an isolated path under /nix/store, addressed by a hash of its inputs.
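To make that core model concrete, here is a minimal sketch (a hypothetical demo.nix, not part of my setup) of a derivation whose output path is a pure function of its inputs:

```nix
# demo.nix -- a hypothetical, minimal derivation: the resulting
# /nix/store/<hash>-agent-playground-demo path depends only on the
# nixpkgs revision, the name, and this build script
{ pkgs ? import <nixpkgs> { } }:

pkgs.runCommand "agent-playground-demo" { } ''
  mkdir -p $out
  echo "built from declared inputs only, no hidden state" > $out/README
''
```

Running nix-build demo.nix twice, or on another machine with the same nixpkgs pin, should produce exactly the same store path.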
| property | what it means | why agents love it |
|---|---|---|
| purity | no side effects during build. no network calls mid-build (unless explicitly declared). | reproducible. agent can retry infinitely with same result. |
| content-addressed store | every build artifact is hashed by its inputs. /nix/store/abc123-... | no “it worked yesterday.” either the hash matches or it doesn’t. |
| atomic operations | upgrades/rollbacks are instant. switch a symlink. | if the agent breaks something, rolling back is one command, not an afternoon of cleanup. |
| hermetic builds | builds happen in isolated sandboxes. no access to global state. | no “it worked on my machine.” build environment is explicit. |
| declarative system config | configuration.nix describes your entire system. | change config → rebuild → entire system matches declaration. |
| flakes | locked dependencies, reproducible entry points. | nix develop gives exact same shell everywhere. agent doesn’t deal with “wrong version” issues. |
Sandboxing in nix is very flexible: we can isolate whichever layers we choose, from the kernel, network, and filesystem down to write access and the application layer. Almost everything can be declared and configured in nix.
Build environment:
├── No network access (unless explicitly allowed)
├── No access to /home, /tmp, or anything outside declared inputs
├── Read-only access to dependencies in /nix/store
├── Write access only to build output directory
└── Isolated /etc, /usr, etc.
This means:
- the agent can't accidentally break your system during a build (unless the configuration/method itself is wrong)
- the agent can't rely on hidden state
- if it builds in the sandbox, it (should) build anywhere, a reproducibility guarantee (98%, i'm not that confident); a snippet for making the sandbox explicit follows below
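The build sandbox is on by default on nixos, but you can state it explicitly in your configuration. A minimal sketch (the option names are real; the commented escape hatch is just illustrative):

```nix
# a minimal sketch: be explicit that the nix daemon builds in a sandbox
{
  nix.settings = {
    sandbox = true;
    # if a build genuinely needs an extra host path, allow-list it here
    # "extra-sandbox-paths" = [ ];
  };
}
```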
There are multiple patterns, all with the same goal of building a customized sandbox on nix. For example (a minimal dev-shell sketch follows the list):
- nix shell (one-off experiments)
- dev shell (nix develop)
- nixos vm (nixified linux distro)
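As a minimal sketch of the dev-shell pattern, here is a hypothetical flake.nix (the pinned input and package list are placeholders, not my actual playground) that gives every machine, and every agent, the exact same shell via nix develop:

```nix
{
  description = "reproducible shell for an agent playground";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
      pkgs = nixpkgs.legacyPackages.${system};
    in
    {
      devShells.${system}.default = pkgs.mkShell {
        # the only tools the agent gets to see inside `nix develop`
        packages = with pkgs; [ git ripgrep nodejs_22 ];
      };
    };
}
```

The generated flake.lock pins nixpkgs, so "wrong version" issues mostly disappear for the agent.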
I have been experimenting with nixos vms to build customized agent playgrounds using clan since last year, and it's been fun so far. I can customize and vibecode a bunch of features without fear of breaking my machine, and let my agent operate wildly.
Okay, maybe this is getting too biased and too specific, since we're now explaining nix instead of the higher abstraction of how we interact and work with AI from the present into the future, which literally contradicts what I wrote earlier. Currently, I use nix as the devops and infra system for most of my workflow because it aligns with my mental models of working with AI agents. That might change in the future if a better alternative shows up for building environments and infrastructure for agent playgrounds. It's also my personal bet that the nix ecosystem will do really well because of the current state of AI-powered coding and development.
Environment as Interface
Now let's redefine how we interact and work with an LLM. If opus writes all of our code, and we use something like another claude or coderabbit to review our PRs and code, do we even need a laptop? If our workflow is now mostly reviewing and verifying code, do we really need an IDE at all?
Why would we need an IDE then? Why are we opening VSCode just to read and review code? Do we really need a code editor as the interface for our development workflow?
My Current Setup
Recently, I've been exploring and experimenting with my own nixos vm (using the clan framework), carved out of a baremetal server I bought last year, as an agent-first dev vm. I wondered whether I'd still be able to ship some code with only my phone in hand. It should come in handy on public transport or anywhere else, right??
The idea is simple (and unoriginal); I believe a bunch of people have already set up this workflow anyway.
my phone -> connect to tailscale vpn -> ssh agent@nixos-vm via termius -> run claude inside termius
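On the vm side of that flow, the relevant pieces boil down to a few nixos options. A minimal sketch (not my actual clan module; the authorized key is a placeholder):

```nix
{ ... }:
{
  # join the tailscale mesh so the phone can reach the vm over 100.x.x.x
  services.tailscale.enable = true;

  # key-only ssh for the agent user
  services.openssh = {
    enable = true;
    settings.PasswordAuthentication = false;
  };

  users.users.agent = {
    isNormalUser = true;
    openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAA... phone-key" ];
  };
}
```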
It's quite simple, but I want it to be more accessible, especially the UX part: why would I open a terminal and run ssh on my phone? So I tried to integrate claude code with telegram, since a telegram bot is easier to set up (via BotFather) than whatsapp or discord (personally).
Isn't there claude code in the claude mobile app?
I want something more customizable that fits my workflow and isn't limited to my own github; I might also make claude a personal sysadmin for my dev vm, so I decided to integrate it with telegram directly.
There are two ways of integrating claude code with telegram:
- the claude code cli command (ex: claude -p "pls implement auth")
- the agent sdk
I ended up using both of these methods anyway.
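For the cli-command route, the bot itself is just a long-running service on the vm. A minimal sketch of wiring it up as a nixos systemd unit (the paths, env file, and bot script here are hypothetical placeholders, not my actual module):

```nix
{ pkgs, ... }:
{
  systemd.services.claude-telegram-bot = {
    description = "telegram frontend that shells out to `claude -p ...`";
    wantedBy = [ "multi-user.target" ];
    after = [ "network-online.target" ];
    serviceConfig = {
      User = "agent";
      Restart = "on-failure";
      # BotFather token and chat id live in an out-of-store env file
      EnvironmentFile = "/var/lib/claude-telegram-bot/env";
      ExecStart = "${pkgs.nodejs_22}/bin/node /var/lib/claude-telegram-bot/bot.js";
    };
  };
}
```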
So, fast-forward: here is my current personal dev vm setup and its system visualization as a mermaid.js graph that I made by asking claude code, since I'm too lazy to draw it myself from scratch in excalidraw (I also asked claude to make a mermaid graph wrapper component for my blog right as I was writing this post):
flowchart TB
subgraph Phone["📱 iPHONE"]
Termius["🖥️ Termius<br/>SSH Client"]
TelegramApp["💬 Telegram<br/>Bot Interface"]
end
subgraph Network["🌐 MESH NETWORK"]
Tailscale["🔗 Tailscale VPN<br/>100.x.x.x<br/>P2P Encrypted"]
TelegramAPI["📡 Telegram API<br/>api.telegram.org"]
ZeroTier["🌍 ZeroTier VPN<br/>Backup Network<br/>Controller: greencloud"]
end
subgraph DalangDev["🖥️ DALANG-DEV (NixOS KVM VM)"]
subgraph Systemd["⚙️ SYSTEMD SERVICES"]
Bot["🤖 claude-telegram-bot<br/>TypeScript/Node.js<br/>/claude /spawn /sessions<br/>/switch /send /output /kill"]
Hooks["🪝 claude-hooks-setup<br/>settings.json generator"]
Notify["📢 claude-notify<br/>Telegram notifier"]
TailscaleD["🔐 tailscaled"]
ZeroTierD["🌍 zerotier-one"]
end
subgraph Execution["🚀 AGENT EXECUTION"]
Parallel["⚡ Parallel Claude Engine<br/>tmux + git worktrees<br/>Max 10 sessions<br/>/var/lib/claude-parallel"]
Ralph["🔄 Ralph Loop<br/>Autonomous iterations<br/>ralph --max 50 --promise OK<br/>Reads PROMPT.md"]
end
subgraph Agents["🧠 AI AGENT TOOLS"]
Claude["💜 Claude Code<br/>Opus 4.5 / Sonnet 4<br/>from llm-agents"]
OpenCode["🟢 OpenCode<br/>Gemini 3 Pro/Flash<br/>Antigravity plugin"]
Codex["🔵 Codex<br/>GPT-4 / o1<br/>OpenAI"]
end
subgraph Sandbox["🛡️ BUBBLEWRAP SANDBOX"]
RW["📝 READ-WRITE<br/>/home/agent/workspace<br/>/tmp /var/tmp"]
RO["🔒 READ-ONLY<br/>/ ~/.ssh ~/.gitconfig<br/>~/.claude ~/.config/opencode"]
end
subgraph Tools["🛠️ DEV TOOLS"]
NixVim["📝 NixVim<br/>LSP: nixd, pyright<br/>ts_ls, gopls, rust_analyzer"]
GitTools["🔀 Git Tools<br/>git, gh, jj, lazygit, tea"]
DevEnv["📦 Dev Env<br/>direnv, devenv, uv<br/>Python 3, nushell"]
end
subgraph Security["🔐 SECURITY LAYER"]
SSH["🔑 SSH Hardening<br/>Key-only, curve25519<br/>chacha20-poly1305"]
Firewall["🧱 Firewall<br/>TCP 22, UDP 9993"]
Kernel["🐧 Kernel Hardening<br/>SYN cookies, rp_filter"]
end
end
subgraph Remotes["☁️ GIT REMOTES"]
GitHub["🐙 GitHub<br/>0xrsydn"]
Forgejo["🔧 Forgejo<br/>git.rasyidanaf.com<br/>96.9.212.43"]
end
Phone --> Network
Termius --> Tailscale
TelegramApp --> TelegramAPI
Tailscale --> DalangDev
TelegramAPI --> Bot
ZeroTier --> DalangDev
Bot --> Parallel
Bot --> Notify
Parallel --> Agents
Ralph --> Agents
Agents --> Sandbox
Hooks --> Notify
Sandbox --> Tools
Tools --> Security
GitTools --> Remotes
The claude telegram bot flow:
sequenceDiagram
participant iPhone as 📱 iPhone Telegram
participant API as 📡 Telegram API
participant Bot as 🤖 claude-telegram-bot
participant Tmux as ⚡ tmux session
participant Claude as 💜 Claude Code
participant Hooks as 🪝 claude-hooks
participant Notify as 📢 claude-notify
iPhone->>API: /claude "fix the auth bug"
API->>Bot: Webhook message
Bot->>Tmux: Spawn new session
Tmux->>Claude: Execute in sandbox
Claude->>Claude: Read files, write code
Claude->>Claude: Run tests, commit
Claude->>Hooks: SessionEnd trigger
Hooks->>Notify: Send notification
Notify->>API: POST message
API->>iPhone: "✅ Task complete!"
Note over iPhone,Claude: Parallel Sessions
iPhone->>API: /spawn feature-branch
API->>Bot: Create new worktree
Bot->>Tmux: New tmux window
iPhone->>API: /sessions
API->>Bot: List active
Bot->>API: "Session 1: main, Session 2: feature-branch"
API->>iPhone: Show sessions
iPhone->>API: /switch 2
Bot->>Tmux: Attach to session 2
SSH access via termius
flowchart LR
subgraph Phone["📱 iPhone"]
Termius["🖥️ Termius"]
end
subgraph Option1["Option 1: ProxyJump"]
MacBook["💻 MacBook<br/>Jump Host"]
end
subgraph Option2["Option 2: Direct Tailscale SSH"]
TailscaleSSH["🔗 Tailscale SSH<br/>--ssh mode"]
end
subgraph Target["🖥️ dalang-dev"]
VM["NixOS VM<br/>root@nixos-dev"]
end
Termius -->|"Tailscale VPN<br/>100.x.x.x"| MacBook
MacBook -->|"ProxyJump<br/>SSH"| VM
Termius -->|"Direct<br/>100.x.x.x"| TailscaleSSH
TailscaleSSH -->|"Identity Auth<br/>No password"| VM
style Option2 fill:#2d5a27,stroke:#4ade80
Sandbox security model
flowchart TB
subgraph Outside["🌍 OUTSIDE SANDBOX"]
Agent["👤 agent user"]
Commands["$ sandbox-claude<br/>$ sandbox-opencode<br/>$ sandbox-codex"]
end
subgraph Bubblewrap["🛡️ BUBBLEWRAP NAMESPACE"]
subgraph ReadWrite["📝 READ-WRITE (tmpfs)"]
Workspace["/home/agent/workspace<br/>Code lives here"]
Tmp["/tmp<br/>/var/tmp<br/>/run/user"]
end
subgraph ReadOnly["🔒 READ-ONLY BINDS"]
Root["/ (root filesystem)"]
SSH["~/.ssh<br/>Git credentials"]
GitConfig["~/.gitconfig"]
ClaudeConfig["~/.claude<br/>API keys"]
OpenCodeConfig["~/.config/opencode"]
end
subgraph Allowed["✅ ALLOWED"]
Network["🌐 Network Access<br/>git push/pull<br/>API calls"]
end
subgraph Blocked["❌ BLOCKED"]
Destructive["rm -rf /<br/>System modifications<br/>Write outside workspace"]
end
end
Agent --> Commands
Commands --> Bubblewrap
ReadWrite --> Allowed
ReadOnly --> Allowed
style Blocked fill:#7f1d1d,stroke:#ef4444
style Allowed fill:#14532d,stroke:#22c55e
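For the curious, the sandbox boundary in that diagram can be expressed as a thin wrapper around bubblewrap. A minimal sketch (illustrative paths and naming, not my exact module) of a sandbox-claude style command packaged in nix:

```nix
{ pkgs, ... }:
{
  environment.systemPackages = [
    (pkgs.writeShellApplication {
      name = "sandbox-claude";
      runtimeInputs = [ pkgs.bubblewrap ];
      text = ''
        # read-only root, writable workspace and /tmp, network kept for git push/pull and API calls
        exec bwrap \
          --ro-bind / / \
          --dev /dev \
          --proc /proc \
          --tmpfs /tmp \
          --bind "$HOME/workspace" "$HOME/workspace" \
          --unshare-all \
          --share-net \
          claude "$@"
      '';
    })
  ];
}
```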
And last.. clan module structure diagram..
flowchart TB
subgraph Clan["🏠 CLAN FRAMEWORK"]
FlakeNix["flake.nix<br/>Entrypoint"]
ClanNix["clan.nix<br/>Inventory & Instances"]
end
subgraph Machine["🖥️ machines/dalang-dev/"]
Config["configuration.nix"]
Disko["disko.nix<br/>Disk layout"]
Facter["facter.json<br/>Hardware facts"]
end
subgraph Modules["📦 modules/"]
AITools["ai-tools.nix<br/>Claude, OpenCode, Codex"]
Sandbox["agent-sandbox.nix<br/>Bubblewrap isolation"]
AgentGit["agent-git.nix<br/>Git & SSH config"]
Hooks["claude-hooks.nix<br/>Event hooks"]
Notify["claude-notify.nix<br/>Telegram alerts"]
Parallel["parallel-claude/<br/>Multi-session support"]
RalphScript["ralph-script/<br/>Autonomous loop"]
NixVimMod["nixvim.nix<br/>Neovim + LSP"]
Dev["dev/<br/>git-tools, tmux, devenv"]
end
subgraph Services["🔧 services/"]
TelegramBot["claude-telegram-bot/<br/>Telegram Bot Service"]
end
FlakeNix --> ClanNix
ClanNix --> Machine
ClanNix --> Services
Config --> Modules
Config --> AITools
Config --> Sandbox
Config --> AgentGit
Config --> Hooks
Config --> Notify
Config --> Parallel
Config --> RalphScript
Config --> NixVimMod
Config --> Dev
This is my own personal setup; I'm not going to claim it's best practice or anything, but at the end of the day I want to redefine my workflow and make it simpler and more effective.
My Termius + tailscale + VM setup..
dogfiles@0xrsydn
And a few showcases of my telegram bot utilizing both the claude cli and the agent sdk..
dogfiles@0xrsydn
dogfiles@0xrsydn
The agent doesn't need your monitor, your keyboard, or your physical presence; it doesn't even need your laptop to be open, since we can control it remotely.
What the agents need:
- a shell
- access to files (ssh, ftp, etc)
- network
- a computer, physical or cloud (ram, cpu, storage/ssd)
- a gpu, soon
- an llm api key / claude code subscription
We do the work that really matters: decision making, giving direction, and the review/human judgement time.
The Mental Model Shift
Hence, we can shift our mental model from operator to supervisor when we fully utilize an LLM as our codemonkey.

We don't need to watch the code being written. We just need to be present for decision making, code review, and taste judgement, acting as the human in the human-in-the-loop process (though some methods like the ralph loop may not need human intervention at all).
From this perspective, we can derive environment design principles for our setup (a small sketch follows the list):
- headless first. If we need a GUI, like opening an IDE, it's not agent-friendly.
- persistent sessions. We want to keep the agent running as long as possible, especially something like codex running in the background for 10 hours alone. A disconnect shouldn't kill the work; use tmux, screen, or zellij to keep the session alive.
- separation of concerns. The VM where the work happens is stateful, while our devices are stateless clients, just windows into the work.
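For the persistent-sessions principle, a minimal sketch of baking the multiplexers into the vm config (plain nixos options; the tmux setting is just an example, not my full config):

```nix
{ pkgs, ... }:
{
  programs.tmux = {
    enable = true;
    # bigger scrollback so a long agent run can be reviewed after reattaching
    extraConfig = ''
      set -g history-limit 100000
    '';
  };

  # zellij as an alternative multiplexer
  environment.systemPackages = [ pkgs.zellij ];
}
```

A detached tmux session keeps the agent's process alive across ssh disconnects by default, so that part needs no extra config.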
Recalling our sci-fi perspective, with jarvis from iron man and the t-spheres of mr. terrific as our references, in the future we could probably drive the interaction between human and machine through natural-language voice. There are emerging apps focused on turning speech into polished writing, like WisprFlow, which is the closest thing I can imagine to mimicking the jarvis-tony stark level of interaction.
I haven't tried it, but for mac users there are local options like whisper.cpp to run openai's whisper locally on your mac. Or..
🎭@deepfates
Knowing When to Take the Wheel
Even though I've mostly written about delegation and fully-AI feedback loops here, not everything should be delegated. Knowing when to delegate and when to take over is just as important.
Delegate when:
- task is well-specified
- verification is cheap or almost fully-automated
- time is the bottleneck
- failure is recoverable
- the task is safe enough to be handled by an LLM
Take over when:
- we are debugging the agent more than the problem
- the context is too complex to transfer efficiently
- we could finish the task faster by doing it ourselves
- the task needs critical human review (ex: handling a prod database)
- the task requires judgement we can't specify
Closing Statement
Unlike most of the previous blogs I've written, this one emphasizes the thinking process and our approach, as humans, to interacting and working with an LLM. What I wrote here might be outdated in the future as the technology of this machine god evolves beyond our current understanding.
We've heard "AI is going to replace the JOBS", "AI will automate most aspects of our lives", and the rest of it for years. Despite all of that, what truly matters right now is:
- our decision-making ability, the human judgement
- ability to adapt to new situations (obviously)
- communicate effectively with machines (context engineering??)
- ability to collaborate with other humans
- creativity and artistic taste
- the ability to act and actually do the thing lol, people call it "high agency"

I don't know what's coming next. I might be very, very wrong again, like in previous blogs, but what I can do is adapt, like mahoraga adapting to every incoming sorcerer's attack (sorry for inserting a cringe jjk panel joke here).

Footnotes
- Tea dating advice app confirms hack, says 72K images, including selfies, accessed ↩
- Code execution with MCP: Building more efficient agents, Anthropic ↩
- Octoverse: AI leads Python to top language as the number of global developers surges ↩
- Octoverse: A new developer joins GitHub every second as AI leads TypeScript to #1 ↩