Most AI outputs aren't unpredictable; they're under-constrained. The model isn't the source of the chaos. The execution environment around it is. When a Claude run produces something you didn't expect, the first instinct is to blame the prompt, then the model, then the concept of large language models in general. The actual culprit is almost always upstream: vague intent, missing scope, no record of what changed last week. Reliability is an engineering property of the surrounding system. It isn't something the model brings to the table.
Where unpredictability actually comes from
There are three failure modes I see almost every time someone calls an AI output "wrong."
First, vague intent. A prompt that says "summarise this report" leaves four decisions to the model: what counts as a summary, what to drop, what tone to write in, how long to make it. That isn't the model being unreliable. That's four free variables you didn't specify. A prompt that says "summarise in three bullets, fifteen words each, neutral tone, lead with the financial impact" gives you a tight contract. Same model, completely different output distribution.
Second, missing scope guards. The agent has no rule about which files it can touch, what surfaces it can change, what it should leave alone. So it changes a recently-edited file you wanted protected, or it edits ten files when you wanted three. Constraints are how you tell an agent the difference between "this is in your job" and "leave that alone." Without them, you are trusting the model to read your mind. Models cannot read minds.
Third, untracked drift. Someone edited the prompt last Tuesday in a UI somewhere. There is no diff, no git blame, no rollback. When the agent starts producing different results on Friday, you have no way to attribute the change to a specific edit. You are debugging a moving target. This is the failure mode that scales worst: every undocumented edit compounds the next one.
None of these are model problems. All three are environment problems. Fix the environment and most "AI unreliability" disappears.
Treating AI infrastructure like code
The discipline that makes software systems reliable is well-rehearsed. Code lives in a repo. Changes go through pull requests. Reviews catch drift. CI runs on every commit. Rollbacks are one command away. Nobody calls these things "soft engineering."
Apply the same rules to your AI stack and the same gains follow.
Prompts go in files. Files live in a repo. Edits are commits with author and date. Reviewers check the diff before merge. When something breaks, git blame tells you when the rule changed and who changed it. A prompt without a commit history is the same kind of liability as production code without version control, except the failure mode is harder to spot because the model doesn't crash, it just answers differently.
Configurations follow the same rule. An agent's system prompt, its tool list, the schema it expects from a tool, the MCP server connections it relies on: all configuration. All belong next to the application code that depends on them. The MCP specification exists precisely because plugging models into tools should be a contract you can read, not a verbal handshake.
Skills, agents, and markdown prompts aren't "soft" artifacts. They are the execution environment.
The point isn't to make AI development more bureaucratic. The point is that bureaucracy and reliability are the same thing: a record of what you decided, why, and when. Most AI projects skip the record because the early demos work fine without it. Then six months in, nobody remembers what changed, and the team is doing forensics on prompts they wrote in February.
Nous Research and the alternative stack
For an existence proof that constraint discipline scales, look at Nous Research.
Their Hermes Agent is an agent runtime (a Python program with a documented loop, a tool-execution layer, and a configuration surface) that you can install on an Umbrel home server in fifteen minutes. The repo is NousResearch/hermes-agent. The whole thing is open and inspectable. There is no proprietary AI magic hiding behind a SaaS dashboard. The agent does what its code does, and you can read the code.
That's one piece. Nous's broader stack is a working argument for the same thesis at much larger scales.
The Hermes 3 and Hermes 4 model families are open-weights, frontier-adjacent quality, fine-tuned on top of Llama, Qwen, and other open foundation models. The naming overlap can be confusing ("Hermes" is both a runtime and a model family) but the runtime is model-agnostic. It can run any model that speaks the standard API contract.
DisTrO, short for Distributed Training Over-the-Internet, is their research line on training large models across geographically distributed, consumer-grade nodes. Training, the most centralisation-prone step in the whole AI pipeline, becomes a protocol that any contributor can implement. Constraints encoded as specification, not as a closed system.
Psyche Network is the production version of that protocol: a decentralised pre-training network where contributors run training nodes. The protocol is the constraint. Anyone who follows the protocol participates. Nobody owns the result.
Atropos is their reinforcement-learning environments framework. Batch trajectory generation. Trajectory compression for training the next generation of tool-calling models. RL environments specified explicitly, repeatable, peer-reviewable.
These aren't position papers. They are shipped infrastructure. And the design pattern is the same one this post argues for: when the constraints live in code, in a repo, with explicit specifications, the work scales from a single hobby box up to a distributed training fleet without changing the underlying engineering discipline.
In the centralisation-versus-alternatives map of AI politics, Nous sits firmly on the alternatives side. Their work is credible not because of where they sit politically (credibility doesn't come from a stance) but because they ship infrastructure other people can verify and build on. The lesson for any business deploying AI: pick your side by what you commit to your repo, not by what you say in interviews.
Flash loops, cron-as-agent, and model swapping
Three Hermes Agent capabilities make the abstract argument concrete.
The agent loop is just a function. run_conversation() follows a standard ReAct pattern: build the API messages from conversation history, make an interruptible API call, parse the response, execute any tool calls, append the results, loop back. Provider failover, prompt caching, compression at fifty percent context: all explicit, all readable in the repo. When the loop is code, "the agent did something weird" stops being mystical and starts being a debug session you can finish.
Cron jobs as agent tasks. This is the part that lands for a Malta SME. Hermes has a first-class cron system where scheduled jobs are agent tasks, not shell scripts. Each scheduled job runs through a fresh AIAgent with the configured prompt, optional skills, and a delivery target. Concrete example: schedule a Flash model to poll a Malta Enterprise grant page every five minutes, decide whether the deadline page has changed in a meaningful way, and ping you on Telegram when it has. The job lives in a config file you can edit and version. Each run leaves a journal entry. The whole loop is replicable on any node.
Models swappable via the hermes model command. When the contract between agent and model is explicit, the model becomes interchangeable.
Model swapping by config, not by code. hermes model switches between Nous Portal, OpenRouter (two hundred plus models), NVIDIA NIM with Nemotron, Xiaomi MiMo, z.ai's GLM family, Kimi from Moonshot, MiniMax, Hugging Face hosted models, OpenAI, with no code changes. Sonnet for the gateway terminal interface where reasoning depth matters. Flash for the cron-based polling loops where latency and cost matter. An open-weights model for batch jobs where data residency matters. The constraint isn't which model. It is the contract the agent expects from any model. When the contract is explicit, the model becomes a deployment choice. When it isn't, you are locked in to whoever you started with.
The connecting thread: the loop is code, the cron schedule is configuration, the model selection is configuration. Configuration that lives in a repo, gets reviewed, produces journaled runs. That is what makes the same pattern work on a hobby Umbrel install and on a production cron fleet.
Above all of it sits Claude Code. The agent runtime, the cron schedules, the prompts, the journals: every piece of the stack this post argues for is text in a repo. The same tool that writes your application code can write, review, and modify your AI infrastructure. The recursion lands cleanly: when prompts are files, an AI coding assistant edits them like any other file. Pull request review applies. CI applies. git blame applies. The meta-layer isn't a separate discipline; it is the same discipline applied one level up. The payoff compounds: every refinement to a prompt or config lands in the same version-controlled surface your engineers already use, and the next round of edits builds on the last.
Constraint discipline, in practice
Five rules that compress the whole thesis into a checklist you can apply on Monday.
First, every prompt is a file in a repo. If you can't git blame it, you can't debug it. UI-only prompt edits are technical debt with a delayed bill.
Second, every prompt declares its scope explicitly. In-bounds work. Out-of-bounds work. Recency rules: don't touch files modified in the last week. File-touch limits: change at most three files per run. The rules don't have to be elaborate. They have to exist.
Third, every prompt change is a pull request. Reviewed, diffable, revertible. A two-line edit to a system prompt changes downstream behaviour as much as a two-line code change. Treat it the same.
Fourth, every agent run produces a journal entry the next run can read. Cross-run memory belongs in the repo, not in the model's context. The model's context is volatile. The repo is durable.
Fifth, every claim that "the model failed" is investigated as a prompt regression first, a model regression second. Most of the time the prompt drifted. Rule out drift before you rule in a model problem.
git blame your AI behaviour, you can't debug it.What this gets you
The ROI of constraint discipline isn't that the model gets smarter. The model is whatever it is. The ROI is that the system becomes one you can reason about. You can ship an agent that works, leave it for a month, come back, and either find it still works or know exactly which commit broke it. You can swap models when a better one ships. You can trust the cron loops to do their job between sessions. You can hand the system to a new engineer and they can read the rules instead of inheriting tribal knowledge.
That is the intelligence advantage worth chasing: not bigger models, but better-constrained systems around them.
Reliability isn't a property of the model. It is a property of the execution environment you build around it.

