Prompt Engineering Isn't Dead. It Became the Cheapest Part of Your Harness. (June 2026)

Published on June 6, 2026 · 10 min read

Every few months the discipline gets renamed and declared obsolete. In 2023 the job was prompt engineering, and Andrej Karpathy could say "the hottest new programming language is English." By 2025 the consensus had moved on: Shopify's Tobi Lütke argued that "context engineering describes the core skill much better," because what you put in the context window mattered more than how you phrased the ask. In 2026 the frame shifted again, to harness engineering, captured by Mitchell Hashimoto's rule for agents: "every time the agent makes a mistake, change the system so that mistake structurally cannot recur" [1].

Each rename arrives with the same subtext: the old skill is dead, learn the new one. I think that reading is wrong, and acting on it costs you. Prompt engineering did not die. It got demoted from "the whole job" to one component of a larger system, and it happens to be the cheapest, most portable, highest-leverage component you own. If you are a solo dev or a small team, that is the good news, and almost nobody is framing it that way.

I want to make this concrete, because "harness engineering" sounds like an enterprise discipline that needs a platform team, and at your scale it absolutely does not.

The three eras, and what actually changed

It helps to see the progression as one arc rather than three fads [1]:

Prompt engineering (2022 to 2024): "What should I say?" The belief was that instruction quality decided the outcome. Chain-of-thought, ReAct, few-shot examples. You hand-tuned wording and hoped.
Context engineering (2025): "What should I show it?" Once models got tool use, memory, and retrieval, the wording mattered less than what filled the window: the right documents, the right history, the right tool descriptions, structured so the model could use them.
Harness engineering (2026 onward): "What system should I build around it?" Agents loop, call tools, and fail in ways a single prompt cannot fix. The work moved to the scaffolding: the guardrails, the checks, the feedback loops that catch a mistake and make it impossible next time.

[Figure: The evolution from prompt engineering to context engineering to harness engineering, each era wrapping the last, with the prompt as the cheapest and most portable component you own.]

Notice what did not happen at any step: the previous skill did not disappear. Engineering rigor relocated. This is the pattern software has always followed, from design docs to type checkers to automated test suites: each layer wrapped the last rather than replacing it.

The useful definition of a harness is the blunt one: the harness is everything in the agent except the model [1]. The model is the rented brain. The harness is everything you build around it so the rented brain produces reliable work. And that "everything" sorts into four boxes:

Deterministic feedforward: rules you set before the run. A .cursorrules file, a system prompt template, a style guide the model must follow.
Deterministic feedback: machines that judge the output objectively. Compilers, linters, type checkers, test suites.
Non-deterministic feedforward: the instructions and constraints in natural language. This is where prompts live.
Non-deterministic feedback: an LLM judging output an LLM produced. Scoring, evals, the "is this actually good" check that no compiler can make.

You already run a harness. You just did not call it that

Here is the part the enterprise framing hides from solo devs. You do not need to adopt harness engineering. You are already doing it, probably accidentally.

If you have a .cursorrules or a CLAUDE.md in your repo, that is deterministic feedforward. If your CI runs a linter and a test suite before you merge agent-written code, that is deterministic feedback, and it is the single most powerful guardrail you have, because it is objective and free. If you keep a reusable prompt you paste into every code review, that is non-deterministic feedforward. If you eyeball the output and re-run when it is weak, that is a crude, manual version of non-deterministic feedback.

So the question for a small team is never "should I build a harness." You have one. The only question is whether it is deliberate or accidental. An accidental harness is the prompt living in three files, the lint rule nobody enforces, the eval you run in your head and forget. A deliberate harness is the same components, made explicit, versioned, and reused. The gap between the two is not budget. It is discipline, and discipline is the one input a solo dev has as much of as a platform team.
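To make "deliberate or accidental" checkable, here is a minimal sketch that inventories a repo by quadrant. CLAUDE.md and .cursorrules come from the text above; the other marker paths (.github/workflows, prompts, evals) and the function names are assumptions, so swap in whatever your own project uses.

```python
from pathlib import Path

# Marker paths per quadrant. CLAUDE.md and .cursorrules are named in the
# article; the remaining paths are illustrative assumptions, not a standard.
MARKERS = {
    "deterministic feedforward": ["CLAUDE.md", ".cursorrules"],
    "deterministic feedback": [".github/workflows"],
    "non-deterministic feedforward": ["prompts"],
    "non-deterministic feedback": ["evals"],
}


def audit(repo_root):
    """Map each quadrant to the marker paths that exist under repo_root."""
    root = Path(repo_root)
    return {
        quadrant: [m for m in markers if (root / m).exists()]
        for quadrant, markers in MARKERS.items()
    }


def missing_quadrants(report):
    """List the quadrants for which no marker was found."""
    return [quadrant for quadrant, found in report.items() if not found]


if __name__ == "__main__":
    report = audit(".")
    for quadrant, found in report.items():
        print(f"{quadrant}: {', '.join(found) if found else 'nothing found'}")
```

An empty quadrant does not prove the box is missing, only that it is not written down anywhere the script can see, which is the accidental case the paragraph above describes.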

The cheapest box to make deliberate is the prompt

Look at the four quadrants again and ask where your money and time go. Deterministic feedback, the compilers and linters, you mostly get for free from your toolchain; they already exist and they already work. Non-deterministic feedback, real evals, is the expensive corner: building a representative test set and scoring against it takes effort, and most small teams under-invest here.

But the highest return for the least cost sits in the feedforward column, and especially in the prompt itself. A prompt is cheap to write, cheap to change, and it shapes everything downstream. It is also the component that travels with you when the model underneath swaps out, which, as I have argued before, now happens roughly every two weeks. The model is rented and gets replaced constantly. The compiler belongs to your language. The prompt is the part of the harness that is genuinely yours, portable across every model you will ever route to.

That is why "prompt engineering is dead" is exactly backwards. In a world where the model is a commodity and the harness is the moat, the prompt is the load-bearing piece of the moat that you actually own and carry forward. Calling it "the whole job" was always too much credit. Dismissing it now is too little.

There is one trap to name. The prompt being cheap to change is also what makes it easy to let rot. Cheap-to-change means everyone changes it, in their own copy, for their own model, and six weeks later you have five forks and no idea which one is canonical. The cheapness is a feature only if you pair it with the one rule that keeps it from becoming sprawl: a single source of truth.
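One cheap way to notice the sprawl is to fingerprint every copy of a prompt and see how many distinct versions exist. The sketch below is an illustration, not a prescribed tool: the function names are hypothetical, and it assumes your prompt copies are plain text files you can list.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def fingerprint(text):
    """Hash a prompt after collapsing whitespace, so reflowed copies still match."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_forks(paths):
    """Group prompt files by content. More than one group means the copies diverged."""
    groups = defaultdict(list)
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        groups[fingerprint(text)].append(str(path))
    return list(groups.values())
```

If `find_forks` returns a single group, every copy agrees and any of them can serve as the canonical one. Several groups mean you have to pick a winner before the next edit lands.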

A minimal harness a solo dev can actually run

You do not need a platform. Here is the smallest harness I would run on top of any LLM-backed workflow, mapped to the four quadrants so you can see the coverage. Most of it you may already have; the point is to make each box deliberate instead of accidental.

[Figure: The four-quadrant harness for a solo dev: deterministic and non-deterministic feedforward and feedback, with the prompt highlighted as the one box you own and carry across models.]

1. Write down your rules once (deterministic feedforward). One CLAUDE.md or .cursorrules per project, holding the conventions the model keeps getting wrong. Every time the agent makes the same mistake, add a line. That is Hashimoto's rule applied at solo-dev scale: change the system so the mistake cannot recur.
2. Let the machine judge what the machine can (deterministic feedback). Lint, type-check, and test agent-written output in CI before it merges. This is the cheapest, most reliable guardrail in the whole stack, and it catches the failures a human reviewer skims past. Never let generated code in without it.
3. Keep prompts as intent, not model hacks (non-deterministic feedforward). Write what you want and the constraints. Keep the model-specific scaffolding, the "think step by step," the "respond in JSON, no preamble," in a clearly marked layer you can strip or swap, because that layer is tuned to one model's quirks and will not survive a swap.
4. Version every prompt with its target model (non-deterministic feedforward). A prompt without a note on which model it was shaped for is a liability the next time you swap backends. The version is the memory of what worked and why.
5. Score before you trust, and before you swap (non-deterministic feedback). Keep a small, representative set of your own inputs and check a prompt against them when you change it or change the model. "Good on a public benchmark" is not "good on your work." This is the box most solo teams skip, and it is the one that turns a model swap from a regression hunt into an afternoon.
6. One source of truth for every prompt. Not three repos and two chat tools. One authoritative copy you update once and trust everywhere. Scattered forks are also a security surface, as the agentic prompt-injection problem makes clear: every uncontrolled copy is another place an instruction can be smuggled in.
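The advice to keep intent separate from model-specific scaffolding, and to version each prompt with its target model, can be sketched as one small record type. Everything here is illustrative: the `PromptVersion` class, its field names, and the "model-a" and "model-b" labels are assumptions, not an existing library or real model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    target_model: str      # the model this version was shaped for
    intent: str            # what you want plus constraints; should survive a swap
    model_layer: str = ""  # model-specific scaffolding; expected to be replaced
    notes: str = ""        # what worked and why

    def render(self, include_model_layer=True):
        """Join the intent with the model layer, or return the intent alone."""
        parts = [self.intent]
        if include_model_layer and self.model_layer:
            parts.append(self.model_layer)
        return "\n\n".join(parts)

    def retarget(self, new_model, new_layer="", notes=""):
        """Start the next version for another model: keep the intent, drop the old layer."""
        return PromptVersion(
            self.name, self.version + 1, new_model, self.intent, new_layer, notes
        )


review = PromptVersion(
    name="code-review",
    version=1,
    target_model="model-a",  # placeholder label
    intent="Review the diff for bugs. Flag only issues you can tie to a specific line.",
    model_layer="Think step by step. Respond in JSON, no preamble.",
)

# Porting keeps the intent and starts with an empty model layer to re-tune.
ported = review.retarget("model-b", notes="ported from v1; model layer not yet tuned")
```

The design choice is that `retarget` never copies the model layer forward. The new version starts from intent alone, so scaffolding tuned to the old model cannot leak into the new one unnoticed.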

None of these need a vendor or a budget. Items 1 and 2 are config and CI you likely have. Items 3 through 6 are about treating your prompts as managed artifacts instead of disposable text, which is mostly a decision.
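Item 5 is the one that sounds like it needs tooling, so here is a minimal sketch of it. The names `score_prompt` and `safe_to_swap` are hypothetical, and `run_model` and `judge` are plain callables you supply: `run_model(prompt, input)` stands in for whatever client calls your model, and `judge(case, output)` can be an LLM grader or a simple check.

```python
def score_prompt(prompt, cases, run_model, judge):
    """Return the fraction of cases whose output the judge accepts."""
    if not cases:
        raise ValueError("need at least one case to score against")
    passed = 0
    for case in cases:
        output = run_model(prompt, case["input"])
        if judge(case, output):
            passed += 1
    return passed / len(cases)


def safe_to_swap(prompt, cases, old_model, new_model, judge, tolerance=0.0):
    """Compare the same prompt on two models over the same cases.

    Returns (ok, old_score, new_score). ok is True when the new model
    scores no more than `tolerance` below the old one.
    """
    old_score = score_prompt(prompt, cases, old_model, judge)
    new_score = score_prompt(prompt, cases, new_model, judge)
    return new_score >= old_score - tolerance, old_score, new_score
```

The cases are the part that matters. A dozen inputs taken from your own work, each with a note on what a good answer must contain, are enough to turn a model swap into a comparison instead of a guess.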

Why the rename keeps happening, and what to ignore in it

The discipline gets renamed because the frontier of difficulty moves. When models could barely follow instructions, wording was the hard part, so we named the skill after wording. When they could follow instructions but starved for the right information, context was the hard part. Now that they loop autonomously, the system around them is the hard part. The name tracks the current bottleneck, not the obsolescence of the last one.

What to ignore is the implied funeral. "Context engineering replaced prompt engineering" and "harness engineering replaced context engineering" are headlines, not architecture. In the actual system, all three coexist: you still write the prompt, you still feed it the right context, and you wrap both in a harness that checks the result. A solo dev who hears "prompt engineering is dead" and stops investing in their prompts has thrown away the cheapest, most portable asset they own to chase the expensive one.

Anthropic's own multi-agent setups separate a planner, a generator, and an evaluator precisely because an agent cannot reliably grade its own work [1]. That is a harness decision, and it is also, underneath, a prompt decision: three roles means three carefully written prompts, versioned and tested. The harness did not abolish the prompt. It gave the prompt a job to do inside a system.
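The separation of roles can be sketched in a few lines. This is not Anthropic's implementation, only an illustration of the shape: `plan`, `generate`, and `evaluate` are hypothetical callables, each standing in for a model call driven by its own prompt, and the retry limit is an assumption.

```python
def run_with_evaluator(task, plan, generate, evaluate, max_attempts=3):
    """Plan once, then generate until a separate evaluator accepts the draft.

    plan(task) returns the steps to follow.
    generate(task, steps, feedback) returns a draft.
    evaluate(task, draft) returns (accepted, feedback).
    Returns (draft, attempts), or (None, max_attempts) if nothing was accepted.
    """
    steps = plan(task)
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        draft = generate(task, steps, feedback)
        accepted, feedback = evaluate(task, draft)
        if accepted:
            return draft, attempt
    return None, max_attempts
```

The generator never decides whether its own draft is good enough. That decision sits with the evaluator, which is why the three prompts behind these callables each need their own version and their own tests.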

The signal

The names will keep changing. Next year there will be a fourth word for it, and a fresh wave of posts declaring harness engineering dead. The arc underneath is stable: build a system around a rented brain so it produces reliable work, and keep the parts you own portable.

For a solo dev or small team, the move is unglamorous and cheap. Make your accidental harness deliberate. Let the compiler and the test suite do the objective grading for free. And treat your prompts, the one component that is genuinely yours and travels across every model you will route to, as managed, versioned, tested artifacts rather than text you paste around. The model is rented. The harness is the moat. The prompt is the part of the moat you carry with you.

Keep My Prompts is built for the non-deterministic corners of that harness: version every prompt with its target model, score it on six quality criteria before you trust or swap it, and keep one source of truth instead of forks scattered across tools. Free to start, no credit card required.

References

[1] From Prompts to Harnesses: Four Years of AI Agentic Patterns, April 5, 2026. https://bits-bytes-nn.github.io/insights/agentic-ai/2026/04/05/evolution-of-ai-agentic-patterns-en.html (synthesizes Karpathy on prompt engineering, Tobi Lütke on context engineering, and Mitchell Hashimoto on the agent harness).
