On June 1, 2026, xAI opened Grok Build 0.1 on its API: no waitlist, no SuperGrok or X Premium subscription, $25 in credits when you sign up at console.x.ai [1][2]. It is the same model that powers the Grok Build CLI, now callable with an API key. The numbers that matter are the boring ones: $1 per million input tokens, $2 per million output, $0.20 per million on cached input, served at 100-plus tokens per second, with a 256K context window and native Model Context Protocol support [1].
That is not a frontier-model announcement, and that is exactly why it matters. The headline race is about who tops SWE-Bench this month. The thing that actually changes how a solo dev works is quieter: the cheap, fast tier of coding models just crossed the line from "fine for toys" to "good enough to run real agentic work," and the economics now argue for flipping your default. Reach for the cheap fast model first, and escalate to the expensive one only for the hard 10%.
I want to make that concrete, because "use the cheap model" sounds like a corner-cutting tip and it is actually an architecture decision.
What shipped
Strip the launch copy and here is the developer-relevant core [1][2]:
It is on the open API. Grok Build 0.1, the model behind xAI's Grok Build CLI, is now directly callable. Sign up, get $25 in credits, send requests. No consumer subscription gate.
It is priced to be looped. $1/M in, $2/M out, and crucially $0.20/M on cached input, 80% off the uncached input rate. Repeated context (your system prompt, your repo map) is cheap to send again and again.
It is fast. 100-plus tokens per second. Speed is not a vanity metric in an agent loop; it is the difference between a tool you wait on and a tool you iterate with.
It is MCP-native. You declare MCP servers right in the tools array with "type": "mcp", so plugging in your own knowledge bases and internal APIs is a first-class feature, not a bolt-on.
It is aimed at agentic coding. Web development, debugging, multi-step coding tasks, the work where the model runs in a loop rather than answering once.
None of these figures are mine: they come from xAI's announcement and the reporting around it, and I am citing no Keep My Prompts (KMP) data here.
Default down, escalate up: route the bulk of your coding volume to the cheap fast model Grok Build 0.1 ($1/$2 per million tokens, 100+ tokens per second, MCP-native, 256K context), and reserve the expensive frontier model for the hard 10%, letting your own tests draw the line
The real story is price times speed, not the leaderboard
Here is the shift people miss because they are staring at benchmark tables. When a capable coding model costs $1/$2 per million tokens and runs at 100-plus tokens per second, you can afford to run it in a tight loop: generate a change, run the tests, read the failure, fix, retry, three or five times, and still pay less than one careful call to a frontier model. Cheap and fast is not a worse version of expensive and slow. In an agent loop it is a different and often better shape, because the loop, not the single answer, is what produces working code.
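To put rough numbers on that claim, here is a minimal sketch of the arithmetic. The cheap-tier prices are the ones quoted above. The frontier prices and the task size (30K tokens in, 2K out) are placeholders I made up for illustration, so swap in the real price list of whatever frontier model you actually use.

```python
# Prices in USD per million tokens.
CHEAP = {"input": 1.00, "output": 2.00}  # figures quoted in this article
# HYPOTHETICAL frontier prices, for illustration only; not a quoted price list.
HYPOTHETICAL_FRONTIER = {"input": 15.00, "output": 75.00}


def call_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one request at the given per-million-token prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


def loop_cost(prices, input_tokens, output_tokens, iterations):
    """Cost of repeating a same-sized request `iterations` times, no caching."""
    return iterations * call_cost(prices, input_tokens, output_tokens)


if __name__ == "__main__":
    # Assumed task shape: 30K tokens of context in, 2K tokens of patch out.
    cheap_loop = loop_cost(CHEAP, 30_000, 2_000, iterations=5)
    frontier_once = call_cost(HYPOTHETICAL_FRONTIER, 30_000, 2_000)
    print(f"cheap x5: ${cheap_loop:.3f}   frontier x1: ${frontier_once:.3f}")
```

With those assumed numbers, five cheap iterations come to about $0.17 against about $0.60 for a single frontier call. Whether the inequality holds for you depends entirely on the frontier price and on how many retries the cheap model really needs.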
The expensive frontier models are extraordinary at one-shot hard reasoning, and they are painful to loop, both on latency and on bill. The cheap tier inverts that. So the question stops being "which model is smartest" and becomes "which model is smart enough to be worth looping on this task," and for a large slice of everyday coding the answer is now the cheap one.
This is also why the cached-input price matters more than it looks. In an agent loop you resend the same context, the system prompt, the repo structure, the conventions, on every turn. At $0.20/M that repetition is nearly free, which means the loop architecture xAI is pricing for is the one they expect you to actually use. They built the price list for agents, not for chat.
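A quick sketch of what the cached rate does to the input side of a loop. The two prices are the ones from the announcement. The billing model is my simplification: I assume the stable prefix is charged at the full rate on the first turn and at the cached rate afterwards, which may not match how xAI actually meters cache hits.

```python
INPUT_PRICE = 1.00         # USD per million uncached input tokens (from the article)
CACHED_INPUT_PRICE = 0.20  # USD per million cached input tokens (from the article)


def loop_input_cost(stable_tokens, fresh_tokens_per_turn, turns, cached=True):
    """Input-side cost in USD of an agent loop.

    Simplifying assumption: with caching, the stable prefix (system prompt,
    repo map, conventions) is billed at the full rate on turn one and at the
    cached rate on every later turn. Fresh tokens (new test output, diffs)
    are always billed at the full rate.
    """
    if turns <= 0:
        return 0.0
    fresh = fresh_tokens_per_turn * turns * INPUT_PRICE
    if cached:
        stable = stable_tokens * (INPUT_PRICE + (turns - 1) * CACHED_INPUT_PRICE)
    else:
        stable = stable_tokens * turns * INPUT_PRICE
    return (fresh + stable) / 1_000_000


if __name__ == "__main__":
    # Assumed loop shape: 40K stable tokens, 2K fresh tokens per turn, 10 turns.
    print(loop_input_cost(40_000, 2_000, 10, cached=False))
    print(loop_input_cost(40_000, 2_000, 10, cached=True))
```

For that assumed loop, the input bill drops from about $0.42 to about $0.13. The longer the loop and the larger the stable prefix, the closer the saving gets to the full 80% on that prefix.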
The catch: cheap is not the same as good on your code
The temptation is to read "$1 model, 70.8% on SWE-Bench" and route everything to it. Slow down. That 70.8% is self-reported on xAI's own internal harness [1], which makes it a starting hypothesis, not a verdict, and SWE-Bench is not your codebase regardless of who measured it. A model that is excellent on public Python tasks can be mediocre on your particular stack, your conventions, your weird legacy module.
So the cheap-by-default strategy only works if you know where the cheap model breaks. That is an evaluation problem, and it is the part most people skip. Before you route a class of work to Grok Build, run your own representative tasks through it and check the output the way you would check a junior's: does it pass the tests, does it match the conventions, does it hold up on the gnarly cases, not just the clean ones. Let your test suite be the judge, which is the same discipline I argued for in the harness piece: the cheapest, most reliable guardrail you have is the compiler and the tests, and they do not care which model wrote the code.
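One way to run that check is a tiny harness that reports pass rate per class of work. This is a sketch, not a finished tool: `EvalTask`, `pass_rates` and the `generate` callable are names I invented here, and `generate` stands in for whatever client call you use to reach the model. In real use, `passes` would apply the output to a scratch checkout and run your test suite.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    category: str                  # e.g. "boilerplate", "concurrency"
    prompt: str
    passes: Callable[[str], bool]  # your own check: True if the output is acceptable


def pass_rates(tasks, generate):
    """Run every task through `generate` and return the pass rate per category."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for task in tasks:
        totals[task.category] += 1
        try:
            ok = bool(task.passes(generate(task.prompt)))
        except Exception:
            ok = False  # a crash in generation or checking counts as a failure
        passed[task.category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Fake model for demonstration only: it just echoes the prompt in upper case.
    def fake_model(prompt):
        return prompt.upper()

    tasks = [
        EvalTask("boilerplate", "add getter", lambda out: "GETTER" in out),
        EvalTask("boilerplate", "add setter", lambda out: "SETTER" in out),
        EvalTask("concurrency", "fix race", lambda out: "lock" in out),
    ]
    print(pass_rates(tasks, fake_model))
```

The per-category breakdown is the useful part. A single overall pass rate hides exactly the information you need, which is which classes of work the cheap model can be trusted with.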
Find the line. Above it, the cheap model is good enough and you pocket the savings. Below it, you escalate.
The solo-dev play: default down, escalate up
Route by task, not by leaderboard: send boilerplate, scaffolding, spec-driven refactors, test generation and loop work to the cheap fast Grok Build 0.1; reserve the frontier model for architecture calls, subtle bugs and the genuinely hard 10%, with the test suite drawing the line
Concretely, here is the routing I would set up the week a model like this lands:
Send the bulk, mechanical work to the cheap tier. Boilerplate, refactors with a clear spec, test scaffolding, web glue, format conversions, the high-volume low-ambiguity tasks. This is where price-times-speed pays off and where a 70-percent-class model is usually plenty.
Reserve the frontier model for the hard 10%. The genuine architecture calls, the subtle concurrency bug, the task where being wrong is expensive and one careful expensive answer beats five cheap wrong ones. Pay for intelligence exactly where intelligence is the bottleneck.
Cache the repeated context. Put your stable system prompt and repo context where the $0.20 cached rate applies, so the loop is as cheap as the price list allows.
Keep the prompt itself model-independent. You are now routing across at least two models by design, so write intent and constraints, and keep the model-specific scaffolding in a layer you can swap. A cheap model and a frontier model want slightly different handholding.
Re-test on every version bump. This is Build 0.1, an explicit beta. The behavior will move. Whatever you validated this week, revalidate when the number after the dot changes.
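The routing rules above can be sketched in a few lines. Everything named here is hypothetical: the category labels, the `route` function, and the `cheap`, `frontier` and `tests_pass` callables, which stand in for your two model clients and your test runner. The set of cheap categories should come from your own evaluation, not from my list.

```python
# Categories you have validated on the cheap tier (illustrative labels only).
CHEAP_CATEGORIES = {
    "boilerplate",
    "refactor",
    "test_scaffolding",
    "web_glue",
    "format_conversion",
}


def route(category, prompt, cheap, frontier, tests_pass, max_cheap_attempts=3):
    """Default down, escalate up. Returns (tier_label, output).

    Validated categories go to the cheap model first and are retried up to
    `max_cheap_attempts` times. Anything else, or anything the cheap model
    cannot get past the tests, goes to the frontier model.
    """
    if category in CHEAP_CATEGORIES:
        for _ in range(max_cheap_attempts):
            output = cheap(prompt)
            if tests_pass(output):
                return "cheap", output
    # Unvalidated category, or the cheap tier used up its attempts.
    return "frontier", frontier(prompt)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any API access.
    label, _ = route(
        "boilerplate",
        "generate a DTO",
        cheap=lambda p: "cheap answer",
        frontier=lambda p: "frontier answer",
        tests_pass=lambda out: True,
    )
    print(label)
```

This sketch retries with the same prompt, which only helps if the model is sampled with some randomness. A real loop would feed the test failure back into the next attempt. It also returns the frontier output unchecked, so the caller should still run the tests on whatever comes back.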
The point is not "Grok Build is the best model." It almost certainly is not, on raw capability. The point is that for the majority of a solo dev's actual coding volume, best is not what you are buying. Good enough, cheap and fast is, and that tier just became real.
This is the same instinct good engineers already have about compute: you do not run every job on the biggest instance, you match the machine to the task and reserve the expensive hardware for the work that needs it. Models are becoming the same kind of resource. Treating the frontier model as your default for everything is starting to look like running a batch script on a GPU cluster because it was the box you happened to have open.
One more signal: MCP at version 0.1
It is worth noting what xAI chose to put in a v0.1. Native MCP support, declared in the tools array, shipped on day one of the public API [1]. A year ago, tool and protocol support was the thing you waited several versions for. Now it is table stakes, present in the first public build of a budget coding model. The agentic plumbing has standardized fast.
That is good for builders and it carries the usual caveat: the moment your cheap coding model is wired to internal knowledge bases and proprietary APIs over MCP, it inherits the agentic prompt-injection surface I wrote about. Cheap to call does not mean cheap to secure. Wire it up with the same care you would give the expensive one.
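A sketch of what "the same care" can look like at the point where the tools array is built. The only detail taken from the announcement is that an MCP server is declared in `tools` with `"type": "mcp"`. The other keys (`server_label`, `server_url`, `input`), the model identifier and the example host are assumptions of mine, so check xAI's API reference for the real request schema before using any of this.

```python
import json
from urllib.parse import urlparse

# Hypothetical allowlist: only MCP servers you operate and have reviewed.
ALLOWED_MCP_HOSTS = {"mcp.internal.example.com"}


def mcp_tool(label, url):
    """Build one tools-array entry for an MCP server, refusing unlisted hosts."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname not in ALLOWED_MCP_HOSTS:
        raise ValueError(f"MCP server not allowed: {url}")
    # Only "type": "mcp" is stated in the article; the other keys are assumed.
    return {"type": "mcp", "server_label": label, "server_url": url}


def build_request(prompt, tools):
    """Assemble a request body as a plain dict. Nothing is sent over the network."""
    return {
        "model": "grok-build-0.1",  # assumed identifier, verify in the console
        "input": prompt,            # assumed field name
        "tools": tools,
    }


if __name__ == "__main__":
    tools = [mcp_tool("kb", "https://mcp.internal.example.com/kb")]
    print(json.dumps(build_request("Summarise the deploy runbook", tools), indent=2))
```

An allowlist at construction time does not address prompt injection coming from the content those servers return. It only keeps the set of servers the model can reach explicit and reviewable, which is the cheap first step.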
The signal
The frontier race will keep taking the headlines, and it should: those models do things nothing else can. But for a solo dev deciding where the next month of API spend goes, the more consequential event of early June was not at the top of the benchmark. It was a $1, fast, MCP-native coding model landing on an open API with $25 of credits attached.
Default to the cheap fast tier for the volume work, escalate to the frontier model for the hard minority, and let your own tests draw the line between them. The expensive model is for the problems that are actually hard. Most of your coding is not, and you no longer have to pay as if it were.
Keep My Prompts lets you keep one version of each prompt, score it on six quality criteria, and compare how a cheap model and a frontier model handle the same task on your own inputs, so you can route by evidence instead of by benchmark. Free to start, no credit card required.