
The Best AI Workflow Still Ends With a Human Review

AI gets a lot right. It also misses things a human catches in 30 seconds: edge cases that come from domain knowledge, and whether the solution actually solves the problem.



AI coding agents are fast. They're often correct. They write idiomatic code, handle error cases, follow conventions when you tell them the conventions. And they consistently miss a category of problem that doesn't show up in the prompt: the context you forgot to include, the edge case that comes from knowing your users, the question of whether the solution actually solves the right problem. That category is not shrinking as models get better. It requires a human.


What AI Gets Wrong That Humans Catch Fast

The failure mode is specific. An AI coding agent works from the context you provide. If your prompt is complete and accurate and captures every constraint, you get good output. The problem is that prompts are never complete. You know things about your system that you didn't think to mention because they feel obvious. The model doesn't know what you know. It fills the gaps with plausible defaults.

Context that wasn't in the prompt. You ask the agent to add pagination to a list component. It adds pagination using offset-based queries. You use cursor-based pagination everywhere else in the codebase — but you didn't say that, and the agent didn't infer it from reading two unrelated files. Now you have two pagination patterns. A human reviewer who knows the codebase catches this in the first ten seconds of reading the diff.
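To make the pagination mismatch concrete, here is a minimal sketch of the two patterns side by side. It assumes an in-memory list of rows ordered by a hypothetical `id` field in place of a real query; the function names are illustrative and do not come from any particular codebase.

```python
from typing import Optional


def page_by_offset(rows: list[dict], offset: int, limit: int) -> list[dict]:
    """Offset-based: skip `offset` rows, then return the next `limit` rows."""
    return rows[offset:offset + limit]


def page_by_cursor(
    rows: list[dict], after_id: Optional[int], limit: int
) -> tuple[list[dict], Optional[int]]:
    """Cursor-based: return rows whose id is greater than the cursor.

    Also returns the cursor for the next page, or None when no rows remain.
    Assumes `rows` is sorted by ascending id.
    """
    remaining = [r for r in rows if after_id is None or r["id"] > after_id]
    page = remaining[:limit]
    next_cursor = page[-1]["id"] if len(remaining) > limit else None
    return page, next_cursor
```

The two behave differently when data changes between requests. With ten rows and a page size of three, deleting row 2 after the first page has been fetched makes the offset version return ids 5, 6, 7 for the second page and silently skip 4, while the cursor version still returns 4, 5, 6. Both pass a basic test, which is why the inconsistency tends to surface in review and not in CI.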

Edge cases from domain knowledge. You ask the agent to write a discount calculation function. It handles the happy path correctly. It doesn't handle the case where a user has two overlapping discounts, because that's a business rule that lives in your head and your product spec, not in the code the agent read. A human who has talked to sales for six months knows that case exists.
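As an illustration of the discount case, the sketch below contrasts a plausible default with a rule only a human would know. The rule used here ("overlapping discounts do not stack; only the largest applies") is invented for the example, as are the function names; your actual business rule may be entirely different.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def stacked_price(price: Decimal, percent_discounts: list[Decimal]) -> Decimal:
    """Plausible default: apply every percentage discount in sequence."""
    for pct in percent_discounts:
        price = price * (Decimal(1) - pct / Decimal(100))
    return price.quantize(CENT)


def largest_discount_price(price: Decimal, percent_discounts: list[Decimal]) -> Decimal:
    """Hypothetical business rule: only the largest discount applies."""
    pct = max(percent_discounts, default=Decimal(0))
    return (price * (Decimal(1) - pct / Decimal(100))).quantize(CENT)
```

For a price of 100 with a 20% and a 10% discount, the first function yields 72.00 and the second 80.00. Neither is a bug in isolation. Which one is correct depends on a rule that is not visible in the code the agent read.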

Whether the solution solves the right problem. This is the most expensive miss. The agent solved the problem you described. The problem you described isn't actually what users are running into. A human reviewer — especially one close to the product — can sometimes catch this. The review question isn't just "does this code work" but "does this code solve what we're actually trying to solve."

The Specific Review Steps That Matter

A good review of AI-generated code has a different focus than a good review of human-written code. Reviews of human-written code look for bugs, style issues, and missed cases. Reviews of AI-generated code should do that too, and they should add the following steps:

Read the full diff, not the summary. The agent's summary describes its intent. The diff describes what it actually did. These diverge more than you'd expect. Read every changed file.

git diff main --stat     # scope check: did it touch things it shouldn't?
git diff main            # line by line: what exactly changed?

Run the tests yourself. Don't accept "tests pass" from the agent's output. Run them in your environment. The agent may have run the tests against a state of the code that differs from the one you are reviewing, or may have run only a subset.
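One way to make this step routine is a small wrapper that runs the full test command in your own checkout and trusts only the exit code. This is a sketch: the default command assumes the project uses pytest, which may not match your setup, and `run_suite` is a hypothetical helper name.

```python
import subprocess
import sys

# Assumption: the project's tests run with pytest. Substitute your own command.
DEFAULT_COMMAND = [sys.executable, "-m", "pytest"]


def run_suite(command: list[str]) -> bool:
    """Run the test command in the current checkout.

    Returns True only if the process exits with status 0.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Print the tail of the output so the reviewer sees the failure.
        print(result.stdout[-2000:])
        print(result.stderr[-2000:])
    return result.returncode == 0
```

Calling `run_suite(DEFAULT_COMMAND)` from the repository root after checking out the agent's branch runs the whole suite against the exact state under review, not against whatever state the agent last tested.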

Check the UI visually. If the change touches anything rendered, open it in a browser. The model usually doesn't see the rendered UI. You do. This takes two minutes and catches layout problems that automated tests rarely surface.

Ask: what did this assume? Every agent output makes assumptions. What data shapes did it assume? What user flows did it assume won't exist? What scale did it assume? Name the assumptions and check them against reality.
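Naming assumptions works best when each one becomes something you can run against real data. The sketch below writes assumptions as named predicates and counts how many records break each one. The order fields (`items`, `qty`, `currency`) and the assumptions themselves are hypothetical examples, not taken from any real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assumption:
    name: str
    holds: Callable[[dict], bool]


# Hypothetical assumptions an agent might have made about an order record.
ORDER_ASSUMPTIONS = [
    Assumption(
        "every order has at least one item",
        lambda o: len(o.get("items", [])) > 0,
    ),
    Assumption(
        "currency is always USD",
        lambda o: o.get("currency") == "USD",
    ),
]


def violations(records: list[dict], assumptions: list[Assumption]) -> dict[str, int]:
    """Count, per assumption, how many records break it. Omits zero counts."""
    counts = {a.name: 0 for a in assumptions}
    for record in records:
        for a in assumptions:
            if not a.holds(record):
                counts[a.name] += 1
    return {name: n for name, n in counts.items() if n > 0}
```

Running `violations` over a sample of production-like records turns "what did this assume?" into a short list of assumptions that reality already contradicts.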

The Mistake of "AI Did It, Therefore It's Probably Fine"

This is the failure mode that's most worth naming directly: the assumption that because AI is fast and often correct, the output requires less scrutiny than human-written code. The opposite is closer to true.

Human engineers have context you don't have to articulate. They've been in the design review. They know why the last attempt at this feature was reverted. They know the user who complained about this last month. That embedded context shapes what they write and what they consider.

The agent has none of that unless you put it in the prompt. So AI-generated code is produced with less accumulated context than code from a senior engineer who's been on the project. It deserves more scrutiny at review time, not less.

Where Autonomy Is Appropriate

None of this argues against agent tooling. It argues for calibrating trust to the task.

Appropriate for high autonomy: scaffolding, boilerplate, test generation for known-good code, documentation, migration scripts for well-specified transformations, refactoring within a single file with clear criteria.

Requires human review before shipping: any change that touches core business logic, API contracts, authentication or authorization, data models, anything a user will directly interact with.

Always decide yourself: the question of whether to build the feature at all, the approach to a problem where multiple trade-offs exist, anything that involves product judgment about what users actually need.
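The first two tiers can be approximated mechanically by mapping changed file paths to a review tier. The sketch below is one possible encoding; the path patterns are invented for illustration and would need to match your own repository layout. The third tier is a product decision and is not something a path check can capture.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns. Adjust them to your repository layout.
HUMAN_REVIEW_PATTERNS = ["src/billing/*", "src/auth/*", "src/models/*", "api/*", "ui/*"]
HIGH_AUTONOMY_PATTERNS = ["docs/*", "tests/*", "*.md"]


def review_tier(path: str) -> str:
    """Return 'human-review' or 'high-autonomy' for one changed file path."""
    if any(fnmatchcase(path, p) for p in HUMAN_REVIEW_PATTERNS):
        return "human-review"
    if any(fnmatchcase(path, p) for p in HIGH_AUTONOMY_PATTERNS):
        return "high-autonomy"
    # Paths nobody has classified default to the stricter tier.
    return "human-review"


def needs_human_review(changed_paths: list[str]) -> list[str]:
    """Filter a list of changed paths down to those requiring human review."""
    return [p for p in changed_paths if review_tier(p) == "human-review"]
```

Feeding it the output of `git diff main --name-only` gives a quick list of files that should not ship without a human reading them. Defaulting unknown paths to the stricter tier is a deliberate choice here: a missing pattern costs a few minutes of review instead of an unreviewed change to something sensitive.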

The agent is a fast, capable collaborator that works from the information you give it. The human review is where you bring the information that wasn't in the prompt. Both are necessary. Neither replaces the other.