Two-Track Review with Claude and Codex — Surfacing Blind Spots with a Different AI Lineage
So you don't share the same blind spots—pitting AIs of different lineages against each other
When you build alone, quality assurance is the shakiest part. An author reviewing their own code reads it with the same assumptions they had while writing it. Having one AI write the code and another review it does not help much either: two AIs of the same lineage slip past the same blind spots together.
What worked was a "two-track" review pitting two AIs of different lineages against each other: Claude (Anthropic) and Codex (OpenAI).
Why different lineages
AIs descended from the same training tend to be good at similar things—and weak at similar things. So a defect one misses, the other tends to miss too. Same-lineage review feels reassuring, yet the gaps line up.
When the model families differ, those gaps shift: a problem one walks past, the other snags on. Here is how it runs. First, on the Claude side, five reviewers run in parallel, each looking from a different angle. The setup comes from a public plugin called pr-review-toolkit:
- code-reviewer: errors in the code and convention violations
- silent-failure-hunter: swallowed exceptions and silent failures
- comment-analyzer: comments that disagree with the implementation
- pr-test-analyzer: missing tests and coverage gaps
- type-design-analyzer: the design of the types
Then the same change goes through Codex, of a different lineage.
Here is one case. We removed an argument from a database function and fixed every caller on the application side, but one old call was left behind in the SQL tests that check the database's behavior. All five distinct-lens reviewers missed it; only Codex, from a different lineage, caught it. Adding more lenses within one lineage does not close a gap the whole lineage shares, but a different model family can. This is where two-track review most clearly earns its keep.
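To make the "gaps shift" idea concrete, here is a minimal sketch of comparing the findings from the two tracks. Everything in it is an assumption for illustration: the `Finding` shape, the `compareTracks` helper, the file paths, and the sample data are invented, and neither pr-review-toolkit nor Codex is claimed to emit findings in this format.

```typescript
// Hypothetical shape for a review finding; neither tool is known to emit this format.
interface Finding {
  file: string;
  line: number;
  message: string;
}

// Two findings count as the same issue when they point at the same file and line.
// This key is deliberately crude; real reports would need fuzzier matching.
function keyOf(f: Finding): string {
  return `${f.file}:${f.line}`;
}

interface TrackComparison {
  both: Finding[]; // flagged by both tracks
  onlyTrackA: Finding[]; // flagged only by the first track
  onlyTrackB: Finding[]; // flagged only by the second track
}

function compareTracks(trackA: Finding[], trackB: Finding[]): TrackComparison {
  const keysA = new Set(trackA.map(keyOf));
  const keysB = new Set(trackB.map(keyOf));
  return {
    both: trackA.filter((f) => keysB.has(keyOf(f))),
    onlyTrackA: trackA.filter((f) => !keysB.has(keyOf(f))),
    onlyTrackB: trackB.filter((f) => !keysA.has(keyOf(f))),
  };
}

// Made-up data loosely mirroring the story above: the second track alone
// notices a stale call left in a SQL test file.
const claudeFindings: Finding[] = [
  { file: "src/api/orders.ts", line: 42, message: "error is caught and ignored" },
];
const codexFindings: Finding[] = [
  { file: "src/api/orders.ts", line: 42, message: "swallowed exception" },
  { file: "db/tests/orders.sql", line: 17, message: "call still passes the removed argument" },
];

const comparison = compareTracks(claudeFindings, codexFindings);
// Logs an array with one entry: "db/tests/orders.sql:17"
console.log(comparison.onlyTrackB.map(keyOf));
```

The `onlyTrackA` and `onlyTrackB` buckets are the interesting ones: they show what each lineage would have missed had it reviewed alone.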
What goes to machines, what goes to the AI
Still, not all of review goes to the AI. There's a line.
Anything with a single right answer goes to machines. What holds this up is a body of deterministic checks—same input, same result, every time—run as tests and automation:
- GitHub Actions: every time you open a PR, the agreed checks run automatically—lint, type checking, build, the test suites. A gate you can't merge through until everything is green.
- pgTAP: tests the database's functions and permissions directly, in SQL. You can pin down, say, "this function must not be executable under a logged-in regular customer's privileges"—and if a change accidentally opens that permission, CI goes red and stops it.
- vitest: unit tests for screens and logic—checking that a score calculation or a display behaves exactly as decided.
These pin down facts. They don't bend to anyone's mood, or to the AI's. That's why a slip like "an old call left behind" turns CI red the moment it slips in, and the different-lineage AI's findings and the mechanical gate together form a double net.
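As a minimal sketch of what "same input, same result, every time" looks like for a score calculation: the scoring rule, its weights, and the `assertEqual` helper below are all invented for this example and are not the product's real logic. In a vitest suite the same checks would sit inside test cases rather than run at the top level.

```typescript
// Hypothetical scoring rule, invented for illustration only.
type Severity = "low" | "medium" | "high";

const WEIGHTS: Record<Severity, number> = { low: 1, medium: 5, high: 20 };

// Sum the severity weights and cap the result at 100.
function calculateScore(severities: Severity[]): number {
  const total = severities.reduce((sum, s) => sum + WEIGHTS[s], 0);
  return Math.min(total, 100);
}

// Tiny stand-in for a test runner's assertion so the sketch has no dependencies.
function assertEqual(actual: number, expected: number, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}

// Each check fixes one decided behavior; a change that breaks it throws.
assertEqual(calculateScore([]), 0, "no findings");
assertEqual(calculateScore(["low", "medium", "high"]), 26, "mixed severities");
assertEqual(
  calculateScore(["high", "high", "high", "high", "high", "high"]),
  100,
  "cap at 100",
);
```

None of these checks involve judgment: either the function returns the decided number or the run fails, which is what makes them suitable for a merge gate.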
Judgments of meaning go elsewhere: whether a design is sound, whether a side effect was overlooked, how naming and responsibilities are laid out. Those go to the AI and, ultimately, a human. Get this backwards, for example by measuring the quality of prose or design with a proxy like "does it contain the required word," and you'll usually miss the mark. Form and fact to the machine, meaning to people and AI: that split is the foundation of two-track review.
Decide "done" before you implement
The trick to making that deterministic net pay off is order. Before writing a single line, you decide—in a form a machine can judge—what "done" means.
For instance: "this old pattern is gone from every file (a search returns zero)," "this condition is always true in SQL," "this test passes (exit code 0)." Writing the definition of done up front, as numbers and commands, keeps the bar from quietly slipping mid-implementation. And the reviewer only has to check whether it meets the criteria.
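A definition of done in this style can be sketched as a list of named yes/no checks. The types, helper names, file contents, and the `get_order` pattern below are hypothetical; a real setup would run an actual search and an actual test command instead of the in-memory stand-ins used here.

```typescript
// A done-criterion is a name plus a check that returns true or false, nothing in between.
interface DoneCriterion {
  name: string;
  isMet: () => boolean;
}

// Files are passed in as data so the sketch needs no file system access.
interface SourceFile {
  path: string;
  contents: string;
}

// "A search returns zero": count lines matching the old pattern across all files.
// Pass a pattern without the `g` flag, since `test` is stateful on global regexes.
function countMatchingLines(files: SourceFile[], pattern: RegExp): number {
  return files.reduce(
    (total, file) =>
      total + file.contents.split("\n").filter((line) => pattern.test(line)).length,
    0,
  );
}

// Report which criteria are not yet met; an empty list means "done".
function unmetCriteria(criteria: DoneCriterion[]): string[] {
  return criteria.filter((c) => !c.isMet()).map((c) => c.name);
}

// Made-up inputs: one SQL test still calls the function with the removed argument.
const files: SourceFile[] = [
  { path: "src/orders.ts", contents: "getOrder(id)" },
  { path: "db/tests/orders.sql", contents: "SELECT get_order(1, 'legacy');" },
];
const testExitCode: number = 0; // stand-in for the exit code of a real test command

const criteria: DoneCriterion[] = [
  {
    name: "old call pattern is gone",
    isMet: () => countMatchingLines(files, /get_order\(\d+,/) === 0,
  },
  { name: "test suite exits with 0", isMet: () => testExitCode === 0 },
];

// Logs an array with one entry: "old call pattern is gone"
console.log(unmetCriteria(criteria));
```

Because the criteria are written before implementation starts, the reviewer's question shrinks to whether `unmetCriteria` comes back empty.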
Anywhere a contract is involved—the shape of the exchange between screen and server, the definition of a database column—we go a step stricter: before starting, verify the real thing directly, write it down as the single source of truth, and only then implement. Hand a spec written from guesswork straight to implementation and the discrepancy propagates downstream. It's unglamorous, but this cut rework more than anything.
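One way to picture "write it down as the single source of truth" is a contract recorded once and checked against real payloads. This is a sketch under stated assumptions: `orderContract`, its field names, and `contractViolations` are invented for the example and say nothing about the actual schema.

```typescript
// Hypothetical contract for one response, written down once after checking the real thing.
// The field names are invented for this example.
const orderContract = {
  id: "number",
  status: "string",
  paid: "boolean",
} as const;

type FieldKind = "number" | "string" | "boolean";
type Contract = Record<string, FieldKind>;

// List every way a payload departs from the contract:
// missing fields, wrongly typed fields, and fields the contract does not know.
function contractViolations(
  contract: Contract,
  payload: Record<string, unknown>,
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(contract)) {
    if (!(field in payload)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof payload[field] !== contract[field]) {
      problems.push(
        `wrong type for ${field}: expected ${contract[field]}, got ${typeof payload[field]}`,
      );
    }
  }
  for (const field of Object.keys(payload)) {
    if (!(field in contract)) {
      problems.push(`unexpected field: ${field}`);
    }
  }
  return problems;
}

// A payload shaped from guesswork: id arrives as a string and paid is absent.
const guessedPayload = { id: "42", status: "open" };

// Logs two problems: a wrong type for id and a missing paid field.
console.log(contractViolations(orderContract, guessedPayload));
```

Checking a guessed payload against the recorded contract surfaces the discrepancy before implementation starts, instead of letting it propagate downstream.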
In practice, it doesn't end in one pass
Honestly: review doesn't finish in a single round.
For changes that pin down a spec or a design, it's not unusual to go a dozen-plus rounds before findings converge to zero. It doesn't end in one pass because each fix on one side gives rise to the next point of contention. Fix it → another seam shows → fix it again. A plodding back-and-forth—but precisely because we can run it without flinching, one person can bring a team's worth of eyes to a change.
(Why you can run that much review without guilt comes down to the cost structure of development-side AI—a topic for a later post.)
Surfaced by lineage, not by person
Two-track review works because it surfaces defects by difference in lineage, not by who's superior or inferior—and over that, the mechanical gate of deterministic tests and CI lays a second layer. Defects get found, and pinned down, half a step away from both the author's assumptions and any single model's quirks. A realistic way to lift the quality of solo development without adding hands—that's what we found.
How we assembled this way of building over a year is in "How a Micro-SaaS Tech Stack Changed in a Single Year," and the foundation that lets an AI touch real environments is in "What MCP Changed." Other posts on how we build are gathered in the dev category.
The micro-SaaS we build under this double and triple scrutiny is PentaTrail, a CTEM service that uses AI to continuously map your externally visible attack surface. If you're curious, take a look at your own company's "externally visible attack surface."