
The Screenshot Is the Missing Test for AI-Generated UI

AI code can pass unit tests while producing visually broken UI. Visual regression testing closes the gap — here's how to wire it into an AI coding workflow.


AI-generated code can pass TypeScript checks and satisfy unit tests, and still break the UI in ways that are invisible to both the compiler and the test runner. A component that renders at the wrong z-index, an overflow that clips on mobile, a grid that collapses when the container is narrower than the model assumed: these failures are real, they ship, and they're caught by a human looking at the screen, not by CI. Visual regression testing is the missing piece.


The Gap Between "It Compiles" and "It Looks Right"

TypeScript and unit tests verify behavior in the abstract. They don't verify that a position: fixed element is actually visible at the z-index you intended, or that your gap values hold up when the parent is narrower than assumed, or that the responsive breakpoint you told the model to target actually works on a 375px screen.

This gap exists in all code, but it's wider in AI-generated code for a specific reason: the model is reasoning about code structure, not visual output. It knows what CSS properties mean. It doesn't see what they produce. When a human engineer writes CSS, they're usually looking at the result in a browser. The model is not. It's interpolating from training data about what patterns tend to produce what layouts.

The failure modes I've seen most often in AI-generated CSS:

  • overflow: hidden on the wrong ancestor, clipping content that needs to scroll
  • position: absolute children escaping their intended positioning context
  • Flexbox direction assumptions that break on content that's longer than expected
  • Responsive grid columns that collapse too early or too late
  • z-index stacking that works in isolation but breaks in the component tree

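None of these are caught by npx tsc or jest: tsc only checks types, and Jest component tests typically run in jsdom, which does not compute layout or paint pixels.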

Tools That Fill the Gap

Playwright screenshots are the lowest-friction entry point. Playwright can take a full-page screenshot or a locator-scoped screenshot, and its toHaveScreenshot assertion compares the result against a baseline image stored on disk. Put that assertion in a test and you have visual regression testing on infrastructure you likely already have.

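// Playwright visual regression test.
// The relative URLs below assume `use.baseURL` in playwright.config.ts
// points at the server that renders the component.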
import { test, expect } from "@playwright/test";

test("ProductCard renders correctly at mobile width", async ({ page }) => {
  await page.setViewportSize({ width: 375, height: 812 });
  await page.goto("/components/product-card?story=default");
  
  const card = page.locator('[data-testid="product-card"]');
  await expect(card).toHaveScreenshot("product-card-mobile.png", {
    maxDiffPixels: 10,
  });
});

test("ProductCard renders correctly at desktop width", async ({ page }) => {
  await page.setViewportSize({ width: 1280, height: 800 });
  await page.goto("/components/product-card?story=default");
  
  const card = page.locator('[data-testid="product-card"]');
  await expect(card).toHaveScreenshot("product-card-desktop.png");
});

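The first run has no baseline to compare against, so Playwright writes the screenshot to disk as the new baseline and reports that run as failed. Subsequent runs diff against the stored image, and Playwright fails the test if the visual difference exceeds your threshold. By default, baseline file names include the browser and platform, so generate baselines in the same environment your CI uses.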
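To make the threshold concrete, here is a simplified sketch of what a pixel budget like maxDiffPixels means. It is not Playwright's actual comparator, which also applies a per-pixel color tolerance (the threshold option); the function names countDiffPixels and withinBudget are hypothetical and exist only for illustration.

```typescript
// Illustration only: a naive exact-match pixel comparison over RGBA buffers.
// Playwright's real comparison is more forgiving about small color differences.

/** Count pixels whose RGBA bytes differ between two same-sized images. */
function countDiffPixels(baseline: Uint8Array, actual: Uint8Array): number {
  if (baseline.length !== actual.length || baseline.length % 4 !== 0) {
    throw new Error("images must be same-sized RGBA buffers");
  }
  let diff = 0;
  // Each pixel occupies 4 consecutive bytes: R, G, B, A.
  for (let i = 0; i < baseline.length; i += 4) {
    if (
      baseline[i] !== actual[i] ||
      baseline[i + 1] !== actual[i + 1] ||
      baseline[i + 2] !== actual[i + 2] ||
      baseline[i + 3] !== actual[i + 3]
    ) {
      diff++;
    }
  }
  return diff;
}

/** A screenshot passes when the number of differing pixels is within budget. */
function withinBudget(
  baseline: Uint8Array,
  actual: Uint8Array,
  maxDiffPixels: number,
): boolean {
  return countDiffPixels(baseline, actual) <= maxDiffPixels;
}
```

A small budget such as the 10 pixels used in the mobile test above absorbs rendering noise, while a layout shift that moves a whole element changes far more pixels than that and fails the check.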

Chromatic integrates with Storybook and adds UI review to your PR workflow. Every component story gets a visual snapshot; reviewers approve or reject changes through Chromatic's review UI. The cost is a Chromatic account and Storybook adoption, but the review workflow is well-designed.

Percy is similar: it captures screenshots and diffs them in a visual review interface. It integrates with Playwright, Cypress, and several other testing frameworks.

Wiring This Into an AI Workflow

The practical pattern: the AI writes the component, you run the visual tests, and you review the screenshots before accepting the diff. The best AI workflow ends with a human review step, and the screenshot is what makes that review meaningful.

# after agent produces a new component
npx playwright test visual --update-snapshots  # create initial baseline
# review the screenshots manually
git add tests/visual/  # by default, baselines are stored in <spec-file>-snapshots/ next to each spec
git commit -m "add visual baseline for ProductCard"

# on subsequent AI edits to the component
npx playwright test visual  # fails if visual output changed
# review the diff images in test-results/

The key step is the manual review of the baseline. You're not just checking that tests pass — you're checking that the initial visual output is correct. If the AI generated a layout that looks wrong, the baseline captures that wrong state and your tests will protect it. Reviewing the initial screenshot is the human check that the compiler can't do.

The Responsive Testing Gap

The failure mode that benefits most from visual regression testing is responsive layout. When you tell an agent "make this component responsive," it will produce CSS that is plausibly correct for a set of breakpoints. Whether the layout actually looks right at 320px, 375px, 768px, and 1280px can only be confirmed by running the code in a browser at those widths.
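One way to keep those widths consistent across components is to generate the viewport matrix from a single list. The sketch below is a minimal, dependency-free version: the widths come from the paragraph above, while the heights, the snapshot naming scheme, and the viewportCases helper are assumptions made for illustration.

```typescript
interface ViewportCase {
  width: number;
  height: number;
  /** File name to pass to toHaveScreenshot. */
  snapshot: string;
}

// Widths from the article; heights are illustrative assumptions.
const DEFAULT_VIEWPORTS: ReadonlyArray<{ width: number; height: number }> = [
  { width: 320, height: 568 },
  { width: 375, height: 812 },
  { width: 768, height: 1024 },
  { width: 1280, height: 800 },
];

/** Build one screenshot case per viewport for a given component name. */
function viewportCases(
  component: string,
  viewports: ReadonlyArray<{ width: number; height: number }> = DEFAULT_VIEWPORTS,
): ViewportCase[] {
  return viewports.map(({ width, height }) => ({
    width,
    height,
    snapshot: `${component}-${width}w.png`,
  }));
}
```

In a Playwright spec, you would loop over viewportCases("product-card"), call page.setViewportSize with each case's width and height, and assert with toHaveScreenshot using the case's snapshot name, so every component gets the same four baselines.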

Add viewport-specific screenshot tests as a standard part of any AI-assisted component work. Make it a rule: if the agent touched CSS that could affect layout, add or update visual tests before the PR is approved. The screenshot is not a nice-to-have — it's the test that actually covers what the model can't verify for itself.
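That rule can be partly automated as a reviewer prompt. The following is a rough sketch, assuming stylesheets end in .css, .scss, or .less and that visual specs and their baselines live under tests/visual/ as in the commands earlier; both patterns and the needsVisualReview helper are hypothetical and should be adapted to your repository.

```typescript
// Hypothetical repository conventions; adjust both patterns as needed.
const STYLE_FILE = /\.(css|scss|less)$/;
const VISUAL_TEST_PATH = /^tests\/visual\//;

/**
 * Returns true when a change set touches stylesheets but neither adds nor
 * updates anything under the visual test directory. This is a prompt for a
 * human reviewer, not proof that the layout is broken.
 */
function needsVisualReview(changedFiles: readonly string[]): boolean {
  const touchesStyles = changedFiles.some((file) => STYLE_FILE.test(file));
  const touchesVisualTests = changedFiles.some((file) =>
    VISUAL_TEST_PATH.test(file),
  );
  return touchesStyles && !touchesVisualTests;
}
```

Fed with the list of files changed in a PR, a check like this flags the case where the agent edited CSS and nobody looked at a screenshot. It will miss layout changes made outside stylesheets, such as utility classes or inline styles in component files, so treat it as a reminder and not as a gate.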