
April 20, 2026 • 6 min read

How to Test Every Pull Request Automatically

Most teams check how code is written. Few check if the product still works. Here's how to make behavioral validation the default for every pull request — without writing a single test script.

Evan Marshall, Founder, Ito

Your linter says the style is perfect. Your unit tests say the functions are logical. Your senior dev says the architecture is sound.

Then you merge, and the checkout button stops working.

This is the "merge and hope for the best" anti-pattern. Most engineering teams have world-class pipelines for checking how code is written, but almost no infrastructure for checking if the product still works before the PR is closed.

If your quality strategy relies on a manual spot-check in a staging environment after the merge, you aren't testing — you're just delaying the inevitable fire drill. Here's how to implement automated PR testing that actually scales.

The problem: Most PRs ship without behavioral testing

Code reviews are excellent for catching logic errors. Linters are great for style. Unit tests are essential for function-level verification. But none of these tell a reviewer if the user can still complete a core flow.

Industry research shows that the majority of production incidents trace back to pull requests that were technically sound but behaviorally broken. Without behavioral testing on pull requests, your reviewers don't get the full story. They can see that the code change on line 42 looks clean, but they have no way of knowing whether it broke a user flow on the frontend.

What "testing every PR" actually means

A common mistake is trying to run your entire legacy E2E suite on every pull request. That's a recipe for a CI bottleneck: the suite is too slow, too flaky, and it trains developers to ignore CI results.

True pre-merge testing requires four specific capabilities:

Impact analysis: Identifying exactly which user flows are affected by the diff so you don't waste time testing unrelated features.
Isolated deployment: Spinning up ephemeral test environments (sandboxes) for that specific branch.
Targeted execution: Running only the behavioral flows that matter within the 10-minute review window.
Contextual reporting: Posting results — including video and reproduction steps — directly where the conversation is happening: the PR comment.
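
To make the first and third capabilities concrete, here is a minimal sketch of impact analysis and targeted execution. It is not how Ito works internally. It assumes a hand-written catalogue that maps source paths to user flows, and every flow name, path pattern, and duration below is hypothetical. The 600-second default mirrors the 10-minute review window.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Flow:
    name: str
    path_globs: tuple[str, ...]  # source paths that can affect this flow
    est_seconds: int             # rough expected run time


# Hypothetical catalogue, for illustration only.
FLOWS = [
    Flow("checkout", ("src/cart/*", "src/payments/*"), 240),
    Flow("login", ("src/auth/*",), 90),
    Flow("search", ("src/search/*",), 150),
]


def impacted_flows(changed_files, flows=FLOWS):
    """Return the flows with at least one glob matching a changed file."""
    return [
        flow
        for flow in flows
        if any(fnmatch(path, glob) for path in changed_files for glob in flow.path_globs)
    ]


def fit_to_budget(flows, budget_seconds=600):
    """Greedily keep the shortest flows whose total run time fits the budget."""
    selected, used = [], 0
    for flow in sorted(flows, key=lambda f: f.est_seconds):
        if used + flow.est_seconds <= budget_seconds:
            selected.append(flow)
            used += flow.est_seconds
    return selected


if __name__ == "__main__":
    changed = ["src/cart/total.py", "README.md"]
    to_run = fit_to_budget(impacted_flows(changed))
    print([flow.name for flow in to_run])
```

A static path map like this is the simplest possible form of impact analysis. It goes stale as the codebase moves, which is part of the argument for inferring the mapping from the diff instead of maintaining it by hand.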

Why traditional approaches don’t cut it

Most teams try to solve this with one of three methods. All of them eventually break:

Running Playwright or Cypress in CI. The most common path. It works until you hit 50 tests. Then the "maintenance tax" kicks in: every UI change breaks a selector, CI turns red, and engineers spend their mornings fixing tests instead of shipping features.
Manual QA review per PR. This simply doesn't scale. You either hire an army of testers or — more likely — QA becomes the bottleneck that delays every release by 48 hours.
Preview environments with spot-checking. You spin up a Vercel or Heroku preview link and hope a developer or PM clicks around enough to find the bugs. It's inconsistent, unrecorded, and relies entirely on human memory.

The agentic approach to PR testing

This is where Ito changes the math. Instead of writing and maintaining scripts, we use a QA agent that understands your application.

When a PR is opened, Ito doesn't wait for a manual trigger. The agent:

Reads the diff to infer the intent of the change.
Maps the change to the behavioral flows it impacts.
Deploys a sandbox of your app automatically.
Executes the flows in a real browser, navigating like a user.
Posts the evidence — video, screenshots, and test failure reproduction steps — directly to the PR.
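
The last step, posting evidence to the PR, can be sketched with GitHub's REST API, which treats a pull request comment as an issue comment. This is an illustrative sketch, not Ito's implementation: the report layout, the function names, and the example video URL are assumptions. The request is built but never sent, so the snippet runs without network access or a real token.

```python
import json
import urllib.request


def build_report_body(flow_results, video_url):
    """Render (name, passed, repro_steps) tuples as a Markdown PR comment."""
    lines = ["## Behavioral test report", ""]
    for name, passed, repro_steps in flow_results:
        lines.append(f"- {'PASS' if passed else 'FAIL'}: {name}")
        if not passed:
            # Number the reproduction steps under the failing flow.
            lines.extend(f"  {i}. {step}" for i, step in enumerate(repro_steps, 1))
    lines += ["", f"Video: {video_url}"]
    return "\n".join(lines)


def build_comment_request(owner, repo, pr_number, body, token):
    """Build, but do not send, the request that comments on a pull request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    return urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


if __name__ == "__main__":
    results = [
        ("checkout", False, ["Add an item to the cart", "Click Pay"]),
        ("login", True, []),
    ]
    body = build_report_body(results, "https://example.com/run.mp4")  # placeholder URL
    request = build_comment_request("acme", "shop", 42, body, token="<token>")
    print(request.full_url)
    print(body)
    # Sending would be: urllib.request.urlopen(request)
```

Keeping the report in the PR thread means the reviewer sees behavioral evidence next to the diff rather than in a separate dashboard.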

What an Ito PR report looks like

[Image: example Ito PR report]

Start testing in 5 minutes

Doing QA on every PR shouldn't be a month-long infrastructure project.

Install the Ito GitHub app. Connect Ito to your organization in one click.
Connect your repository. Tell Ito which apps need coverage.
Configure your environment. Use an ito.yaml to define your preview environment URL or staging secrets.
Open a PR. Ito immediately begins the analysis and execution.

No test cases to write. No selectors to maintain. The agent learns your app as you build it.

What changes when every PR gets tested

When you move from post-merge firefighting to automated PR testing, the culture of the engineering team shifts:

Reviewers gain confidence. They no longer have to guess the side effects of a change. The behavioral evidence is right there in the comment.
The QA bottleneck disappears. Since automated regression testing happens pre-merge, the "QA phase" is no longer a separate, slow step at the end of the sprint.
Engineering velocity increases. When you catch a bug in a PR, it takes minutes to fix. When you catch it in production, it takes days. By catching it early, you reclaim 20–40% of your team's capacity.

The metrics of pre-merge testing

Code review tells you the change is well-written. Ito tells you the product still works. That's the gap we built Ito to close — and it's why behavioral PR testing belongs in the same pipeline as your linter and your code reviewer, not as a separate, slower step after the fact.

Frequently Asked Questions

How long does it take to test a pull request with Ito?

Most PRs are tested within minutes. Ito analyzes the diff, deploys a sandbox, and executes only the relevant flows so feedback lands inside the review window — not hours later.

Does PR testing replace code review?

No. It complements it. Code review is for architecture and maintainability; Ito is for behavioral validation. Reviewers use Ito's reports to see the effect of the code they're reading.

What happens when Ito finds a bug in a PR?

Ito posts a detailed failure report with a video recording, annotated screenshots, and the exact steps to reproduce the bug. The developer can fix the issue and push an update before a reviewer even opens the PR.

Sources

Microsoft Research (2021). How Long Will it Take to Mitigate this Incident for Online Service Systems? — A study of 20 large-scale Microsoft systems exploring the "semantic gap" and the high cost of incident mitigation.
DORA (2025). DevOps Research and Assessment 2025 Report — The industry-standard benchmark for the relationship between automated behavioral testing and lower Change Failure Rates (CFR).
arXiv (2026). Understanding Bug-Reproducing Tests: A First Empirical Study — Analysis of real-world Python systems demonstrating why traditional tests often fail to capture the specific behavioral sequences that cause production bugs.
