
A/B testing for email: What to test, how to test it, and how to read the results

Joey Lee

Illustration of two people comparing A and B options, representing A/B testing and experiment-driven decision making in email marketing optimization.

The problem with most email testing programs is that they test wrong: checking results too early, stopping at the wrong time, and confusing statistical significance with practical significance.

If you’re peeking at test results daily and calling a winner the moment the dashboard shows “significant,” your false positive rate is closer to 26% than the 5% you think you’re running. That means roughly one in four of your “winning” variants isn’t actually better. You’ve been implementing noise.

Here’s how to build a testing practice that produces reliable signals.

The peeking problem

Classical hypothesis testing assumes you set a sample size in advance, run the test to completion, then check the results once. Most email teams do the opposite, checking results daily and stopping the test if they happen to notice the p-value dips below 0.05.

This is called the peeking problem, and it’s well-documented. Each time you check results, you’re running a new test on the accumulated data. Over the course of a test, the cumulative false positive rate compounds. Armitage, McPherson, and Rowe demonstrated that continuous monitoring inflates the Type I error rate dramatically. What should be a 5% false positive rate becomes 26% or higher.
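To see the inflation concretely, here’s a minimal A/A simulation (no real difference between variants) with a peek after each day. All numbers are illustrative assumptions: 300 subscribers per variant per day, a 3% conversion rate, and two weeks of daily peeks.

```python
import math
import random

def z_two_proportions(c1, n1, c2, n2):
    """Two-proportion z-statistic with a pooled standard error."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0
    return (c1 / n1 - c2 / n2) / se

def peeking_false_positive_rate(trials=500, days=14, per_day=300, p=0.03, seed=1):
    """Simulate A/A tests and peek daily, stopping at the first |z| > 1.96.

    Because both variants share the same true rate, every 'significant'
    result is a false positive; the return value is that rate.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ca = cb = na = nb = 0
        for _ in range(days):
            ca += sum(rng.random() < p for _ in range(per_day))
            cb += sum(rng.random() < p for _ in range(per_day))
            na += per_day
            nb += per_day
            if abs(z_two_proportions(ca, na, cb, nb)) > 1.96:  # the daily "peek"
                hits += 1
                break
    return hits / trials

rate = peeking_false_positive_rate()
print(f"False positive rate with daily peeking: {rate:.0%}")
```

Run as-is, this lands well above the nominal 5%; add more peeks (longer tests, more frequent checks) and the rate keeps climbing toward the figures Armitage and colleagues reported.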

Two solutions: fix your sample size in advance and don’t check until you reach it, or use sequential testing methods designed for continuous monitoring (such as alpha spending functions or always-valid p-values). If your platform supports sequential testing (both Iterable and Braze have moved in this direction), use it.

Sample size reality

Most email A/B test calculators ask for three inputs: your baseline conversion rate, the minimum detectable effect (MDE), and your desired statistical power (typically 80%). The output is the sample size per variant.

Here’s where lifecycle marketing teams run into trouble. For a typical email campaign with a 3% click-through rate and a desire to detect a 10% relative improvement (from 3.0% to 3.3%), you need roughly 47,000 subscribers per variant at 95% confidence and 80% power. That’s 94,000 total.
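The textbook normal-approximation formula behind these calculators is simple enough to sketch in a few lines of stdlib Python. Note that different calculators use slightly different approximations (pooled variance, continuity corrections), so expect figures in the same ballpark as the numbers above rather than exact matches; this version lands a bit higher than 47,000 for the 10% case.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided, two-proportion z-test
    (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 10% relative lift (3.0% -> 3.3%) vs. 20% relative lift (3.0% -> 3.6%)
print(sample_size_per_variant(0.030, 0.033))
print(sample_size_per_variant(0.030, 0.036))
```

The second call shows why relaxing the MDE helps so much: doubling the detectable effect cuts the required sample to roughly a quarter, since sample size scales with the inverse square of the effect size.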

If your list is 50,000, you can’t run that test. Your options:

  • Increase the MDE: Accept that you can only detect larger effects. A 20% relative improvement (3.0% to 3.6%) cuts your required sample to roughly 12,000 per variant. You’ll miss subtle improvements, but you’ll reliably detect meaningful ones.

  • Use Bayesian methods: Bayesian A/B testing doesn’t require a fixed sample size. It updates a probability distribution as data comes in and reports the probability that variant B outperforms variant A. More useful for small lists, but requires careful interpretation. “87% probability of being better” is not the same as “statistically significant.”

  • Aggregate across sends: If you send the same test across multiple campaigns (same subject line test, for instance), pool the results. This is legitimate as long as the audience and conditions are consistent.
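The Bayesian option above is easy to sketch with a Beta-Binomial model: each variant’s click-through rate gets a Beta posterior, and Monte Carlo sampling estimates the probability that B beats A. This is a minimal illustration assuming uniform Beta(1, 1) priors and made-up click counts, not any particular platform’s implementation.

```python
import random

def prob_b_beats_a(clicks_a, n_a, clicks_b, n_b, draws=100_000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.

    Each variant's posterior is Beta(clicks + 1, non-clicks + 1);
    we sample both and count how often B's draw exceeds A's.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(clicks_a + 1, n_a - clicks_a + 1)
        b = rng.betavariate(clicks_b + 1, n_b - clicks_b + 1)
        if b > a:
            wins += 1
    return wins / draws

# Illustrative: 300/10,000 clicks for A vs. 345/10,000 for B
p = prob_b_beats_a(300, 10_000, 345, 10_000)
print(f"P(B > A) = {p:.2f}")
```

Note how this output invites the caution in the bullet above: a posterior probability in the 90s can feel decisive on a sample where a frequentist test would not yet reach significance.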

Bayesian vs. frequentist: when each approach fits

Bayesian testing has gained popularity in email partly because platforms like Klaviyo and Braze have adopted it, and partly because it feels more intuitive. A statement like “92% chance variant B wins” is easier to act on than “p = 0.03.”

But there’s a misconception that Bayesian methods are immune to peeking. They’re not. If you repeatedly check a Bayesian test and stop the moment the posterior probability crosses your threshold, you inflate the decision error rate in a similar way. The math is different, but the behavioral problem is the same.

Use frequentist when: you have sufficient sample size, can commit to a fixed duration, and need a clean, defensible result, particularly for high-stakes tests where you’re making a permanent template change.

Use Bayesian when: your list is smaller, you need to make faster directional decisions, or you’re testing within automated flows where data accrues continuously and a hard stop date isn’t practical.

What to test beyond subject lines

Subject line tests are where most teams start and, unfortunately, where most stop. Subject lines matter, but they’re only one lever, and they’re among the noisiest to test because open rate data is unreliable post-Apple Mail Privacy Protection.

A more productive testing priority framework:

High impact, lower noise: CTA copy and placement, offer structure (percentage vs. dollar discount, free shipping vs. discount), content block ordering (product-first vs. story-first), and send time.

Medium impact: From name (brand name vs. personal name vs. department), preheader text, email length (short and punchy vs. long-form), and number of CTAs.

Lower impact but worth testing: Image vs. no image, social proof placement, urgency language, and personalization depth.

Prioritize tests that affect click-through rate and conversion rate over open rate. After Mail Privacy Protection, open rates are inflated and unreliable as a decision metric, a reality that also affects how intelligent inboxes rank your content.

Testing in flows vs. campaigns

Campaigns and automated flows have fundamentally different testing dynamics, and conflating them is a common mistake.

Campaigns are one-off sends to a defined audience. You know the audience size in advance, can calculate the required sample, and have a clear start and end. Classical A/B testing works well here.

Flows accrue data continuously. A welcome series A/B test might get 50 new entries per day. You don’t have a fixed audience; you have a stream. This changes the statistics. Fixed-horizon tests don’t apply cleanly. Sequential testing methods or Bayesian approaches are better suited.

The other challenge with flow testing: subscriber composition shifts over time. The people entering your welcome flow in January may differ from those entering in March (different acquisition channels, different seasons, different promotions). If you run a flow test for six months, the early results and late results are measuring different populations. As Optimizely’s stats engine documentation notes, this population drift is a real threat to validity. Set a review cadence (monthly or quarterly) to assess whether flow test results still hold.

How long to run tests

The minimum duration for any email A/B test is two full weeks, regardless of when you hit statistical significance. Here’s why: email engagement varies by day of week. A test that starts on Tuesday and ends on Thursday has only captured weekday behavior. If your subscribers behave differently on weekends (and they do), your results are biased.

Two weeks captures two full cycles of day-of-week variation. For flow tests, the minimum is longer, typically four to six weeks, to accumulate sufficient sample size at the flow’s entry rate.
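For flow tests, the binding constraint is usually the entry rate, not the calendar. A quick back-of-envelope helper (numbers below are hypothetical) shows how a modest per-variant sample requirement translates into weeks of traffic:

```python
import math

def weeks_to_power(n_per_variant, entries_per_day, variants=2):
    """Weeks of flow traffic needed to fill an A/B test at a given entry rate."""
    total = n_per_variant * variants
    return math.ceil(total / entries_per_day / 7)

# Illustrative: a welcome flow with 50 entries/day and a test that
# needs ~1,500 subscribers per variant
print(weeks_to_power(1_500, 50))  # -> 9 (weeks)
```

Nine weeks at 50 entries per day, for a fairly forgiving sample target: this is why four to six weeks is a floor, not a typical duration, for low-traffic flows.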

Reaching significance early is not permission to stop. Early significance is often driven by novelty effects or sample composition artifacts that wash out over time. Pre-commit to a duration and honor it.

Interaction effects and simultaneous tests

Running two A/B tests on overlapping audiences at the same time corrupts both results. If you’re testing subject line A vs. B on one campaign and CTA copy C vs. D on another, and both campaigns hit the same subscribers, the CTA test results are confounded by which subject line each subscriber saw.

Three ways to handle this:

Sequential testing: Run one test at a time. Simplest, but slow.

Audience isolation: Split your audience into non-overlapping test cells. Subscriber pool A sees subject line test; subscriber pool B sees CTA test. Reduces sample size per test but eliminates interaction effects.

Multivariate testing: Test all combinations simultaneously (A+C, A+D, B+C, B+D). Requires 4x the sample size but reveals interaction effects. You might discover that subject line B only outperforms when paired with CTA D.
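Audience isolation is straightforward to implement with deterministic hashing, so a subscriber always lands in the same cell without storing assignments. This is a sketch of the idea, not any platform’s API; the function and test names are hypothetical.

```python
import hashlib

def assign_cell(subscriber_id: str, test_name: str, cells: list[str]) -> str:
    """Deterministically assign a subscriber to one test cell.

    Hashing id + test name gives a stable, roughly uniform split.
    Using a different test_name per experiment produces independent
    splits, so each test can be confined to its own pool.
    """
    digest = hashlib.sha256(f"{test_name}:{subscriber_id}".encode()).hexdigest()
    return cells[int(digest, 16) % len(cells)]

# Isolate audiences first, then run each test only inside its own pool
pool = assign_cell("subscriber-123", "q3-isolation", ["subject-test", "cta-test"])
print(pool)
```

The same helper covers the multivariate case: pass all four combinations (`["A+C", "A+D", "B+C", "B+D"]`) as the cell list and every subscriber gets exactly one.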

Platform testing capabilities

| Capability | Klaviyo | Braze | Iterable | Customer.io |
| --- | --- | --- | --- | --- |
| Campaign A/B | Yes, up to 8 variants | Yes, multivariate | Yes, multivariate | Yes, 2 variants |
| Flow/journey A/B | Yes, branching | Canvas experiments | Yes, workflow splits | Yes, A/B branching |
| Statistical method | Bayesian | Frequentist + confidence | Bayesian + frequentist | Frequentist |
| Auto-winner | Yes, configurable | Yes, winning variant | Yes, configurable | Yes |
| Holdout groups | Manual via segments | Native canvas holdouts | Native experiment holdouts | Manual via segments |

Incrementality testing: the test that matters most

Individual A/B tests tell you which variant is better. Incrementality testing tells you whether your emails are driving any additional value at all.

The design is straightforward: take a random subset of your audience (10–15%) and suppress all email to them for 30–90 days. Continue sending to the remaining 85–90% as normal. At the end of the period, compare conversion rates, revenue, and engagement between the groups.

The difference is your email program’s true incremental lift. For most programs, this number is somewhere between 5% and 25%, significantly lower than what last-touch attribution reports suggest.
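The lift calculation itself is one line; the hard part is keeping the holdout genuinely untouched for the full window. A sketch with illustrative counts:

```python
def incremental_lift(conv_treated, n_treated, conv_holdout, n_holdout):
    """Relative incremental lift of the mailed group over a no-email holdout."""
    rate_treated = conv_treated / n_treated
    rate_holdout = conv_holdout / n_holdout
    return (rate_treated - rate_holdout) / rate_holdout

# Illustrative: 90,000 mailed with 2,520 conversions (2.8%);
# 10,000 held out with 250 conversions (2.5%)
lift = incremental_lift(2_520, 90_000, 250, 10_000)
print(f"Incremental lift: {lift:.0%}")  # -> Incremental lift: 12%
```

In this hypothetical, last-touch attribution would credit email with all 2,520 conversions, but the holdout shows most of those customers would have converted anyway; the true incremental contribution is 12%.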

This is the test that justifies your email program’s existence to leadership and exposes the real contribution of email in your channel mix. It’s also a healthy complement to your QA practice, because testing what you send is as important as testing whether it renders. Run it annually, at minimum.

Building a testing roadmap

Testing programs tend to mature through four levels.

Level 1, ad hoc: Testing happens when someone remembers. No documentation, no pre-registration, no sample size calculation. Results are anecdotal.

Level 2, structured: Tests are pre-registered with a hypothesis, sample size target, and duration. Results are documented and shared. One test runs at a time.

Level 3, systematic: A testing backlog is maintained and prioritized by expected impact. Multiple tests run simultaneously with audience isolation. Guardrail metrics (unsubscribe rate, complaint rate) are monitored alongside primary metrics.

Level 4, optimized: Testing feeds a learning repository. Past results inform future hypotheses. Incrementality tests run annually. Multivariate testing and ML-driven send time optimization supplement manual testing.

Most lifecycle teams operate at Level 1 or 2. Moving to Level 3 is where the compounding returns begin, because each well-designed test builds knowledge that makes the next test more productive.