WooCommerce CRO Technique

Does my WooCommerce store have enough traffic to A/B test?

This technique sizes a WooCommerce A/B test before launch by taking the real baseline rate from GA4, choosing a minimum detectable effect that would genuinely matter to the business, and calculating the per-variant sample and run-length in advance.

Summary

Bottom Line: Do not launch a WooCommerce A/B test until you know the baseline, the minimum detectable effect, and the per-variant sample you need to reach.

  • Sample size gets expensive very quickly as the effect gets smaller. In this setup, required sample is roughly proportional to 1 / MDE², so halving the target lift from 20% to 10% needs about 4× the sample, not twice the sample.
  • Higher-frequency ecommerce actions such as add_to_cart or begin_checkout are much cheaper to test than purchase, because their baseline rates are usually higher. GA4 explicitly supports all three events.
  • Write down the hypothesis, baseline, MDE, sample per variant, planned duration, and stop rule before launch. On a normal fixed-horizon test, checking significance early and stopping early inflates false positives.
  • If the maths says the test will take many months, the answer is usually not “run it anyway”. The better answer is to change the metric, widen the eligible audience, or accept a larger swing worth detecting.

How To Implement

  • Define the exact WooCommerce surface and the exact primary metric

    Write down whether you are testing a single product template, a category archive, the cart, the checkout, or a sitewide/landing-page surface. If it is cart or checkout, note whether the store is using Cart & Checkout Blocks or the classic shortcode pages, because the implementation surface is different even though the sample-size maths is the same. WooCommerce’s own docs show that Cart and Checkout blocks are edited in the editor, can be transformed to classic shortcodes, and have different extensibility rules from the shortcode flow.

  • Check that the metric is actually measured in GA4 and available on the tested surface

    For retail stores, GA4’s ecommerce model supports add_to_cart, begin_checkout, and purchase. If you cannot trust the event, you cannot trust the baseline. Measurement note: if you mark a new event as a key event today, that changes reporting from the time of creation and does not backfill historic data, so do not size a test from a partial or newly-created key event history.

  • Pull the real baseline from GA4 using the same denominator your testing setup uses

    For page-entry tests, use the Landing page report and filter to the actual page path or path pattern; GA4 classifies this report as session-scoped and lets you add dimensions such as Session source / medium. For step-specific journeys, build a Funnel exploration in Explore → Funnel exploration. If you need key-event counts by surface, GA4 reports and explorations support Key events and Session key event rate.

  • Keep the denominator consistent across GA4 and your test tool

    GA4 exposes both sessionKeyEventRate:event_name and userKeyEventRate:event_name. Nelio, by contrast, documents its test conversion rate as conversions ÷ page views after variant assignment. If you size from a GA4 session-based baseline but judge results in a page-view-based tool metric, you are mixing units and your sample estimate will be wrong. If you cannot align them exactly, note the mismatch in the test brief and use the same denominator all the way through decision-making.
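To make the unit mismatch concrete, here is a minimal sketch with invented counts for one tested page. The numbers, variable names and the `approx_sample_per_variant` helper are all hypothetical; the helper uses the 15.7 shortcut quoted later in this guide. The point is only that the same conversions produce three different baselines, and therefore three different sample requirements.

```python
from math import ceil

# Hypothetical counts for one tested page over the same date range.
conversions = 120    # e.g. purchase events attributed to the page
sessions = 6_000     # session denominator (GA4 session key event rate)
users = 4_800        # user denominator (GA4 user key event rate)
page_views = 9_000   # page-view denominator (a tool that divides by views)


def approx_sample_per_variant(p: float, mde: float) -> int:
    """Shortcut n = 15.7 * (1 - p) / (p * MDE^2), rounded up."""
    return ceil(15.7 * (1 - p) / (p * mde ** 2))


baselines = {
    "per session": conversions / sessions,
    "per user": conversions / users,
    "per page view": conversions / page_views,
}

# Same conversions, same target lift, different denominators.
samples = {
    label: approx_sample_per_variant(rate, mde=0.10)
    for label, rate in baselines.items()
}

for label, rate in baselines.items():
    print(f"{label:>14}: baseline {rate:.2%}, about {samples[label]:,} per variant")
```

Whichever denominator the testing tool reports on is the one to size from; in this made-up example the page-view baseline needs close to twice the sample of the user baseline.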

  • Choose the smallest relative lift that would genuinely matter to the business

    This is your MDE. Use a relative value, not an absolute percentage-point wish. In practical terms, the question is not “what uplift would be nice?”, but “what is the smallest swing worth shipping, QA’ing and keeping?” VWO’s help explicitly notes that smaller detectable effects require more precision and therefore more sample.

  • Calculate the per-variant sample before launch

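    For a two-variant equal split, two-sided 5% alpha and 80% power, the standard large-sample two-proportion formula simplifies to a usable approximation: n ≈ 15.7 × (1 − p) / (p × MDE²) per variant, where p is the baseline rate as a decimal and MDE is the relative lift as a decimal. The constant comes from 2 × (1.96 + 0.84)² ≈ 15.7, and the shortcut treats both variants as having the baseline variance, so at low baseline rates like these it lands slightly below what exact calculators return. Worked examples: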

    • p = 0.02, MDE = 0.10 ⇒ about 76,930 per variant by approximation, which is close to the ~80k territory returned by exact calculators.
    • p = 0.02, MDE = 0.20 ⇒ about 19,233 per variant by approximation, roughly the ~20k rule of thumb.
    • p = 0.10, MDE = 0.10 ⇒ about 14,130 per variant, which shows why add-to-cart tests are often far more feasible than purchase tests.

    You can run this in a standalone calculator such as Evan Miller, or in a vendor calculator such as VWO. VWO’s public calculator is explicit that its main version is tied to its enhanced SmartStats engine and links to a separate classic calculator.
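If you would rather script the calculation than use a web calculator, the sketch below implements the 15.7 shortcut alongside the unpooled large-sample two-proportion formula it is derived from. The function names are illustrative, and vendor calculators may use different statistical engines, so treat the output as a planning estimate rather than a match for any specific tool.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_per_variant(p: float, mde: float) -> int:
    """Shortcut from the text: n = 15.7 * (1 - p) / (p * MDE^2), rounded up."""
    return ceil(15.7 * (1 - p) / (p * mde ** 2))


def exact_sample_per_variant(
    p: float, mde: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Unpooled large-sample formula for two proportions, two-sided test.

    p is the baseline rate and mde the relative lift, both as decimals.
    """
    p2 = p * (1 + mde)                              # variant rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    variance = p * (1 - p) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p) ** 2)


for p, mde in [(0.02, 0.10), (0.02, 0.20), (0.10, 0.10)]:
    approx = approx_sample_per_variant(p, mde)
    exact = exact_sample_per_variant(p, mde)
    print(f"p={p:.2f}, MDE={mde:.0%}: approx {approx:,}, exact {exact:,} per variant")
```

The two functions should agree to within a few percent at these baselines; if they diverge sharply for your inputs, trust the fuller formula or a dedicated calculator.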

  • Convert the sample into a planned run-length and write it down

    Divide required visitors per variant by the page’s expected eligible visitors per variant per day, not by total site traffic. Then write down the sample, the expected duration, and the stop rule in the brief before the test starts. VWO’s calculator and help centre both frame sample size and duration together for exactly this reason.
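The division described above can be sketched in a few lines. The function name and the traffic figure are hypothetical, and rounding up to whole weeks is an added assumption (a common convention so that each weekday is represented equally), not something this guide requires.

```python
from math import ceil


def planned_run_days(
    sample_per_variant: int,
    eligible_visitors_per_day: float,
    variants: int = 2,
    whole_weeks: bool = True,
) -> int:
    """Days needed for every variant to reach its sample on an equal split.

    eligible_visitors_per_day is traffic to the tested surface only,
    not total site traffic.
    """
    per_variant_per_day = eligible_visitors_per_day / variants
    days = ceil(sample_per_variant / per_variant_per_day)
    if whole_weeks:
        # Assumption: round up to full weeks to cover weekday/weekend cycles.
        days = ceil(days / 7) * 7
    return days


# Example: 76,930 per variant with 1,500 eligible visitors a day on the page.
days = planned_run_days(76_930, eligible_visitors_per_day=1_500)
print(f"Planned run-length: {days} days ({days // 7} weeks)")
```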

  • If the duration is impractical, redesign the test before you launch

    The usual moves are: test a higher-frequency metric such as add_to_cart or begin_checkout; test a broader surface with more eligible traffic; or accept a larger MDE. What usually does not work is launching a low-baseline purchase test and hoping the maths becomes kinder later.
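One way to compare these redesign options is to invert the 15.7 shortcut and ask what lift the available traffic can actually detect. The sketch below does that; the function name, the traffic figure and the eight-week window are illustrative assumptions.

```python
from math import sqrt


def detectable_mde(p: float, sample_per_variant: float) -> float:
    """Relative lift detectable at 5% alpha / 80% power, from n = 15.7(1-p)/(p*MDE^2)."""
    return sqrt(15.7 * (1 - p) / (p * sample_per_variant))


# Hypothetical store: 1,000 eligible visitors a day, two variants, 8-week cap.
sample_available = 1_000 / 2 * 56

for metric, baseline in [("purchase", 0.02), ("add_to_cart", 0.10)]:
    mde = detectable_mde(baseline, sample_available)
    print(f"{metric:>12}: baseline {baseline:.0%}, smallest detectable lift about {mde:.1%}")
```

If the detectable lift for the purchase metric is larger than anything the change could plausibly deliver, that is the signal to switch metric or surface rather than launch.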

  • Launch with guardrails, not with “wait and see”

    On a normal fixed-horizon design, only read the main result when the pre-set sample is reached or a pre-declared stop condition fires. If you are using Nelio, remember its Required Sample Size and Required Confidence settings are thresholds in the UI, not a substitute for store-specific power planning. If you are using VWO, be explicit whether the analysis is fixed-horizon or sequential/SmartStats.

  • QA the split and event firing during the test, but do not call winners early

    If the allocation looks off, treat it as a data-quality problem first. Microsoft’s experimentation team documents sample ratio mismatch as a real failure mode caused by assignment, execution, logging, or biased analysis issues.
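A chi-square goodness-of-fit test is one common way to check whether an observed split is consistent with the intended allocation. The sketch below uses only the standard library; the function name and the 0.001 alert threshold are assumptions for illustration, not settings taken from any tool named in this guide.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a: int, visitors_b: int, expected_share_a: float = 0.5) -> float:
    """p-value of a 1-degree-of-freedom chi-square test on the traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (
        (visitors_a - expected_a) ** 2 / expected_a
        + (visitors_b - expected_b) ** 2 / expected_b
    )
    # For one degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


SRM_ALERT = 0.001  # assumed threshold; pick and document your own before launch

for a, b in [(10_000, 10_100), (10_000, 10_500)]:
    p_value = srm_p_value(a, b)
    status = "investigate assignment/logging" if p_value < SRM_ALERT else "split looks plausible"
    print(f"{a:,} vs {b:,}: p = {p_value:.4f} -> {status}")
```

This checks allocation only. It says nothing about which variant is winning, so running it during the test does not count as peeking at the primary metric.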

How To Measure

The KPI for this technique is simple: does the experiment reach its pre-registered per-variant sample within the planned window and return a decision you can trust? The live business KPI then depends on the test itself: usually conversion rate or revenue per visitor (RPV) for purchase-led tests, average order value (AOV) for basket-value tests, or checkout completion for cart/checkout tests. In GA4, use the ecommerce events that match the hypothesis (add_to_cart, begin_checkout, and purchase) and read them in a filtered Landing page report for entry-page tests or a Funnel exploration for step-based journeys. Success is not “the graph looked promising”; success is that the test reached the planned sample and either produced a credible decision or was stopped for a pre-declared guardrail.

Read the result in the same segment you used for sizing: the same landing page or template, the same device mix where relevant, and the same acquisition mix if the experiment is traffic-source-specific. GA4’s Landing page report supports session-scoped analysis and secondary dimensions such as Session source / medium, which is useful if the eligible audience is not evenly distributed.

Guardrail metrics must not get worse. If the primary KPI is conversion rate, keep RPV and AOV in view so you do not “win” on orders while losing value. If the test touches cart or checkout flows, keep checkout completion as a guardrail or primary KPI where appropriate. If the experiment changes front-end rendering, include LCP, INP and CLS because Google defines Core Web Vitals around loading, interactivity and visual stability, and recommends good scores for search success and user experience.

Pitfalls

  • Mistake: mixing denominators across tools. GA4 can report session-based or user-based key event rates, while Nelio’s test conversion rate is conversions divided by page views. A clean-looking sample estimate can still be wrong if the baseline and the reporting denominator do not match.
  • Mistake: chasing a tiny uplift on a low-baseline metric. At a 2% purchase baseline, moving from a 20% MDE to a 10% MDE does not double the sample need; it roughly quadruples it. That is why low-traffic stores often get far more value from testing add_to_cart or begin_checkout first.
  • Myth: you can just stop when one variant “looks significant”. In a fixed-horizon test, naïve peeking and early stopping raise the false-positive rate. If you want optional stopping, you need a sequential method that explicitly supports it.
  • Myth: if you leave the test running long enough, the tool will eventually reveal the winner. Not necessarily. Long duration does not repair a badly sized test, and VWO explicitly distinguishes between fixed-horizon and sequential approaches because the analysis rules are different.
  • Edge case: large GA4 numbers are not always exact counts. Google states that Active Users and Sessions in GA4 are approximated with HyperLogLog++ at scale, so on very large properties, exact exports or BigQuery may be better for baselines where precision matters.
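The peeking myth above is easy to demonstrate with a small A/A simulation, where both arms share the same true rate and every "significant" result is a false positive. This is a rough sketch: the rate, number of looks, batch size and seed are arbitrary assumptions, and the pooled z-test stands in for whatever fixed-horizon test your tool runs.

```python
import random
from math import sqrt


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def simulate_aa(rate=0.10, looks=10, per_look=200, runs=500, seed=1):
    """Return (false-positive rate if you stop at any look, rate at the final look only)."""
    rng = random.Random(seed)
    any_look_hits = final_look_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed_any = False
        significant = False
        for look in range(1, looks + 1):
            # Add one batch of visitors per arm, then "peek" at the result.
            conv_a += sum(rng.random() < rate for _ in range(per_look))
            conv_b += sum(rng.random() < rate for _ in range(per_look))
            significant = abs(z_stat(conv_a, conv_b, look * per_look)) > 1.96
            crossed_any = crossed_any or significant
        any_look_hits += crossed_any
        final_look_hits += significant  # value from the last look only
    return any_look_hits / runs, final_look_hits / runs


peeking_rate, fixed_rate = simulate_aa()
print(f"Stop at first 'significant' look: {peeking_rate:.1%} false positives")
print(f"Read once at the planned sample:  {fixed_rate:.1%} false positives")
```

Because the final look is also one of the peeks, the stop-at-any-look rate can never be lower than the fixed-horizon rate, and in practice it is usually well above the nominal 5%.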

Examples

FAQs

Sources & Further Reading


About This Page

  • Written By: Eliot Webb – Founder & WooCommerce CRO Consultant
  • Last Reviewed: 18 Jun 2026
  • Last Updated: