Return on Advertising Spend (ROAS) tells you how much revenue the platform claims your spend generated. Incrementality tells you how much of that would not have happened without your marketing efforts. If you optimise for the first and ignore the second, you are flying blind on actual business impact. As a marketing data analyst, your stakeholders don’t really care about ROAS, click-through rate (CTR), or view-through conversion rate (VTR). They care about whether their digital channels are creating incremental revenue, profit, and customer growth or just harvesting what would have happened anyway. Incrementality is the backbone of serious budget allocation and channel ROI decisions.
So what’s an incrementality test? An incrementality test is an experimental method used to measure the causal impact of a marketing action versus what would have happened without it. Instead of just comparing “before vs after” or “exposed vs not exposed” in raw platform data, you deliberately create a control group that is not touched by specific marketing efforts and compare its results to a treated group over the same period. By randomizing who goes into which group and holding the rest of the context as constant as possible, you isolate the incremental lift: the additional conversions, revenue, or profit that are truly generated by the marketing initiative, not by organic demand, brand equity, or other channels.
Your core mandate is therefore simple: to move the organisation from “platform performance” to “causal impact” and to choose the right incrementality method for each question. You’ll need to focus on the methods that give the most decision value for the least operational overhead. You do not need every causal tool on day one; you need a toolkit that can scale across your digital channels and whatever comes next. If you’re starting from a relatively “classic” performance marketing setup, a pragmatic implementation roadmap is: start with the user-level workhorses [1] as your default for measuring incrementality (conversion lift, A/B test, holdout), then fall back on cluster designs [2] when you can’t randomize cleanly at user level anymore (geo experiments, DiD), layer on optimization methods [3] once the previous basics are industrialised and trusted and only where the extra complexity is justified by the business stakes (uplift modeling, synthetic control, multivariate tests), and finally add portfolio-level modeling [4] to steer long-term budget allocation across channels (MMM).
1. The User-Level Workhorses
Conversion lift studies, holdouts and user-level RCTs are your everyday workhorses. These are the mainstream incrementality tools in digital marketing that you should lean on 80% of the time.
Standardize conversion lift studies for campaigns, tactics and setups run on specific platforms.
Conversion lift studies are randomized controlled trials (RCTs) at user level. You randomly assign individual users to a test or control group, with a split that is controlled by you and enforced by the delivery system. In practice, these experiments are often supported natively in platforms and configured directly in their experiment modules. This is the easiest path to clean incrementality answers and it feeds a simple, defensible narrative for CMOs with a high business impact.
Practical decision rule. I can randomize at user level in the platform.
Where it fits. You can use Google Ads Experiments to test an optimized PMax structure, new creatives and setups on YouTube, or any supported search and shopping configuration. You can also use Meta or TikTok A/B Tests to test new audiences, bidding strategies and funnel structures. You can even perform it on marketing automation platforms (MAP) for your emails (Marketo, Salesforce Marketing Cloud, HubSpot …).
Questions answered. Is this specific campaign really incremental vs what would we have without it? Which audiences, creatives, and funnel stages actually drive lift? Does a new bidding strategy or creative mix increase incremental installs or in-app revenue? Is this new shopping structure, new PMax setup, or new App campaign configuration incremental vs the current setup?
How to do it. Make it standard for big new campaigns, major tactics and key CRM journeys. In the platform’s experiment UI, switch to “lift” or “incrementality” mode where available. Lock the optimization model or bidding strategy upfront (targeted CPA, ROAS or conversions) and do not change it mid-test. Keep tuning rules stable: symmetric budgets across arms, no target changes in one arm only, no creative surgery halfway through. Define treatment (new setup) vs control (BAU or no-ads), plan the required sample size and test duration, let the bidding algorithm reach a steady state in both groups, then read incremental lift off the experiment results.
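Planning the required sample size upfront can be done with the standard two-proportion power approximation; a minimal sketch, where the baseline CVR and target lift are invented examples:

```python
import math

def sample_size_per_arm(p_control, rel_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate users needed per arm to detect a relative lift in
    conversion rate (two-sided alpha = 5%, power = 80%)."""
    p_test = p_control * (1 + rel_lift)
    p_bar = (p_control + p_test) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_control * (1 - p_control)
                                      + p_test * (1 - p_test))) ** 2
    return math.ceil(numerator / (p_test - p_control) ** 2)

# e.g. 2% baseline CVR, hoping to detect a +10% relative lift
n_per_arm = sample_size_per_arm(0.02, 0.10)
```

Note how quickly the requirement shrinks as the detectable lift grows: halving the minimum lift you care about roughly quadruples the audience you need, which is why small, low-traffic campaigns rarely support clean lift reads.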
Deploy persistent holdouts on key channels and programs.
A holdout is a persistent fraction of your audience never exposed to a given channel or program. This persistent control is a small initial pain to set up (yes, you are explicitly sacrificing potential short-term dollars), but it delivers a large long-term upside: it quickly exposes “vanity channels” that drive no net uplift and are not worth the budget, and it gives you an always-on barometer of channel-level incremental value.
Practical decision rule. I can randomize at user level in the platform + the channel/program is always-on + I can keep a stable holdout segment with no exposure to this channel across all campaigns and tactics.
Where it fits. We can use it with CRM, emails, SMS, push, in-app messages, retargeting, brand search, and selected always-on partners (affiliates, some PMax surfaces, some display/remarketing networks) where we can reliably exclude a subset of users, accounts, or markets.
Questions answered. Does this channel, as we run it today, actually move the needle vs does nothing? Are we actually generating incremental conversions, or mostly hitting users who would convert anyway? Does this channel create net new conversions, or just shift users from organic (SEO), direct (URL) and branded traffic (SEA)? Does this partner justify their commission vs core channels? What is the net incremental value of CRM communications over the lifecycle? How much is over-sending vs effective nurturing?
How to do it. Make persistent holdouts part of governance for every major always-on program, not a one-off test. For each relevant base (active customers, app users, site visitors), randomly assign 5–10% per key segment into a holdout cohort and lock membership over time. Implement hard suppression rules so the holdout never receives that channel/program: exclusion lists in CRM (contacts and cookies), negative audiences in ad platforms, no inclusion in lookalike/expansion seeds, no backfill via other campaigns using the same IDs. Keep platform algorithms and bidding strategies unchanged for the exposed population so the only structural difference is exposure vs no exposure. Track the same KPIs for holdout and exposed users (revenue, orders, installs, engagement) and report their deltas as a core KPI on dashboards. Finally, periodically review the composition and size of the holdout to ensure it remains representative and statistically powered. Adjust only through controlled re-randomization, and treat any decision to reduce or remove holdouts as an explicit trade-off between signal quality and short-term volume.
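One way to lock holdout membership over time is deterministic hashing rather than a stored random flag; a sketch under that assumption (the 10% share and the `program` naming are illustrative):

```python
import hashlib

def in_holdout(user_id: str, program: str, holdout_share: float = 0.10) -> bool:
    """Deterministic, stable holdout assignment: hash (program, user_id)
    into [0, 1) and compare against the holdout share. The same user
    always lands in the same group, with no state to store or sync."""
    digest = hashlib.sha256(f"{program}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:12], 16) / 16**12  # uniform in [0, 1)
    return bucket < holdout_share

# stable across runs and systems: suppression lists can be rebuilt anywhere
suppress = [u for u in ("u1", "u2", "u3") if in_holdout(u, "crm_email")]
```

Salting the hash with the program name gives each channel an independent holdout, so the same user can be held out of email but exposed to retargeting.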
Manually run A/B tests with a true “no-ads” control when no native lift or holdout is available.
A/B tests have the same causal logic as conversion lift, but they are implemented manually. You still create a treated group and a true control group that never sees the campaign, but you manage the split, exclusions and delivery rules yourself. This is the bread-and-butter experimental pattern when the platform doesn’t offer native lift or experiment modules, or when the native module is too limited for your design.
Practical decision rule. I cannot use native lift and experiment features + I can randomize at user level by controlling manually who is exposed and who is not.
Where it fits. Any channel, campaign, tactic or setup where you can define and target (or exclude) cohorts, but the platform doesn’t provide a proper lift experiment: some programmatic/display partners, certain affiliate or network buys, CRM systems without built-in holdouts, in-app placements, on-site experiences, even Google-Meta-TikTok when you want a custom design (firmographic clusters, specific account lists) that the experiment UI doesn’t support out of the box, and of course traditional campaigns.
Questions answered. Is this campaign, tactic or setup incremental vs no campaign at all on this population? Does adding this retargeting layer or this CRM journey generate net-new conversions vs letting these users convert organically? Does this specific reminder move the needle, or are we just paying for users who would convert anyway under BAU?
How to do it. Define the eligible universe (users, accounts or clusters), then randomly assign them into Group A (test) and Group B (true no-ads control) before launch. In your setup, target only Group A for the campaign and hard-exclude Group B everywhere (audience exclusions, negative lists, suppression lists), so the algorithm never sees them as eligible inventory. Same as for conversion lift, freeze tuning rules for the duration of the test: same optimisation model logic as you would normally use for this tactic, no different algorithms in test vs control, symmetric budgets on comparable populations, no last-minute targeting tweaks, no creative changes that only affect one group. Let the algorithm reach a steady state, then compare business KPIs (conversions, revenue, margin) between Group A and Group B to estimate incremental lift. If you can, run a simple lift model (difference-in-means with confidence intervals) rather than eyeballing raw deltas.
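The difference-in-means read mentioned above can be as simple as the following sketch (normal-approximation confidence interval; the toy arrays stand in for per-user KPI values):

```python
import math
from statistics import mean, variance

def lift_with_ci(test_kpi, control_kpi, z=1.96):
    """Difference-in-means lift estimate with a 95% normal-approximation CI."""
    diff = mean(test_kpi) - mean(control_kpi)
    se = math.sqrt(variance(test_kpi) / len(test_kpi)
                   + variance(control_kpi) / len(control_kpi))
    return diff, (diff - z * se, diff + z * se)

# toy per-user conversion flags: 12% CVR in test vs 10% in control
test_arm = [1] * 120 + [0] * 880
control_arm = [1] * 100 + [0] * 900
diff, (lo, hi) = lift_with_ci(test_arm, control_arm)
# if the CI straddles zero, you cannot claim incremental lift yet
```

In this toy example the point estimate is +2pp but the interval crosses zero, which is exactly the situation where eyeballing raw deltas would mislead you.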
2. The Cluster Methods
These methods are less straightforward and less robust. They are escalation paths for when you cannot do proper user-level randomization. In practice, that usually means the platform doesn’t support experiments, you cannot afford to run a manual A/B test, you cannot control who sees the ads (TV, radio, podcasts, outdoor advertising or OOH), or, last but not least, especially under GDPR, you face legal and technical constraints on reaching your prospects (no user IDs, no cookies, no consent to tag and segment people). To overcome these, you need to fall back on escalation paths such as geo experiments and DiD.
Set up geo experiment capability for both channels and campaigns.
Here, we still perform randomized experiments but at geo-level where countries, regions or cities are assigned to test or control with different budget and activation levels. It is the only realistic option when user-level randomization is not feasible or meaningful. Geo-level RCTs are more operationally heavy and less clean than user-level, but they remain critical to cover top-funnel and big offline-like channel spend.
Practical decision rule. I cannot randomize at user level + budget and activation decisions are made by market or region + new initiative that I can plan.
Where it fits. Upper funnel of cross-channel YouTube-PMax bundles where only some geos go “all-in”, large awareness media campaigns planned by country or region in Meta or TikTok, podcasts or audio with geo split, large affiliates and emerging channels with targeted segment launches, market-level “turn on / turn off” tests for new channels.
Questions answered. If I significantly increase or reallocate budget in some markets, what is the incremental business impact? Are these awareness and consideration campaigns incrementally moving revenue, installs or qualified leads at market level vs what would have happened anyway?
How to do it. Define the unit of randomization (country, region, city) and the primary business KPI at that same level (revenue, installs, leads). Build geo clusters that are roughly comparable on size, historical performance, seasonality and mix. When possible, pair similar markets together. Within each cluster, randomly assign one geo to test and one to control so that differences are driven by treatment, not structural bias. Lock your objective and setup upfront (platform optimisation models or bidding strategies) before deciding the treatment (increase budget by a predefined factor for instance). Use a clean pre-period to validate balance between arms and to calibrate your model. Run the test long enough to cover at least one full business cycle for that market (weeks, not days). Avoid mid-flight tuning that diverges between test and control (no changing targets or creative strategy in one side only). Read incremental impact as the difference in KPI trajectories between treated and control geos, adjusted for pre-period. Feed results back into budget allocation rules and into any higher-level portfolio model. Run 1–2 large geo tests per year on top-of-funnel budgets.
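The pairing-and-randomizing step can be sketched like this (the geo names and KPI values are made up, and ranking on a single pre-period KPI is a simplification of matching on size, seasonality and mix):

```python
import random

def paired_geo_assignment(pre_kpi, seed=7):
    """Sort geos by pre-period KPI, pair neighbours, then randomly pick
    one geo per pair as test and the other as control."""
    rng = random.Random(seed)
    ranked = sorted(pre_kpi, key=pre_kpi.get, reverse=True)
    groups = {}
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # coin flip within the matched pair
        groups[pair[0]] = "test"
        groups[pair[1]] = "control"
    return groups

geos = {"FR": 980, "DE": 1010, "ES": 450, "IT": 470, "PL": 210, "PT": 190}
groups = paired_geo_assignment(geos)
```

Pairing before randomizing keeps the two arms structurally comparable even with a handful of units, which raw randomization over six markets would not guarantee.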
Industrialize difference-in-differences (DiD) for product or specific channel changes.
DiD is a versatile ex-post tool. It offers a causal read after the fact when the business has already changed something and nobody planned a clean RCT ahead. For DiD to work, the change must be rolled out in some units (markets, regions, segments, stores) but not others. The method compares before/after trends between treated and control units under a key assumption: in the absence of the change, their trends would have evolved in parallel (the parallel trend assumption). If the change is applied everywhere at the same time, you are limited to descriptive before/after with no strong causal read.
Practical decision rule. I cannot randomize at user level + budget and activation decisions are made by unit + I cannot randomize at geo level because changes were already made + changes have only shipped to some units and there is a set of comparable untreated units + changes shipped with a clear rollout date.
Where it fits. A new or changed version of the product’s site, app, UX, landing pages, features, promotions, pricing rules, logistics, sales processes, bidding logic, structure, fraud and risk rules deployed to some users, stores or geos but not all. A channel or partner has been turned on/off or scaled only in some units and not others.
Question answered. What did the product or ops update actually do in the affected perimeter? What is the incremental impact of turning this channel/partner/program on or off in the units where it was rolled out?
How to do it. Build standard DiD templates including the models and notebooks. Clarify the outcome metric (revenue per user, conversion rate, installs), the unit of analysis (geo, store, segment), the rollout date, and the treated vs control flags. Estimate a basic DiD model with unit and time fixed effects and cluster-robust standard errors at unit level. Define a clean pre-period and post-period and perform pre-trend diagnostics. Systematically check pre-trends visually and via simple tests. If treated and control units diverge before the change, do not claim causal impact. Run sensitivity checks by varying pre/post windows, excluding obvious outliers and stress periods (like Black Friday), and comparing results across alternative outcome metrics.
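At its simplest, the DiD estimate is just four cell means; a stdlib-only 2x2 sketch (real templates would add unit and time fixed effects plus cluster-robust errors, and the toy numbers below are invented):

```python
from statistics import mean

def did_2x2(panel):
    """Canonical 2x2 DiD: (treated post - treated pre) minus
    (control post - control pre). Panel rows are tuples of
    (treated: bool, post: bool, outcome: float)."""
    cell = lambda t, p: mean(y for tr, po, y in panel if tr == t and po == p)
    return ((cell(True, True) - cell(True, False))
            - (cell(False, True) - cell(False, False)))

# toy panel: both arms drift +10 (the common trend), treated gains +5 on top
panel = [(False, False, 100.0), (False, True, 110.0),
         (True, False, 90.0), (True, True, 105.0)]
effect = did_2x2(panel)
```

The subtraction of the control arm's before/after delta is what removes the common trend; everything riding on the parallel-trends assumption is visible right there in the formula.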
3. The Optimization Layers
The following techniques are not measurement methods; they need to be treated as optimization layers built on top of sustainable incrementality experiments.
Pilot uplift modeling once the above is stable to optimize targeting for high-volume channels or campaigns.
This model estimates the incremental effect per user or segment and optimizes who we should treat. In other words, we use uplift modeling when the question shifts from “does this work?” to “for whom is this worth it?”. It offers high upside, but to be credible, it must sit on top of a mature experimentation culture. You must already have justified that your channel or tactic is incremental with historical RCTs, stable holdout experiments and robust data pipelines.
Practical decision rule. I’ve already proven the channel is incremental + I want to squeeze more ROI.
Where it fits. Use it with large CRM programs, push/app notifications, retargeting emails at scale. Use it to shut off low- or negative-incrementality segments like heavy purchasers that convert regardless and to throttle pressure for marginal segments.
Questions answered. Among all the users I could target, for whom is the effect really positive, and where am I wasting budget? Which users should we actually reach, how often, and who should we stop retargeting because the next touchpoint is no longer incremental? Given that the program is already proven incremental, how far can we push frequency and pressure before extra impressions and emails become mostly waste? Where is the tipping point between useful reminders and pure cannibalization?
How to do it. Start with CRM or retargeting, where you have both volume and historical experiments. Build a training dataset from past RCTs and holdouts with explicit treatment flags and outcomes over a fixed horizon. Engineer features that reflect user value and sensitivity (recency, frequency, monetary value, product mix, lifecycle stage, previous channel exposure, prior response to marketing). Choose an uplift modeling approach (two-model T-learner, X-learner, or dedicated uplift trees/forests) and use standard ML algorithms like gradient-boosted trees under the hood. Tune on out-of-sample incremental lift, not just AUC or log-loss, and apply regularization so the model doesn’t hallucinate uplift on tiny slices. Define clear treatment policies from the scores: who to always treat, who to sometimes treat (with caps), who to never treat. Then A/B test “uplift-based targeting” vs current targeting logic, keeping budgets, objectives and bidding strategies identical. Only scale once you see a clear, statistically robust gain in incremental revenue or profit, and refresh the model and thresholds on a regular cadence as new experiments come in.
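To make the T-learner logic concrete, a toy sketch that uses segment-level rate lookups in place of real gradient-boosted models (the segment names and conversion rates are invented):

```python
from collections import defaultdict
from statistics import mean

def fit_rate_model(rows):
    """Toy outcome 'model': conversion rate per segment. A stand-in for
    a real ML model fitted on rich user features."""
    by_seg = defaultdict(list)
    for seg, converted in rows:
        by_seg[seg].append(converted)
    return {seg: mean(flags) for seg, flags in by_seg.items()}

def t_learner_uplift(treated_rows, control_rows):
    """T-learner: fit one outcome model on the treated arm and one on the
    control arm; uplift = predicted rate if treated - predicted rate if not."""
    m_t, m_c = fit_rate_model(treated_rows), fit_rate_model(control_rows)
    return {seg: m_t[seg] - m_c.get(seg, 0.0) for seg in m_t}

# loyal users convert regardless; dormant users respond to the program
treated = ([("loyal", 1)] * 90 + [("loyal", 0)] * 10
           + [("dormant", 1)] * 30 + [("dormant", 0)] * 70)
control = ([("loyal", 1)] * 88 + [("loyal", 0)] * 12
           + [("dormant", 1)] * 5 + [("dormant", 0)] * 95)
uplift = t_learner_uplift(treated, control)
```

The loyal segment has a 90% conversion rate but near-zero uplift, while the dormant segment converts far less yet carries almost all the incremental value: exactly the distinction raw response models miss.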
Reserve synthetic control for strategic product or channel bets.
This is a refinement of the DiD family for cases with very few treated units and many candidate control units (markets, regions, segments, stores). It is extremely powerful but statistically heavy and sensitive to specification. Hence, it is not suitable for day-to-day decisions. Keep it very niche and use it only for a few big strategic questions, when the stakes are large budgets, core countries, and board-level decisions.
Practical decision rule. Same as DiD + only one/few treated markets + many potential control units with good pre-period data.
Where it fits. Large media plans and strategic brand pushes (YouTube, PMax or cross-channel bundles) in a single or very few countries. Major structural changes in one core market (new proposition, pricing architecture, distribution strategy) where you have multiple other markets that did not change.
Question answered. We made a big move in one key market or on one asset: what would have happened otherwise?
How to do it. Select one treated unit and a pool of candidate control units with long, clean pre-period data on the same KPIs. Choose predictors like past outcomes, macro indicators, channel mix and relevant firmographics. Estimate non-negative weights on control units that minimise the pre-period gap between the weighted control and the treated unit, with weights constrained to sum to one. Lock this model specification before looking at post-treatment data. Build a synthetic twin from multiple control units to approximate one treated unit’s counterfactual. Check fit quality in the pre-period. If the synthetic twin does not closely track the treated unit historically, don’t trust the counterfactual. Once fit is acceptable, compare post-period outcomes of the treated unit to its synthetic twin to estimate incremental impact. Run placebo tests by pretending each control unit was “treated” in turn, to validate that the observed gap in the true treated unit is large relative to noise and not something that appears everywhere.
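A minimal sketch of the weight-fitting step, using exponentiated-gradient descent on the simplex in place of the usual constrained quadratic-programming solver (the donor names and KPI series are invented):

```python
import math

def synthetic_control_weights(y_treated, donors, eta=0.05, iters=20000):
    """Fit non-negative donor weights summing to 1 that minimise the
    pre-period gap to the treated series. Exponentiated gradient keeps
    the weights on the simplex at every step."""
    names = list(donors)
    periods = range(len(y_treated))
    w = {n: 1.0 / len(names) for n in names}
    for _ in range(iters):
        synth = [sum(w[n] * donors[n][t] for n in names) for t in periods]
        resid = [y_treated[t] - synth[t] for t in periods]
        grad = {n: -2.0 * sum(resid[t] * donors[n][t] for t in periods)
                for n in names}
        w = {n: w[n] * math.exp(-eta * grad[n]) for n in names}
        total = sum(w.values())
        w = {n: w[n] / total for n in names}  # renormalise onto the simplex
    return w

# pre-period KPIs: the treated market is exactly 0.6*A + 0.4*B
donors = {"A": [1.0, 1.1, 1.2, 1.3, 1.4],
          "B": [0.5, 0.5, 0.6, 0.6, 0.7],
          "C": [2.0, 1.9, 1.8, 1.7, 1.6]}
y = [0.6 * a + 0.4 * b for a, b in zip(donors["A"], donors["B"])]
w = synthetic_control_weights(y, donors)
```

The non-negativity and sum-to-one constraints are what make the synthetic twin an interpolation of real markets rather than an arbitrary extrapolation, which is why the fitted weights themselves are worth reading.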
Plan A/B and multivariate testing (A/B/n) in classic CRO style.
Do not confuse this with the manual A/B testing described earlier. This one tests relative performance within the treated world (creative A vs B, landing page X vs Y, journey variant 1 vs 2). A/B/n is classic conversion rate optimization (CRO): there is no true “no-ads” control arm, so this is not a pure incrementality test. It optimizes what you show, not whether showing something is incremental at all. Hence it runs continuously inside channels that are already proven incremental and focuses on creatives, landing pages, UX, messaging, and frequency caps.
Practical decision rule. I’ve already proven the channel is incremental + I want to know which version wins inside that incremental envelope.
Where it fits. Meta, TikTok and YouTube creative testing; search and shopping ad copy and extensions; landing page and funnel experiments on web and app; paywall / pricing page variants; email subject lines and templates; CRM journeys where you keep the same logic but vary content or layout.
Questions answered. Within this incremental channel, which variant gives the best incremental performance? Which creative, landing page, UX flow or message converts best given that we are already paying to reach these users? At a fixed objective (CPA, ROAS, conversion rate), which configuration should become the new default?
How to do it. Make A/B and multivariate tests a standard, always-on practice in channels where incrementality has already been validated by lift or holdout. In your testing stack, define a small number of clear variants (A/B or A/B/C rather than A/Z), fix the optimization objective and bidding model upfront. Run the test inside the same optimization model and bidding strategy the platform already uses (tCPA, tROAS, Maximize Conversions). Split traffic randomly and symmetrically across variants. Avoid mid-test tuning that biases the result, like sending better traffic to one variant or changing budgets and bids in only one arm. Let the delivery algorithm stabilise for each variant, run the test until you hit the required sample size, then select the winner and promote it to the new baseline before starting the next test.
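Declaring a winner once the required sample size is reached can be a plain two-proportion z-test; a sketch with invented variant counts:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates,
    using the pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# variant B at 5.6% CVR vs variant A at 4.8%, on 10k users each
z = two_proportion_z(480, 10_000, 560, 10_000)
winner_is_significant = abs(z) > 1.96  # two-sided 5% threshold
```

Crucially, run the test to the planned sample size and read it once; peeking at the z-score every day and stopping on the first significant read inflates the false-positive rate.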
4. The Portfolio Level Modeling
Once the user-level, cluster, and optimization layers are in place, you still need a macro view that arbitrates spend across channels and markets over quarters, not just between tactics inside a platform. That’s where portfolio-level modeling comes in.
Make Marketing Mix Modeling (MMM) your portfolio steering layer.
Marketing Mix Modeling is a portfolio-level macro method. It uses econometric models at weekly or monthly level to estimate each channel’s contribution to business outcomes and its diminishing returns. It sits above campaign-level methods and is used to steer budget allocation and high-level strategy. It should be reconciled with evidence from lift, holdouts, geo and DiD instead of being treated as a separate truth.
Practical decision rule. I need to (re)allocate significant budget across channels/markets over quarters or years + I have at least 1–2 years of reasonably clean time-series data + I want a top-down view that complements experiment results rather than replaces them.
Where it fits. MMM is used to understand the portfolio effect of search, shopping, PMax, app campaigns, YouTube, Meta, TikTok, affiliates, podcasts, CRM and offline channels together, including the impact of seasonality, price, promotions and macro factors. It is the right layer when the question is “How should I split 10M across channels and markets next quarter?” not “Is this specific campaign lift-positive?”.
Questions answered. What is the incremental contribution of each channel at the margin, controlling for other channels, price and seasonality? Where are the diminishing returns by channel and by market? What is the expected impact of shifting budget from channel A to channel B? What is the optimal budget mix under a given spend constraint or target (revenue, profit, installs)? How do top-of-funnel and offline channels contribute over longer time horizons where user-level experiments are blind?
How to do it. Make MMM your portfolio steering layer, not your first-line truth. Build a weekly or monthly panel that includes spend by channel and sub-channel, business KPIs (revenue, orders, installs), control variables (seasonality, promotions, price, macro, competition proxies) and clear channel definitions aligned with how you actually buy media. Specify a response model with adstock (carryover) and saturation curves per channel (regularised regression or Bayesian hierarchical models with non-linear terms), and lock sane priors or constraints so the algo doesn’t attribute absurd ROAS to noise channels. Tune the model via out-of-sample validation, cross-validation and stability checks, and use ground truth from experiments (lift, holdouts, geo, DiD) to calibrate or anchor key channels instead of letting the model free-run. Once the model is stable, use it with an optimisation layer (solver) to generate recommended budget allocations and scenarios, refresh it on a quarterly cadence, and treat its outputs as input to planning discussions, always cross-checked against experimental evidence and business reality.
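The adstock and saturation transforms at the heart of the response model can be sketched as follows (the decay rate and half-saturation point below are illustrative defaults, not calibrated values):

```python
def adstock(spend, decay=0.5):
    """Geometric carryover: each period's effective pressure is current
    spend plus a decayed share of the accumulated past pressure."""
    carried, out = 0.0, []
    for s in spend:
        carried = s + decay * carried
        out.append(carried)
    return out

def saturate(pressure, half_sat=100.0):
    """Hill-type diminishing returns: response approaches 1 as effective
    pressure grows, reaching half the maximum at `half_sat`."""
    return [x / (x + half_sat) for x in pressure]

# a spend burst keeps working (decayed) after the flight ends
response = saturate(adstock([100.0, 0.0, 0.0, 0.0]))
```

In a full MMM these transformed series feed the regression per channel, and the decay and half-saturation parameters are themselves estimated (or constrained by priors), which is where experiment-based calibration anchors the model.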