Experiment Designer: Posterior-Aware Lift Test Design#

Warning

This is experimental functionality. The current implementation works at the national level but does not yet support geo-level experiment design. We are very interested in hearing from users — would this be useful in your workflow? What directions should we prioritise? Please open an issue or start a discussion on the PyMC-Marketing GitHub with your feedback.

You’ve fitted a Marketing Mix Model — now what? You know you should run a lift test to validate the model, but you face a set of awkward design questions:

  • Which channel should you test?

  • At what spend level — where on the saturation curve should the test probe?

  • For how long — what duration balances statistical power against cost?

  • What is the expected lift, given all the uncertainty in the model?

Traditional experiment design tools (e.g. GeoLift) require the practitioner to guess the expected effect size up front — the very quantity the experiment is supposed to measure. This creates a chicken-and-egg problem.

The ExperimentDesigner sidesteps this by using the fitted MMM’s posterior distribution directly. The posterior already encodes what the model believes about each channel’s response function, including its uncertainty. For every candidate experiment (channel × spend change × duration), the designer pushes each posterior draw through the adstock and saturation transforms to produce a distribution of predicted lifts — not a single point estimate. From that distribution it computes Bayesian assurance (posterior-predictive power) and ranks candidates by a composite score that balances informativeness, detectability, and cost.

Tip

If you want to jump straight to the code, skip ahead to the Workflow section.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from pymc_marketing.mmm.experiment_design import (
    ExperimentDesigner,
    generate_experiment_fixture,
)

%config InlineBackend.figure_format = "retina"

FIG_WIDTH = 10
FIG_HEIGHT = 4

CHANNEL_COLORS = {"tv": "C0", "search": "C1", "social": "C2"}

How It Works#

This section explains the key ideas behind the ExperimentDesigner. Understanding these concepts will make the workflow outputs much easier to interpret.

The Core Insight: Using the Posterior Instead of Guessing#

Standard power analysis requires a minimum detectable effect (MDE) — the analyst must specify the effect size they expect the experiment to find. But if you already knew the effect size, you wouldn’t need the experiment.

The ExperimentDesigner resolves this by using the fitted MMM’s posterior. Each posterior draw defines a complete channel response function (adstock decay \(\alpha\), saturation half-life \(\lambda\), channel scale \(\beta\)). For a given spend change, every draw implies a different predicted lift. The result is a posterior distribution of predicted lifts rather than a single number.

This means the designer doesn’t need you to guess anything — it inherits the model’s beliefs and uncertainties about each channel. Channels the model is confident about will yield tight lift distributions; channels with wide posteriors will produce spread-out lift distributions that honestly reflect the risk that the true effect might be small.

Adstock-Aware Lift Prediction#

A spend change doesn’t produce its full effect overnight. In an MMM with geometric adstock, the effective (adstocked) spend ramps up over several weeks before reaching steady state. The designer models this ramp explicitly so that predicted lifts are realistic for short experiments.

The lift prediction pipeline works as follows:

  1. Steady-state adstocked spend. Given current weekly spend \(x\) and adstock decay \(\alpha\) with maximum lag \(L\), the steady-state adstocked spend is

\[x_{\text{ss}} = x \cdot \frac{1 - \alpha^{L}}{1 - \alpha}\]

(For normalised adstock where the weights sum to 1, \(x_{\text{ss}} = x\).)

  2. Adstock ramp. When spend changes at week \(t = 0\), the adstocked spend doesn’t jump to the new steady state immediately. At week \(t\) the partial geometric sum is

\[r_t = \frac{1 - \alpha^{t+1}}{1 - \alpha}\]

(for \(t + 1 \le L\); thereafter \(r_t\) stays at the steady-state sum \(S = (1 - \alpha^{L}) / (1 - \alpha)\)). The ramp fraction \(r_t / S\) tells us what share of the steady-state effect has been reached by week \(t\). Channels with slow decay (high \(\alpha\)) ramp slowly and need longer experiments.

  3. Effective spend per week. Combining the baseline and the ramped change:

\[x_{\text{eff},t} = x_{\text{ss}} + \Delta x \cdot r_t\]

where \(\Delta x\) is the absolute spend change.

  4. Weekly lift through saturation. Each week’s lift is the difference between the response at the new effective spend and the baseline response:

\[\ell_t = f(x_{\text{eff},t}) - f(x_{\text{ss}})\]

where \(f\) is the logistic saturation function \(f(x) = \beta\,(1 - e^{-\lambda x}) / (1 + e^{-\lambda x})\).

  5. Total lift. The cumulative lift over a \(T\)-week experiment is \(L = \sum_{t=0}^{T-1} \ell_t\). Because every quantity above depends on the posterior draw \((\lambda, \beta, \alpha)\), we obtain a full posterior distribution of \(L\).
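The five steps above can be condensed into a short per-draw sketch. This is an illustrative re-implementation — the function name and the `l_max=8` default are assumptions (matching the fixture used later in this notebook), not the library's internal code:

```python
import numpy as np


def predict_total_lift(x, dx, T, lam, beta, alpha, l_max=8):
    """Total cumulative lift over a T-week experiment for one posterior draw.

    x: current weekly spend; dx: absolute weekly spend change;
    lam, beta, alpha: one draw of the (saturation, scale, adstock) parameters.
    Assumes un-normalised geometric adstock with maximum lag l_max.
    """
    # Step 1: steady-state adstocked spend
    x_ss = x * (1 - alpha**l_max) / (1 - alpha)

    def f(z):
        # logistic saturation function
        return beta * (1 - np.exp(-lam * z)) / (1 + np.exp(-lam * z))

    total = 0.0
    for t in range(T):
        # Step 2: partial geometric sum, capped at the steady-state sum
        r_t = (1 - alpha ** min(t + 1, l_max)) / (1 - alpha)
        # Step 3: effective spend; Step 4: weekly lift through saturation
        weekly_lift = f(x_ss + dx * r_t) - f(x_ss)
        total += weekly_lift  # Step 5: accumulate the total lift
    return total
```

Evaluating this function once per posterior draw of \((\lambda, \beta, \alpha)\) yields the posterior lift distribution described above.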

Bayesian Assurance (Posterior-Predictive Power)#

With a posterior distribution of predicted lifts in hand, the next question is: if the true effect is drawn from this posterior, how likely is the experiment to detect it?

Traditional power analysis answers this for a single assumed effect size \(\theta_0\):

\[\text{Power}(\theta_0) = P\!\bigl(\text{reject } H_0 \mid \theta = \theta_0\bigr)\]

Bayesian assurance averages power over the entire posterior, weighting each possible effect size by how plausible the model considers it:

\[\text{Assurance} = \mathbb{E}_{\theta \sim \text{posterior}}\!\left[\text{Power}(\theta)\right]\]

In practice, this is computed per posterior draw. For each draw \(i\):

  1. Compute the predicted cumulative lift \(L_i\).

  2. Derive the non-centrality parameter: \(\text{ncp}_i = |L_i| \,/\, \sigma_d\), where \(\sigma_d = \sigma_\varepsilon \sqrt{T}\) is the cumulative measurement noise over \(T\) weeks.

  3. Evaluate two-sided power via the normal CDF.

The assurance is then the mean of per-draw power values across all posterior draws.

This produces natural behaviour:

  • Well-identified channels with large posterior effects yield high assurance.

  • Uncertain channels produce a spread of power values — the assurance honestly reflects the risk that the true effect may be too small to detect.

  • Posterior mass near zero pulls assurance down, correctly signalling that the experiment may fail.
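The per-draw calculation can be sketched in plain NumPy; `assurance` is a hypothetical name, the normal CDF is built from the error function, and the hard-coded critical value assumes a 5% two-sided test:

```python
import numpy as np
from math import erf, sqrt


def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def assurance(lifts, sigma_eps, T, z_crit=1.96):
    """Average two-sided power over posterior lift draws (sketch).

    lifts: predicted cumulative lift, one value per posterior draw.
    sigma_eps: weekly residual noise; z_crit = 1.96 for a 5% two-sided test.
    """
    sigma_d = sigma_eps * sqrt(T)  # cumulative measurement noise over T weeks
    powers = []
    for lift_i in np.asarray(lifts, dtype=float):
        ncp = abs(lift_i) / sigma_d  # per-draw non-centrality parameter
        # two-sided power under a normal test statistic
        powers.append(norm_cdf(ncp - z_crit) + norm_cdf(-ncp - z_crit))
    return float(np.mean(powers))
```

Because the non-centrality parameter uses the absolute lift, a spend decrease and increase of equal magnitude yield the same assurance — a property that matters when interpreting the rankings later.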

Scoring and Ranking#

Not every high-assurance experiment is worth running. A go-dark test on your largest channel might have excellent power but enormous opportunity cost. The designer ranks candidates by a weighted composite score across five normalised dimensions:

| Dimension | Default weight | What it captures |
| --- | --- | --- |
| Posterior uncertainty | 0.20 | Channels with wide posteriors — where we’d learn the most |
| Spend correlation | 0.10 | Channels contributing to the identification problem (correlated spend makes it hard to attribute effects) |
| Saturation gradient | 0.10 | Channels where a spend change produces a large response change (steep part of the curve) |
| Assurance | 0.30 | Experiments likely to produce a statistically detectable signal |
| Cost efficiency | 0.30 | Assurance per unit of spend disruption (see note below) |

Each dimension is min-max normalised to \([0, 1]\) across all candidates, then combined via a weighted sum. Candidates are sorted by score in descending order.

Note

How cost efficiency is computed. Cost efficiency is defined as \(\text{assurance}\;/\;|\text{net cost}|\), where net cost is the weekly spend change multiplied by the experiment duration, expressed in model-scale units. The absolute value means a $10k spend increase and a $10k spend decrease are treated as equally disruptive. This metric does not account for the sign of the cost (savings vs. additional spend), nor does it factor in the value of the expected lift on the outcome variable. It answers a narrow question: how much operational disruption must I tolerate to get a detectable signal? — not whether the experiment is profitable or which direction of spend change is preferable.

Before scoring, a signal-to-noise ratio (SNR) pre-filter removes candidates whose expected lift is too small relative to measurement noise. This avoids wasting ranking capacity on experiments that have little chance of succeeding regardless of other factors.

The weights are configurable via the score_weights parameter, so teams can shift the balance toward informativeness, detectability, or cost depending on their priorities.
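The filter–normalise–rank step can be sketched as follows; `rank_candidates`, its arguments, and the dict-of-arrays layout are assumptions for illustration, not the library's actual API:

```python
import numpy as np


def rank_candidates(metrics, weights, snr, min_snr=2.0):
    """Composite-score ranking sketch.

    metrics: dict of dimension name -> array of raw values, one per candidate.
    weights: dict of dimension name -> weight (normalised internally).
    snr: array of signal-to-noise ratios used for the pre-filter.
    Returns (candidate indices, scores), both sorted by score descending.
    """
    keep = np.abs(np.asarray(snr, dtype=float)) >= min_snr  # SNR pre-filter
    total_w = sum(weights.values())  # only relative proportions matter
    score = np.zeros(int(keep.sum()))
    for name, values in metrics.items():
        v = np.asarray(values, dtype=float)[keep]
        lo, hi = v.min(), v.max()
        # min-max normalise each dimension to [0, 1] across candidates
        v_norm = (v - lo) / (hi - lo) if hi > lo else np.zeros_like(v)
        score += (weights[name] / total_w) * v_norm
    order = np.argsort(score)[::-1]  # sort descending by composite score
    return np.flatnonzero(keep)[order], score[order]
```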

Workflow#

With the concepts established, we now walk through the full workflow using a synthetic fixture with three channels.

Step 1: Load a Fixture (or Use a Fitted MMM)#

The ExperimentDesigner can be created from:

  • A fitted MMM via ExperimentDesigner(mmm) — the primary workflow

  • A saved InferenceData via ExperimentDesigner.from_idata(idata) — for demos and testing

Here we generate a synthetic fixture with known ground-truth parameters for three channels:

| Channel | λ (saturation efficiency) | β (scale) | α (adstock decay) |
| --- | --- | --- | --- |
| TV | 0.5 (slow saturation) | 3.0 | 0.7 (slow decay) |
| Search | 2.0 (fast saturation) | 1.5 | 0.3 (fast decay) |
| Social | 1.0 (moderate) | 0.8 | 0.5 (moderate) |

idata = generate_experiment_fixture(
    channels=["tv", "search", "social"],
    true_params={
        "tv": {"lam": 0.5, "beta": 3.0, "alpha": 0.7},
        "search": {"lam": 2.0, "beta": 1.5, "alpha": 0.3},
        "social": {"lam": 1.0, "beta": 0.8, "alpha": 0.5},
    },
    fit_model=False,
    seed=42,
)

designer = ExperimentDesigner.from_idata(idata)
print(f"Channels: {designer.channel_columns}")
print(f"Posterior draws: {designer.n_draws}")
print(f"Adstock l_max: {designer.l_max}, normalize: {designer.normalize}")
Channels: [np.str_('tv'), np.str_('search'), np.str_('social')]
Posterior draws: 4000
Adstock l_max: 8, normalize: True

Step 2: Inspect Posterior Uncertainty#

Before recommending experiments, we examine two diagnostic questions:

  1. How correlated is spend across channels? High pairwise correlations make it difficult for the model to attribute the outcome to individual channels — this is the identification problem. Channels with high spend correlation are prime candidates for a lift test because an experiment breaks the correlation by construction.

  2. How uncertain is the model about each channel’s response function? Wide posteriors on the saturation parameters (\(\lambda\), \(\beta\)) or the adstock decay (\(\alpha\)) indicate channels where the model has not yet learned the true response. These are the channels where an experiment would be most informative.
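As a toy illustration of the second diagnostic, here is a simple posterior-width summary on synthetic draws — the channel names and spreads are made up, but with a fitted designer the same summary could be applied to its posterior samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical posterior samples of the saturation parameter per channel
samples = {
    "tv": rng.normal(0.5, 0.20, size=4000),      # wide posterior: poorly identified
    "search": rng.normal(2.0, 0.05, size=4000),  # tight posterior: well identified
}

# coefficient of variation as a crude width summary: wider => more to learn
cv = {ch: float(np.std(s) / abs(np.mean(s))) for ch, s in samples.items()}
```

A channel with a larger coefficient of variation is one where an experiment would move the posterior the most.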

Tip

For a quick programmatic summary, designer.plot_channel_diagnostics() produces a compact bar-chart overview of all scoring dimensions.

Hide code cell source

corr_df = designer._spend_correlation
mask = np.triu(np.ones_like(corr_df, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(FIG_WIDTH, FIG_HEIGHT))
sns.heatmap(
    corr_df,
    mask=mask,
    annot=True,
    fmt=".2f",
    vmin=-1,
    vmax=1,
    center=0,
    cmap="RdBu_r",
    square=True,
    ax=ax,
)
ax.set_title("Pairwise Spend Correlation")
fig.tight_layout()
plt.show()

Caption: Pairwise Pearson correlation of weekly spend across channels. Values near +1 indicate channels whose spend moves together, making it harder for the model to attribute the outcome to either channel individually. High-correlation pairs are strong candidates for a lift test.

Next, we examine what the model believes about each channel’s response function. The saturation parameters \(\lambda\) (half-life) and \(\beta\) (scale) jointly determine the shape of the diminishing-returns curve. A KDE contour plot in this parameter space reveals both the uncertainty and any degeneracies — channels with tight, well-separated contours are well identified, while broad or overlapping contours indicate the model struggles to pin down the response.

Hide code cell source

from matplotlib.patches import Patch

fig, ax = plt.subplots(figsize=(FIG_WIDTH, FIG_HEIGHT))
for ch in designer.channel_columns:
    p = designer._posterior_samples[ch]
    sns.kdeplot(
        x=p["lam"],
        y=p["beta"],
        ax=ax,
        color=CHANNEL_COLORS[ch],
        levels=5,
        fill=True,
        alpha=0.3,
    )
# Filled bivariate KDEs register no labelled artists, so build proxy legend handles.
handles = [
    Patch(color=CHANNEL_COLORS[ch], alpha=0.3, label=ch)
    for ch in designer.channel_columns
]
ax.set_xlabel(r"$\lambda$ (saturation half-life)")
ax.set_ylabel(r"$\beta$ (channel scale)")
ax.set_title(r"Joint Posterior: $\lambda$ vs $\beta$")
ax.legend(handles=handles)
fig.tight_layout()
plt.show()

Caption: Joint posterior density of saturation parameters \(\lambda\) (half-life) and \(\beta\) (channel scale) for each channel. Tighter contours indicate better-identified channels; broad or elongated contours suggest the model cannot distinguish between different \(\lambda\)-\(\beta\) combinations, making the channel a strong candidate for a lift test.

Hide code cell source

fig, ax = plt.subplots(figsize=(FIG_WIDTH, FIG_HEIGHT * 0.75))
for ch in designer.channel_columns:
    alpha_samples = designer._posterior_samples[ch]["alpha"]
    sns.kdeplot(
        alpha_samples, ax=ax, color=CHANNEL_COLORS[ch], label=ch, fill=True, alpha=0.3
    )
ax.set_xlabel(r"$\alpha$ (adstock decay)")
ax.set_ylabel("Density")
ax.set_title(r"Marginal Posterior: $\alpha$")
ax.legend()
fig.tight_layout()
plt.show()

Caption: Marginal posterior density of the adstock decay parameter \(\alpha\) for each channel. Higher \(\alpha\) means slower decay and longer carryover, which affects both the adstock ramp fraction and the minimum experiment duration needed to capture the full effect.

Step 3: Visualise Saturation Curves#

The saturation curve reveals where each channel sits on its diminishing-returns curve and how uncertain the model is about the shape. Channels whose HDI band is wide at the operating point are prime candidates for a lift test.

The x-axis is labelled Adstocked Spend rather than raw weekly spend. In the MMM, raw spend is first transformed by the adstock function (which accumulates decayed spend from previous weeks) before being passed through the saturation curve. What the saturation function “sees” is this accumulated value — the steady-state adstocked spend. This is the natural scale for the saturation curve because it’s the actual input to the diminishing-returns function. The vertical dashed line marks the current adstocked spend (i.e. the operating point), and the dotted lines show where candidate spend changes would land on the curve.

Note that each channel’s adstock behaviour is different — a channel with slow decay (\(\alpha = 0.7\)) accumulates much more carryover than one with fast decay (\(\alpha = 0.3\)), so their adstocked spend values can differ substantially even if their raw weekly spends are similar. The Adstock Ramp-up plot later in this notebook visualises this decay behaviour directly.
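A quick back-of-envelope using the fixture's α values and \(l_{\max} = 8\) (un-normalised geometric adstock assumed) shows how much more carryover a slow-decay channel accumulates from the same raw spend:

```python
# Steady-state adstocked spend implied by the same weekly spend
# under two adstock decay rates (un-normalised geometric adstock).
l_max = 8
raw_spend = 1.0

adstocked = {
    alpha: raw_spend * (1 - alpha**l_max) / (1 - alpha)
    for alpha in (0.7, 0.3)  # slow decay (TV-like) vs fast decay (search-like)
}
print(adstocked)  # slow decay accumulates roughly 2.2x more carryover
```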

fig, axes = plt.subplots(
    3, 1, figsize=(FIG_WIDTH, FIG_HEIGHT * 3), sharex=True, sharey=True
)

for ax, channel in zip(axes, designer.channel_columns, strict=True):
    designer.plot_saturation_curve(
        channel, spend_levels=[0.2, 0.5], ax=ax, color=CHANNEL_COLORS[channel]
    )

fig.tight_layout()
plt.show()

Caption: Posterior saturation curves for each channel. Shaded band is the 94% HDI across all posterior draws; vertical dashed line marks current weekly spend; dotted lines mark candidate spend levels (+20%, +50%).

Step 4: Recommend Experiments#

The recommend() method evaluates a grid of candidate experiments across all channels and returns a ranked list. Each candidate is defined by a channel, a fractional spend change, and a duration in weeks. The designer computes predicted lift, assurance, adstock ramp fraction, net cost, and composite score for each candidate as described in How It Works above.

The score_weights parameter controls how the five scoring dimensions are combined into the composite score. The default weights are shown in the code below. Adjusting these lets you shift the ranking toward different priorities — for example, increasing uncertainty will favour channels where the model is least certain (maximising learning), while increasing cost_efficiency will favour experiments that are cheap relative to their detectability. The weights are normalised internally, so only the relative proportions matter.

import time

spend_changes = [0.1, 0.2, 0.3, 0.5, -0.2, -0.5, -1.0]
durations = [4, 6, 8, 12]
n_candidates = len(designer.channel_columns) * len(spend_changes) * len(durations)
score_weights = {
    "uncertainty": 0.2,
    "correlation": 0.1,
    "gradient": 0.1,
    "assurance": 0.3,
    "cost_efficiency": 0.3,
}

t0 = time.perf_counter()
recommendations = designer.recommend(
    spend_changes=spend_changes,
    durations=durations,
    min_snr=2.0,
    significance_level=0.05,
    score_weights=score_weights,
)
elapsed = time.perf_counter() - t0

Hide code cell source

print(
    f"Evaluated {n_candidates} candidate experiments in {elapsed:.2f}s "
    f"({n_candidates / elapsed:.0f} candidates/sec)"
    f"\n{len(recommendations)} passed min_snr filter"
)
Evaluated 84 candidate experiments in 2.14s (39 candidates/sec)
77 passed min_snr filter

Recommendation Table#

The returned ExperimentRecommendations object renders as an HTML table automatically in Jupyter. Slicing returns a new container.

The first three columns define the experiment — these come from the grid of candidates you specify:

| Column | Meaning |
| --- | --- |
| Channel | Which marketing channel to test |
| Δ Spend | Fractional change to weekly spend (e.g. +20%/wk, or “go-dark” for a full shutdown) |
| Duration | How many weeks to run the experiment |

The remaining columns are outputs computed by the designer for each candidate:

| Column | Meaning |
| --- | --- |
| E[Lift] | Posterior mean of the total cumulative lift over the experiment (model-scale units) |
| Lift 94% HDI | 94% highest density interval of the lift distribution — shows the range of plausible outcomes |
| SNR | Signal-to-noise ratio: expected lift divided by measurement noise. Higher SNR means the signal is easier to detect |
| Assurance | Bayesian posterior-predictive power — the probability of detecting the lift, averaged over the posterior (see Bayesian Assurance) |
| Ramp | Adstock ramp fraction — what share of the channel’s steady-state effect is captured within the experiment duration. Low values mean the experiment is too short to see the full effect |
| Score | Weighted composite score used to rank candidates (see Scoring and Ranking) |

recommendations[:10]
| Rank | Channel | Δ Spend | Duration | E[Lift] | Lift 94% HDI | SNR | Assurance | Ramp | Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | search | -20%/wk | 4w | -0.37 | [-0.51, -0.24] | -5.5 | 0.99 | 0.89 | 0.896 |
| 2 | search | +20%/wk | 4w | 0.35 | [0.23, 0.47] | 5.2 | 0.99 | 0.89 | 0.894 |
| 3 | search | +10%/wk | 6w | 0.28 | [0.18, 0.37] | 3.4 | 0.89 | 0.93 | 0.860 |
| 4 | search | +10%/wk | 8w | 0.38 | [0.25, 0.51] | 4.0 | 0.95 | 0.95 | 0.855 |
| 5 | search | +30%/wk | 4w | 0.52 | [0.34, 0.69] | 7.7 | 1.00 | 0.89 | 0.831 |
| 6 | search | -20%/wk | 6w | -0.58 | [-0.79, -0.38] | -7.0 | 1.00 | 0.93 | 0.831 |
| 7 | search | +20%/wk | 6w | 0.55 | [0.36, 0.74] | 6.7 | 1.00 | 0.93 | 0.830 |
| 8 | search | +10%/wk | 12w | 0.58 | [0.38, 0.78] | 5.0 | 0.99 | 0.96 | 0.821 |
| 9 | search | +10%/wk | 4w | 0.18 | [0.12, 0.24] | 2.7 | 0.73 | 0.89 | 0.812 |
| 10 | search | -20%/wk | 8w | -0.79 | [-1.08, -0.51] | -8.3 | 1.00 | 0.95 | 0.796 |

Reading the Top Recommendation#

Each recommendation includes an auto-generated rationale explaining why it was ranked where it is:

top = recommendations[0]
print(top.rationale)
search is a high-priority test target (uncertainty rank #1) and high spend correlation with social (r = 0.58).
A -20%/wk change for 4 weeks produces an expected total lift of -0.37 (94% HDI: [-0.51, -0.24]) with assurance 0.99.
Adstock ramp fraction 0.89 — net cost: -0.31 (model-scale units).

Interpreting the Score#

Why might a small spend decrease rank above a large spend increase? The composite score optimises for information per unit of disruption, not for the largest lift. Two features of the scoring make this possible:

  • Assurance is direction-agnostic. The power calculation uses the absolute value of the predicted lift. A spend decrease produces a negative lift, but assurance measures the probability of detecting any effect — positive or negative. So a -20% change and a +20% change with the same magnitude of lift will have similar assurance.

  • Cost efficiency rewards cheap experiments. Cost efficiency is defined as assurance divided by the absolute net cost. A small, short spend decrease is a low-cost experiment (in fact, it saves money), so its cost efficiency can be very high even if the expected lift is modest.

The result is that the top-ranked experiment is often the one that is easiest to detect relative to how much it disrupts the business — not necessarily the one with the largest effect. If your goal is instead to maximise learning about an uncertain channel, increase the uncertainty weight in score_weights and decrease cost_efficiency.

Important

We want your feedback on cost efficiency. The current cost efficiency metric treats spend increases and decreases symmetrically — a test that saves $10k and one that costs $10k are scored equally. In practice, decision-makers may view these very differently: a spend reduction saves money while still generating a signal, whereas a spend increase requires additional budget approval. We are actively considering alternatives (e.g. signed cost efficiency, ROI-aware scoring, or asymmetric disruption penalties) and would love to hear what would be most useful in your workflow. Please share your thoughts via a GitHub issue or discussion.

Step 5: Diagnostic Plots#

Power vs. Cost#

This scatter plot shows assurance (y-axis) against absolute net cost (x-axis) for every candidate that passed the SNR filter. The best experiments are in the upper-left region: high probability of detecting the effect at low cost.

How to read the markers:

  • Colour encodes the channel.

  • Shape encodes the spend direction: triangle-up (\(\triangle\)) = spend increase, triangle-down (\(\triangledown\)) = spend decrease, X = go-dark.

Why absolute cost? The x-axis uses \(|\text{net cost}|\) rather than signed cost because the scoring treats spend disruption symmetrically — a $10k increase and a $10k decrease are equally disruptive to normal operations, even though one saves money. Decision-makers will of course care about the sign; the marker shape tells you that. But for ranking purposes, the magnitude of disruption is what matters: a small perturbation is easier to approve and operationalise regardless of direction.

fig, ax = designer.plot_power_cost(
    recommendations, colors=CHANNEL_COLORS, figsize=(FIG_WIDTH, FIG_HEIGHT)
)
plt.show()

Caption: Bayesian assurance vs. absolute net cost for all candidate experiments passing the SNR filter. Marker color indicates channel; marker shape indicates spend direction (triangle-up = increase, triangle-down = decrease, X = go-dark).

Lift Distributions#

Drilling into a single channel, we can inspect the full posterior distribution of predicted lift across different spend changes and experiment durations. Each panel shows a density plot of the cumulative lift — this is the distribution that the assurance calculation integrates over (see Bayesian Assurance).

The vertical dashed line at zero is the key reference. When the entire distribution sits clearly away from zero (i.e. the HDI does not overlap zero), the predicted signal is large relative to noise for most posterior draws, which translates to high assurance. Conversely, when the distribution straddles zero, many posterior draws imply a lift too small to detect, pulling assurance down. Reading across columns (increasing duration) shows how longer experiments shift the distribution further from zero as the adstock ramp captures more of the steady-state effect.

fig, axes = designer.plot_lift_distributions(
    "tv",
    spend_changes=[0.2, 0.5, -0.5, -1.0],
    durations=[4, 6, 8, 12],
    color=CHANNEL_COLORS["tv"],
    figsize=(FIG_WIDTH, FIG_HEIGHT * 3),
)
plt.show()

Caption: Posterior distribution of total predicted lift for the TV channel across spend changes (rows) and experiment durations (columns). Shaded region is the 94% HDI; vertical dashed line marks zero lift. Distributions whose HDI excludes zero correspond to high-assurance experiments; those straddling zero indicate the signal may be too small to detect reliably.

Adstock Ramp-up#

When you change spend on a channel, the adstocked (effective) spend doesn’t jump to its new level immediately — it ramps up over several weeks as the geometric decay accumulates the new spend level. The ramp fraction measures how much of the full steady-state effect has been captured by week \(t\) of the experiment (see Adstock-Aware Lift Prediction).

How to read the plot. The y-axis is the ramp fraction (0 = no effect yet, 1 = full steady-state reached). Each channel’s curve shows how quickly it approaches steady state, with a shaded uncertainty band from the posterior. Channels with fast adstock decay (low \(\alpha\), e.g. search) reach 1.0 within a few weeks, while slow-decay channels (high \(\alpha\), e.g. TV) may still be well below 1.0 after 12 weeks.

Why this matters for experiment design:

  • Minimum duration. If the ramp fraction for a channel is only 0.6 at your planned experiment length, the experiment will only capture ~60% of the true steady-state effect. This understates the lift, reduces the SNR, and lowers assurance. The plot tells you how long the experiment needs to be to see the full signal.

  • Interpreting results. If you run a 4-week test on a slow-decay channel and observe a small lift, the ramp fraction tells you whether that small lift is genuinely small or simply truncated by insufficient duration.

  • Channel comparison. Two channels may have similar expected lifts, but if one reaches steady state in 4 weeks and the other needs 12, the faster channel is a better candidate for a short experiment. This is already baked into the assurance and scoring, but the ramp plot makes the mechanism visible.
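To make the minimum-duration point concrete, here is a small sketch (helper names hypothetical, \(l_{\max} = 8\) as in the fixture) computing how many weeks each decay rate needs to reach 90% of its steady-state effect:

```python
def ramp_fraction(alpha, t, l_max=8):
    # share of the steady-state adstock effect reached by week t
    num = (1 - alpha ** min(t + 1, l_max)) / (1 - alpha)
    den = (1 - alpha**l_max) / (1 - alpha)
    return num / den


def weeks_to_reach(alpha, target=0.9, l_max=8, max_weeks=52):
    # smallest experiment length (in weeks) whose ramp fraction hits `target`
    for t in range(max_weeks):
        if ramp_fraction(alpha, t, l_max) >= target:
            return t + 1
    return None
```

With the fixture's values, fast-decaying search (α = 0.3) reaches 90% by week 2, while slow-decaying TV (α = 0.7) needs 6 weeks — exactly the asymmetry the ramp plot visualises.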

fig, ax = designer.plot_adstock_ramp(
    max_weeks=16, colors=CHANNEL_COLORS, figsize=(FIG_WIDTH, FIG_HEIGHT)
)
plt.show()

Caption: Adstock ramp fraction vs. experiment duration for each channel. Shaded band is the 89% posterior interval. Channels with higher adstock α approach steady state more slowly, requiring longer experiments to capture the full effect.

Using with a Fitted MMM#

In practice, you’d create the designer directly from a fitted MMM:

from pymc_marketing.mmm.multidimensional import MMM
from pymc_marketing.mmm import GeometricAdstock, LogisticSaturation
from pymc_marketing.mmm.experiment_design import ExperimentDesigner

mmm = MMM(
    date_column="date",
    channel_columns=["tv", "search", "social"],
    target_column="revenue",
    adstock=GeometricAdstock(l_max=8),
    saturation=LogisticSaturation(),
)
mmm.fit(X, y)  # X: DataFrame of dates and channel spends; y: target series

designer = ExperimentDesigner(mmm)
recommendations = designer.recommend()

The designer extracts everything it needs from the fitted model: posterior samples, current spend levels, residual noise, and spend correlations.

Summary#

The ExperimentDesigner provides a principled, posterior-aware approach to lift test planning. Key takeaways:

  • No guessing required. Unlike traditional power analysis, the designer uses the MMM’s posterior to derive a full distribution of predicted lifts for every candidate experiment — eliminating the need to specify an expected effect size up front.

  • Adstock-aware predictions. Lift predictions account for the geometric adstock ramp, so short experiments on slow-decaying channels are correctly penalised.

  • Bayesian assurance. Posterior-predictive power honestly reflects model uncertainty: well-identified channels yield high assurance, while uncertain channels produce conservative estimates.

  • Configurable scoring. The composite score balances informativeness, detectability, and cost. The score_weights parameter lets you tune the ranking to match your priorities.

  • Diagnostic visualisations. Power-vs-cost plots, lift distributions, and adstock ramp charts help you build intuition and communicate recommendations to stakeholders.

Warning

Causal identifiability matters. The designer recommends experiments based on the MMM’s posterior, but the posterior is only as good as the model. If the model has identifiability issues (e.g. highly correlated channels, confounding from unmeasured variables), the predicted lifts and assurance values may be misleading. A lift test can help resolve these issues, but only if the experiment is well-designed and cleanly executed. In particular:

  • Run one lift test at a time. Simultaneous experiments on multiple channels confound each other, making it impossible to attribute the observed effect to a single channel. If you must test multiple channels, stagger them with sufficient washout periods between experiments.

  • Plan for sufficient duration. The adstock ramp fraction tells you whether the experiment is long enough to capture the full effect. Cutting an experiment short because early results look promising (or disappointing) introduces bias.

Note

This is experimental functionality — we want your feedback. The ExperimentDesigner is under active development. We are keen to hear how you use it, what works, and what should change. Topics we are especially interested in:

  • Should cost efficiency treat spend increases and decreases differently?

  • What additional scoring dimensions would be useful (e.g. business-side constraints, seasonal timing)?

  • Would geo-level experiment design be valuable for your use case?

Please open an issue or start a discussion on the PyMC-Marketing GitHub. If you’d like hands-on support with lift test planning, calibration, or MMM development, PyMC Labs offers consulting and mentoring.