Methodology

The GeoStep library is built on the foundation of the Randomised Controlled Trial (RCT), the gold standard for causal inference. This document provides an overview of the experimental designs and statistical concepts that underpin the library.

The Gold Standard: Randomised Controlled Trial (RCT)

An RCT is an experimental design that aims to reduce bias when testing a new intervention. In the context of marketing, the “intervention” is typically an advertising campaign, promotional offer, or other marketing treatment.

Core Principle: Randomization

The fundamental principle is randomization. A population of units (in our case, geographic areas) is randomly assigned to one of two groups:

Treatment Group: This group is exposed to the intervention (e.g., they see the ad campaign)
Control Group: This group is not exposed to the intervention (business as usual)

Because the assignment is random, the two groups are comparable in expectation, so the only systematic difference between them is the intervention itself. Any statistically significant difference in the outcome (e.g., sales) between the two groups can therefore be attributed to the intervention.
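As a minimal sketch of random assignment (the function name `assign_groups` and the geo labels are hypothetical, not GeoStep's API), shuffling the list of geos with a seeded generator and splitting it in half gives a reproducible treatment/control allocation:

```python
import random

def assign_groups(geos, seed=0):
    """Randomly split geos into equally sized treatment and control groups."""
    rng = random.Random(seed)      # dedicated, seeded generator for reproducibility
    shuffled = list(geos)          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}

# Hypothetical geo identifiers, for illustration only.
groups = assign_groups(["geo_%02d" % i for i in range(10)], seed=42)
```

Recording the seed alongside the result makes the allocation auditable after the test.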

Figure: Treatment effect timeline.

Why RCTs Matter for Marketing

The Problem with Observational Data:

Correlation ≠ Causation: High-spending customers may see more ads, but ads may not cause the spending
Selection Bias: Campaigns often target specific audiences, confounding the results
Unobserved Confounders: External factors (seasonality, competition) can influence outcomes

The RCT Solution:

Eliminates Selection Bias: Random assignment ensures comparable groups
Controls for Confounders: Both known and unknown factors are balanced across groups in expectation
Measures True Incrementality: Isolates the causal effect of marketing

Advanced Designs

The Stepped-Wedge Trial (SWT)

The SWT is an alternative to the classic parallel-arm RCT. In an SWT, all clusters begin in the control condition. Over time, groups (or “steps”) of clusters are randomly assigned to switch over to the treatment condition. This continues until all clusters are in the treatment group.
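The rollout just described can be sketched as a treatment schedule. This is an illustrative sketch only: `stepped_wedge_schedule` is a hypothetical helper, and it assumes one all-control baseline period followed by `n_steps` periods, with clusters spread evenly across steps.

```python
import random

def stepped_wedge_schedule(clusters, n_steps, seed=0):
    """Return {cluster: [0/1 per period]} for a stepped-wedge rollout.

    Periods are numbered 0..n_steps. Period 0 is an all-control baseline, and
    one randomly chosen group of clusters switches to treatment at each step.
    """
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)                       # random order decides who steps first
    schedule = {}
    for idx, cluster in enumerate(order):
        step = idx % n_steps + 1             # first period in which this cluster is treated
        schedule[cluster] = [1 if period >= step else 0 for period in range(n_steps + 1)]
    return schedule

schedule = stepped_wedge_schedule(["A", "B", "C", "D", "E", "F"], n_steps=3, seed=1)
```

Each row is a cluster's treatment indicator over time: all zeros in the first period, all ones in the last, and a single switch in between.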

Advantages for Marketing:

Logistically Practical: Allows for a phased rollout of a campaign, which is often more feasible than a “big bang” launch
Statistically Powerful: Because every cluster is observed in both states, it serves as its own control
Ethically Favorable: Eventually all units receive the treatment (important for beneficial interventions)

The Staircase Design

The Staircase design is a more pragmatic and efficient variation of the SWT, formally described by Grantham et al. (2024).

Figure: Staircase design state transitions. State diagram showing the progression of a staircase experiment from initialization through baseline period, stepped treatment rollout, and final analysis. Each state represents a distinct phase with specific data collection requirements.

Figure: Staircase design matrix.

The Problem with SWT: A full SWT requires data collection from every cluster during every single time period, which can be expensive and burdensome
The Staircase Solution: Each cluster only provides data for a limited number of periods just before and just after it “steps” into treatment

This focuses data collection on the most statistically valuable periods, significantly reducing operational cost without major loss of power. GeoStep is the first marketing analytics library to formally implement this cutting-edge design.
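To make the idea concrete, the following sketch marks which cluster-period cells a staircase design would observe. The helper `staircase_mask`, its `pre` and `post` parameters, and the three-cluster rollout are assumptions for illustration, not GeoStep's API.

```python
def staircase_mask(switch_period, n_periods, pre=1, post=1):
    """Mark which (cluster, period) cells are observed in a staircase design.

    switch_period maps each cluster to the first period in which it is treated.
    A cluster is observed for `pre` periods before its switch and for `post`
    periods starting at the switch.
    """
    mask = {}
    for cluster, step in switch_period.items():
        mask[cluster] = [step - pre <= period < step + post for period in range(n_periods)]
    return mask

switch_period = {"A": 1, "B": 2, "C": 3}     # hypothetical rollout over 4 periods
mask = staircase_mask(switch_period, n_periods=4)
observed = sum(sum(row) for row in mask.values())
fraction_observed = observed / (len(mask) * 4)
```

In this toy layout each cluster contributes two of four periods, so half of the full stepped-wedge data matrix is collected. The saving in a real design depends on the number of steps and on the `pre` and `post` windows chosen.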

Key Benefits:

40-60% reduction in data collection requirements
Minimal power loss (<10% in most scenarios)
Flexible implementation across different business contexts

Core Analysis Methods

Figure: The complete statistical analysis pipeline showing data flow from raw inputs through validation, analysis, and reporting. The sequence diagram illustrates how different components interact to produce robust experimental results.

1. Normalized Lift Index Method (Primary)

The primary analysis method used in GeoStep is based on a t-test on normalised values:

Baseline Calculation: For each geographic unit, calculate its average performance during the pre-test period
Projection: Use this to project what performance would have been during the test period without intervention
Lift Index: Calculate the percentage difference between actual and projected performance:
```
Lift Index = (Actual Test Period / Projected Test Period) - 1
```
Statistical Test: Use a t-test with cluster-robust standard errors to test if treatment lift differs from control
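The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the projection is just the pre-period mean carried forward, the test is Welch's t statistic on one lift index per geo rather than a cluster-robust estimator, and all names and numbers are hypothetical.

```python
import math
import statistics

def lift_index(pre_values, test_values):
    """Lift index for one geo, projecting the pre-period mean onto the test period."""
    projected = statistics.mean(pre_values) * len(test_values)   # projected test-period total
    return sum(test_values) / projected - 1

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means (a minus b)."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)

# Hypothetical data: (pre-period values, test-period values) per geo.
treatment = [([100, 100], [110, 112]), ([50, 50], [55, 54]), ([200, 200], [215, 221])]
control = [([80, 80], [80, 82]), ([120, 120], [119, 121]), ([60, 60], [61, 60])]

treatment_lifts = [lift_index(pre, test) for pre, test in treatment]
control_lifts = [lift_index(pre, test) for pre, test in control]
t_stat = welch_t(treatment_lifts, control_lifts)
```

Because each geo is divided by its own projection, the 50-unit and 200-unit geos contribute lifts on the same scale.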

Why Normalization Works:

Accounts for Size Differences: Large and small geos are compared fairly
Controls for Trends: Pre-period performance captures underlying trends
Robust to Outliers: Percentage changes are less sensitive to extreme values

2. Difference-in-Differences (DiD) Analysis

As a robustness check, GeoStep also implements DiD analysis:

Treatment Effect = (Treatment_Post - Treatment_Pre) - (Control_Post - Control_Pre)

This double-differencing removes:

Time-invariant differences between treatment and control
Common time trends affecting both groups
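A minimal worked example of the double difference on group means, with hypothetical numbers and a hypothetical function name:

```python
import statistics

def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences on group means."""
    treat_change = statistics.mean(treat_post) - statistics.mean(treat_pre)
    control_change = statistics.mean(control_post) - statistics.mean(control_pre)
    return treat_change - control_change

# Treatment rises by 20 on average and control by 5, so the estimate is 15.
effect = did_estimate([100, 110], [120, 130], [90, 100], [95, 105])
```

The control group's change of 5 stands in for the common time trend, which is subtracted from the treatment group's change of 20.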

3. Cluster Randomized Trial (CRT) Analysis

For stepped-wedge and staircase designs, we use mixed-effects models:

Fixed Effects: Time periods and treatment status
Random Effects: Geographic clusters
Cluster-Robust SEs: Account for within-cluster correlation
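One common way to write such a model is the formulation of Hussey and Hughes (2007). Whether GeoStep fits exactly this specification is an assumption; the notation below is illustrative.

```latex
Y_{ij} = \mu + \beta_j + \theta X_{ij} + \alpha_i + \varepsilon_{ij},
\qquad \alpha_i \sim N(0, \tau^2), \quad \varepsilon_{ij} \sim N(0, \sigma_e^2)
```

Here $Y_{ij}$ is the outcome for cluster $i$ in period $j$, $\beta_j$ is the fixed effect of period $j$, $X_{ij}$ is 1 if cluster $i$ is treated in period $j$ and 0 otherwise, $\theta$ is the treatment effect, and $\alpha_i$ is the random intercept for cluster $i$. Under this model the intra-cluster correlation is $\rho = \tau^2 / (\tau^2 + \sigma_e^2)$.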

Statistical Interpretation

Understanding Effect Sizes

GeoStep reports multiple measures of effect:

Lift Percentage: The relative increase (e.g., 2.91% lift)
Raw Volume Uplift: The absolute increase (e.g., 558 additional units)
Cohen’s d: Standardized effect size for meta-analysis
Hedges’ g: Bias-corrected standardized effect size
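The two standardized measures can be computed as in the sketch below, which uses the pooled standard deviation for Cohen's d and the usual approximate small-sample correction for Hedges' g. The function names and data are illustrative, not GeoStep's API.

```python
import math
import statistics

def cohens_d(treatment, control):
    """Cohen's d using the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = ((n_t - 1) * statistics.variance(treatment)
                  + (n_c - 1) * statistics.variance(control)) / (n_t + n_c - 2)
    return (statistics.mean(treatment) - statistics.mean(control)) / math.sqrt(pooled_var)

def hedges_g(treatment, control):
    """Hedges' g: Cohen's d times the approximate small-sample correction factor."""
    n = len(treatment) + len(control)
    correction = 1 - 3 / (4 * n - 9)
    return cohens_d(treatment, control) * correction

d = cohens_d([12, 14, 16], [10, 12, 14])   # mean difference 2, pooled SD 2
g = hedges_g([12, 14, 16], [10, 12, 14])   # d shrunk by the factor 1 - 3/15
```

With only a handful of geos per group the correction is noticeable, which is why the bias-corrected value is reported alongside d.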

Significance Levels

Results include clear significance indicators:

P-Value Range	Indicator	Interpretation
p < 0.001	`***`	Highly significant
p < 0.01	`**`	Very significant
p < 0.05	`*`	Significant
p ≥ 0.05	(not significant)	Insufficient evidence of an effect
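The thresholds in the table above map to indicators as in this small sketch (`significance_indicator` is a hypothetical helper name):

```python
def significance_indicator(p_value):
    """Map a p-value to the indicator used in the significance table."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "(not significant)"
```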

Confidence Intervals

We provide 95% confidence intervals by default:

Interpretation: If the experiment were repeated many times, about 95% of intervals built this way would contain the true effect
Practical Use: If CI excludes zero, the effect is statistically significant
Business Impact: Use the lower bound for conservative planning
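As a rough illustration of these points, the sketch below builds a 95% interval for a difference in mean lift using a normal approximation. This is an assumption made for simplicity: with few geos, a t-based critical value would give a wider interval. The function name and data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_diff_ci(treatment, control, level=0.95):
    """Normal-approximation confidence interval for mean(treatment) - mean(control)."""
    diff = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treatment) / len(treatment)
                   + statistics.variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2)     # about 1.96 for a 95% interval
    return diff - z * se, diff + z * se

# Hypothetical per-geo lift indices.
lower, upper = mean_diff_ci([0.11, 0.09, 0.09, 0.10], [0.01, 0.00, 0.02, 0.01])
is_significant = lower > 0 or upper < 0          # True when the interval excludes zero
```

For conservative planning, `lower` is the value to carry into a business case.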

Power Analysis Methodology

Determining Sample Size

Power analysis helps determine how many geographic units and how long to run your test:

Figure: Power analysis comparison.

Key factors affecting power:

Effect Size: Larger effects are easier to detect
Sample Size: More geos increase power
Variance: Lower variance increases power
Test Duration: Longer tests capture more data
Significance Level: Standard is α = 0.05

Minimum Detectable Effect (MDE)

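The MDE is the smallest effect your test can reliably detect. In the approximation below, the multiplier 2.8 is roughly 1.96 + 0.84, the standard normal quantiles for a two-sided α = 0.05 and 80% power: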

MDE ≈ 2.8 × (σ / √n) × √(1 + ρ(m-1))

Where:

σ = standard deviation of the outcome
n = number of geographic units
ρ = intra-cluster correlation
m = number of time periods per cluster
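The formula can be evaluated directly. In this sketch the multiplier is derived from α and the target power rather than hard-coded; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def minimum_detectable_effect(sigma, n, rho, m, alpha=0.05, power=0.80):
    """MDE from the approximation above, with the multiplier derived from alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    design_effect = 1 + rho * (m - 1)               # inflation from within-cluster correlation
    return (z_alpha + z_power) * (sigma / math.sqrt(n)) * math.sqrt(design_effect)

# Hypothetical inputs: SD of 0.10, 25 geos, ICC of 0.05, 5 periods per cluster.
mde = minimum_detectable_effect(sigma=0.10, n=25, rho=0.05, m=5)
```

With these inputs the MDE comes out at roughly 0.061, so a test of this size could not reliably detect a lift much below six percentage points. Raising `n` or lowering `sigma` shrinks it.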

Design Comparison

Design	Best For	Power	Cost	Complexity
Simple RCT	Quick tests, abundant data	Medium	Low	Low
Stratified RCT	Heterogeneous markets	High	Low	Medium
Stepped-Wedge	Phased rollouts	High	High	High
Staircase	Efficient phased rollouts	High	Medium	High

Data Collection Efficiency

Figure: Data collection efficiency.

The chart above shows how different designs compare in terms of data requirements:

Simple RCT: Requires continuous data for all units
Stepped-Wedge: Full data matrix needed
Staircase: Only collects data around transition periods (40-60% reduction)

Advanced Topics

1. Spillover Effects

Geographic experiments can suffer from spillover (treatment affecting control units):

Detection: Compare border vs. interior control units
Mitigation: Use buffer zones or larger geographic units
Analysis: Spatial econometric models

2. Heterogeneous Treatment Effects

Treatment effects may vary across segments:

Analysis: Interaction terms in regression models
Visualization: Subgroup forest plots
Interpretation: Consider operational feasibility of targeting

3. Multiple Testing Corrections

When testing multiple hypotheses:

Bonferroni Correction: Conservative, controls family-wise error rate
Benjamini-Hochberg: Less conservative, controls false discovery rate
Pre-registration: Specify primary vs. secondary outcomes
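Both corrections are short enough to sketch in plain Python. The function names and p-values below are illustrative and are not part of GeoStep's API.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a reject flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p-value
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank                                # largest rank passing its threshold
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.012, 0.025, 0.045, 0.20]   # hypothetical results for five KPIs
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

On these inputs Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg rejects the three smallest, which illustrates the difference in conservatism.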

4. Bayesian Methods

For advanced users, Bayesian approaches offer:

Prior Information: Incorporate historical data
Probabilistic Statements: “95% probability the lift is between 2-4%”
Decision Theory: Optimize decisions under uncertainty
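A minimal sketch of how such a probabilistic statement could be produced, assuming a normal prior on the lift and a normally distributed estimate (a conjugate update). This is one simple Bayesian model chosen for illustration, not necessarily what GeoStep implements, and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def posterior_lift(prior_mean, prior_sd, estimate, standard_error):
    """Conjugate normal update: combine a normal prior with a normal likelihood."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / standard_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return NormalDist(post_mean, math.sqrt(post_var))

# Prior from historical tests: 2% lift, SD 2%. Current test: 3% lift, SE 1%.
posterior = posterior_lift(prior_mean=0.02, prior_sd=0.02, estimate=0.03, standard_error=0.01)
prob_between = posterior.cdf(0.04) - posterior.cdf(0.02)   # P(2% <= lift <= 4%)
prob_positive = 1 - posterior.cdf(0.0)                     # P(lift > 0)
```

The posterior mean is a precision-weighted average of the prior and the estimate, so a tighter test result pulls it further from the historical value.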

Best Practices

1. Pre-Test Planning

Define primary KPI and success criteria
Run power analysis to determine sample size
Document analysis plan before seeing results

2. Randomization Quality

Check balance on observable covariates
Use stratification for important variables
Document randomization procedure
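One simple balance check is the standardized mean difference of a pre-period covariate between groups. The sketch below uses hypothetical baseline sales figures, and the 0.1 cut-off is a common rule of thumb rather than a GeoStep default.

```python
import math
import statistics

def standardized_mean_difference(treatment_values, control_values):
    """Difference in covariate means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((statistics.variance(treatment_values)
                           + statistics.variance(control_values)) / 2)
    return (statistics.mean(treatment_values) - statistics.mean(control_values)) / pooled_sd

# Hypothetical pre-period weekly sales per geo.
smd = standardized_mean_difference([120, 95, 140, 110], [118, 100, 135, 108])
balanced = abs(smd) < 0.1   # rule-of-thumb threshold, an assumption for this example
```

If a key covariate fails the check, re-randomizing within strata of that covariate is a common remedy.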

3. Data Quality

Monitor data completeness throughout test
Check for anomalies or data issues
Validate treatment delivery

4. Analysis Rigor

Run primary analysis as pre-specified
Conduct robustness checks (DiD, sensitivity)
Report all results, not just significant ones

5. Business Integration

Translate statistical significance to business impact
Consider practical significance vs. statistical significance
Document lessons learned for future tests

References

Grantham, K. L., et al. (2024). “The staircase cluster randomised trial design: A pragmatic alternative to the stepped wedge.” Statistical Methods in Medical Research.
Hussey, M. A., & Hughes, J. P. (2007). “Design and analysis of stepped wedge cluster randomized trials.” Contemporary Clinical Trials.
Vaver, J., & Koehler, J. (2011). “Measuring ad effectiveness using geo experiments.” Google Research.
Bojinov, I., & Shephard, N. (2019). “Time series experiments and causal estimands: exact randomization tests and trading.” Journal of the American Statistical Association.

Next Steps

Getting Started: Practical implementation guide
API Reference: Detailed function documentation
Business Guide: ROI and integration strategies