# Troubleshooting Guide

This guide covers common issues and solutions when using GeoStep for geographic randomized controlled trials.

## Common Issues & Solutions

### Data Preparation Issues

#### Problem: "KeyError: Column not found"
```text
# Error: KeyError: 'geo_id'
```

Solution: Ensure your data contains all required columns:

```python
from geostep.data_handler import validate_data

# Check required columns
required_cols = ['geo_id', 'date', 'sales']
validate_data(df, required_cols, date_col='date')
```
#### Problem: Date parsing errors
```text
# Error: ValueError: time data does not match format
```

Solution: Standardize the date format before analysis:

```python
import pandas as pd

# Convert to datetime with an explicit format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Or let pandas infer the format (in pandas >= 2.0, format inference
# is automatic and infer_datetime_format is deprecated)
df['date'] = pd.to_datetime(df['date'])
```
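When formats are mixed or some rows are malformed, `errors='coerce'` converts unparseable entries to `NaT` so you can locate and fix them instead of letting the whole conversion fail. A small illustrative sketch with made-up data:

```python
import pandas as pd

# Illustrative frame with one malformed date
df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-08', 'not-a-date'],
    'sales': [100, 120, 95],
})

# Coerce unparseable entries to NaT instead of raising
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Inspect the rows that failed to parse before dropping or repairing them
bad_rows = df[df['date'].isna()]
print(f"{len(bad_rows)} unparseable date(s)")
```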
#### Problem: Insufficient historical data

```text
# Error: Not enough pre-period data for power analysis
```

Solution: Ensure adequate historical data:

```python
# Minimum 8 weeks of pre-period data recommended
min_weeks = 8
date_range = df['date'].max() - df['date'].min()
if date_range.days < (min_weeks * 7):
    print(f"Warning: Only {date_range.days} days of data available")
```
### Randomization Issues

#### Problem: Unbalanced groups after randomization

```python
# Check balance
balance_check = df.groupby('assignment')['sales'].agg(['mean', 'std', 'count'])
print(balance_check)
```

Solution: Use stratified randomization for better balance:
```python
from geostep.designer import stratified_randomization

# Stratify by key variables
df_assigned = stratified_randomization(
    df,
    geo_col='geo_id',
    strat_vars=['avg_sales', 'population', 'region'],
    n_strata=4
)
```
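To see what stratification does under the hood, the same idea can be sketched in plain pandas: bin geos by a covariate with `pd.qcut`, then randomize within each bin. The `geostep` call above is the supported API; this is only an illustration with toy data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy geo-level frame (illustrative data)
geos = pd.DataFrame({
    'geo_id': [f'geo_{i}' for i in range(40)],
    'avg_sales': rng.normal(1000, 250, 40),
})

# Bin geos into quartiles of avg_sales, then split each bin 50/50 at random
geos['stratum'] = pd.qcut(geos['avg_sales'], q=4, labels=False)
geos['assignment'] = (
    geos.groupby('stratum')['geo_id']
        .transform(lambda s: rng.permutation(len(s)) % 2)
)

# Each stratum now contributes equally to treatment and control
print(geos.groupby('stratum')['assignment'].sum())
```

Because every stratum is split evenly, the groups end up balanced on `avg_sales` by construction rather than by luck.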
#### Problem: Too few geos per group

Solution: A minimum of 10-15 geos per group is recommended:

```python
group_sizes = df['assignment'].value_counts()
if group_sizes.min() < 10:
    print("Warning: Small group sizes may reduce statistical power")
```
### Power Analysis Issues

#### Problem: Low statistical power

Solutions:

1. Increase test duration:

```python
# Test longer periods
power_results = run_power_analysis(
    historical_data=data,
    test_weeks_list=[8, 10, 12, 16, 20]  # Longer tests
)
```

2. Use more sensitive metrics:

```python
# Use normalized or ratio metrics
df['sales_per_capita'] = df['sales'] / df['population']
```

3. Use stratified randomization:

```python
# Reduces variance between groups
df_assigned = stratified_randomization(df, ...)
```
#### Problem: Power analysis takes too long

Solution: Reduce the simulation count for initial exploration:

```python
# Quick power analysis
power_results = run_power_analysis(
    historical_data=data,
    n_sims=100  # Reduce from the default of 500
)
```
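To build intuition for what `n_sims` trades off, here is a minimal Monte Carlo power sketch (not GeoStep's implementation; the effect size, noise level, and simple z-test are assumptions for illustration). Each simulation draws control and treated geo outcomes with an injected lift and counts how often the difference is detected:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_power(n_geos=15, weeks=8, lift=0.05, n_sims=300, z_crit=1.96):
    """Fraction of simulated tests in which a simple z-test detects the lift."""
    hits = 0
    for _ in range(n_sims):
        # Per-geo KPI averaged over the test window (synthetic weekly noise)
        control = rng.normal(1000, 150, size=(n_geos, weeks)).mean(axis=1)
        treated = rng.normal(1000 * (1 + lift), 150, size=(n_geos, weeks)).mean(axis=1)
        se = np.sqrt(control.var(ddof=1) / n_geos + treated.var(ddof=1) / n_geos)
        z = (treated.mean() - control.mean()) / se
        hits += abs(z) > z_crit
    return hits / n_sims

power = estimate_power()
```

Fewer simulations run faster but make the power estimate itself noisier, which is why a reduced `n_sims` is best kept to early exploration.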
### Analysis Issues

#### Problem: Non-significant results despite expected effect

Diagnostic steps:

1. Check data quality:

```python
# Look for outliers
df.boxplot(column='sales', by='assignment')

# Check for missing data
print(df.isnull().sum())
```

2. Verify test execution:

```python
# Confirm treatment was actually applied
treatment_geos = df[df['assignment'] == 1]['geo_id'].unique()
print(f"Treatment applied to: {len(treatment_geos)} geos")
```

3. Check for contamination:

```python
# Look for spillover effects in control geos
control_performance = df[df['assignment'] == 0].groupby('date')['sales'].mean()
control_performance.plot(title='Control Group Performance Over Time')
```
#### Problem: Unrealistic lift estimates

Solution: Check for data issues:

```python
# Look for extreme values
lift_stats = analysis_results['lift_index'].describe()
print(lift_stats)

# Remove outliers if necessary (IQR rule)
Q1 = analysis_results['lift_index'].quantile(0.25)
Q3 = analysis_results['lift_index'].quantile(0.75)
IQR = Q3 - Q1
outlier_threshold = Q3 + 1.5 * IQR
filtered = analysis_results[analysis_results['lift_index'] <= outlier_threshold]
```
### Visualization Issues

#### Problem: Plots not displaying

Solution: Ensure a proper matplotlib backend:

```python
import matplotlib
matplotlib.use('Agg')  # For headless environments
import matplotlib.pyplot as plt

# Or, in a Jupyter notebook, use the inline backend:
# %matplotlib inline
```
#### Problem: Poor plot quality

Solution: Customize plot settings:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Increase DPI for better quality
plt.rcParams['figure.dpi'] = 300
```
## Performance Optimization

### Large Dataset Handling

```python
# For datasets with more than 1M rows
import pandas as pd

# Use efficient data types
df['geo_id'] = df['geo_id'].astype('category')
df['date'] = pd.to_datetime(df['date'])

# Sample for power analysis if needed
sample_data = df.sample(n=100000, random_state=42)
```
### Parallel Processing

```python
# Power analysis uses all CPU cores by default.
# To limit CPU usage, set the thread count before the library is imported:
import os
os.environ['NUMBA_NUM_THREADS'] = '4'  # Limit to 4 cores
```
## Data Quality Checklist

Before running any analysis, verify:

- Date column is in datetime format
- Geographic identifiers are unique and consistent
- KPI column contains numeric values only
- No missing values in key columns during the test period
- Sufficient historical data (minimum 8 weeks recommended)
- Balanced group sizes (minimum 10 geos per group)
- Consistent data frequency (weekly or daily throughout)
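The checklist can also be run programmatically. A small helper along these lines (illustrative only; column names assumed to match the examples in this guide) returns one boolean per item:

```python
import pandas as pd

def run_quality_checks(df, geo_col='geo_id', date_col='date', kpi_col='sales',
                       min_weeks=8, min_geos_per_group=10):
    """Map each checklist item to a pass/fail boolean."""
    checks = {}
    checks['date_is_datetime'] = pd.api.types.is_datetime64_any_dtype(df[date_col])
    checks['kpi_is_numeric'] = pd.api.types.is_numeric_dtype(df[kpi_col])
    checks['no_missing_keys'] = not df[[geo_col, date_col, kpi_col]].isnull().any().any()
    span_days = (df[date_col].max() - df[date_col].min()).days
    checks['enough_history'] = span_days >= min_weeks * 7
    if 'assignment' in df.columns:
        checks['balanced_groups'] = (
            df.groupby('assignment')[geo_col].nunique().min() >= min_geos_per_group
        )
    return checks
```

Running it before each analysis turns the manual checklist into a repeatable gate, and any `False` entry points directly at the section above that covers the fix.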
## Getting Help

If you encounter issues not covered here:

1. Check data structure: Use `df.info()` and `df.describe()`
2. Validate inputs: Use the built-in validation functions
3. Start simple: Test with smaller datasets first
4. Review methodology: Ensure the experimental design is appropriate
5. Contact support: Provide error messages and data structure details
## Best Practices for Success

1. Plan ahead: Run power analysis before launching experiments
2. Monitor closely: Check balance and data quality regularly
3. Document everything: Keep detailed records of the experimental setup
4. Test incrementally: Start with pilot experiments
5. Validate results: Use multiple analysis methods for robustness