# Troubleshooting Guide

This guide covers common issues and solutions when using GeoStep for geographic randomized controlled trials.

## 🚨 **Common Issues & Solutions**

### **Data Preparation Issues**

#### **Problem: "KeyError: Column not found"**
```python
# Error: KeyError: 'geo_id'
```
**Solution**: Ensure your data contains all required columns:
```python
from geostep.data_handler import validate_data

# Check required columns
required_cols = ['geo_id', 'date', 'sales']
validate_data(df, required_cols, date_col='date')
```
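
If you want to see exactly which columns are missing before calling the validator, a plain pandas check is enough. This is a minimal sketch: `find_missing_columns` is a hypothetical helper written for this guide, not part of GeoStep, and the example data is made up.

```python
import pandas as pd

def find_missing_columns(df: pd.DataFrame, required_cols: list[str]) -> list[str]:
    """Return the required columns that are absent from df, in the order given."""
    return [col for col in required_cols if col not in df.columns]

# Illustrative data: the 'sales' column is deliberately missing
example = pd.DataFrame({'geo_id': ['A', 'B'], 'date': ['2024-01-01', '2024-01-01']})
missing = find_missing_columns(example, ['geo_id', 'date', 'sales'])
print(f"Missing columns: {missing}")
```

A common cause is a naming mismatch (for example `Geo_ID` or a trailing space), so comparing the reported names against `df.columns.tolist()` usually finds the problem quickly.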

#### **Problem: Date parsing errors**
```python
# Error: ValueError: time data does not match format
```
**Solution**: Standardize date format before analysis:
```python
import pandas as pd

# Parse with an explicit format (strictest option)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Alternatively, omit `format` and let pandas infer it
# (`infer_datetime_format` is deprecated as of pandas 2.0)
# df['date'] = pd.to_datetime(df['date'])
```

#### **Problem: Insufficient historical data**
```python
# Error: Not enough pre-period data for power analysis
```
**Solution**: Ensure adequate historical data:
```python
# Minimum 8 weeks of pre-period data recommended
min_weeks = 8
date_range = df['date'].max() - df['date'].min()
if date_range.days < (min_weeks * 7):
    print(f"Warning: Only {date_range.days} days of data available")
```
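
The check above looks at the overall date range, so it can pass even when individual geos entered the data late. A per-geo version is sketched below. `geos_with_short_history` is a hypothetical helper, not a GeoStep function, and it assumes the date column is already in datetime format.

```python
import pandas as pd

def geos_with_short_history(df, geo_col='geo_id', date_col='date', min_weeks=8):
    """Return the history length in days for every geo below min_weeks."""
    grouped = df.groupby(geo_col)[date_col]
    span_days = (grouped.max() - grouped.min()).dt.days  # first to last observation
    return span_days[span_days < min_weeks * 7]

# Illustrative data: geo A has 10 weekly observations, geo B only 3
dates_a = pd.date_range('2024-01-01', periods=10, freq='7D')
dates_b = pd.date_range('2024-01-01', periods=3, freq='7D')
example = pd.DataFrame({
    'geo_id': ['A'] * 10 + ['B'] * 3,
    'date': list(dates_a) + list(dates_b),
})
short = geos_with_short_history(example)
print(short)
```

Geos returned by this check are candidates for exclusion from the experiment, or for a later start date once they have enough history.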

### **Randomization Issues**

#### **Problem: Unbalanced groups after randomization**
```python
# Check balance
balance_check = df.groupby('assignment')['sales'].agg(['mean', 'std', 'count'])
print(balance_check)
```
**Solution**: Use stratified randomization for better balance:
```python
from geostep.designer import stratified_randomization

# Stratify by key variables
df_assigned = stratified_randomization(
    df, 
    geo_col='geo_id',
    strat_vars=['avg_sales', 'population', 'region'],
    n_strata=4
)
```
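
To compare balance before and after stratifying, the group means and standard deviations from the balance check can be condensed into a single number, the standardized mean difference (SMD). An absolute SMD below roughly 0.1 is an often-cited rule of thumb for good balance. The sketch below is illustrative only: `standardized_mean_difference` is a hypothetical helper, and the 0/1 coding of `assignment` is assumed from the other examples in this guide.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df, value_col='sales', group_col='assignment'):
    """SMD = (treatment mean - control mean) / pooled standard deviation."""
    treat = df.loc[df[group_col] == 1, value_col]
    ctrl = df.loc[df[group_col] == 0, value_col]
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    return (treat.mean() - ctrl.mean()) / pooled_sd

# Illustrative geo-level data: one row per geo with its pre-period average sales
example = pd.DataFrame({
    'assignment': [1, 1, 1, 0, 0, 0],
    'sales': [10.0, 12.0, 14.0, 9.0, 11.0, 13.0],
})
smd = standardized_mean_difference(example)
print(f"SMD: {smd:.2f}")
```

Compute the SMD on one row per geo (for example, pre-period averages) rather than on the full panel, so that geos with more dates do not dominate the comparison.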

#### **Problem: Too few geos per group**
**Solution**: Aim for at least 10 to 15 geos per group. Count distinct geos rather than rows, since each geo contributes one row per date:
```python
group_sizes = df.groupby('assignment')['geo_id'].nunique()
if group_sizes.min() < 10:
    print("Warning: Small group sizes may reduce statistical power")
```

### **Power Analysis Issues**

#### **Problem: Low statistical power**
**Solutions**:
1. **Increase test duration**:
```python
# Test longer periods
power_results = run_power_analysis(
    historical_data=data,
    test_weeks_list=[8, 10, 12, 16, 20]  # Longer tests
)
```

2. **Use more sensitive metrics**:
```python
# Use normalized or ratio metrics
df['sales_per_capita'] = df['sales'] / df['population']
```

3. **Stratified randomization**:
```python
# Reduces variance between groups
df_assigned = stratified_randomization(df, ...)
```

#### **Problem: Power analysis takes too long**
**Solution**: Reduce simulation count for initial exploration:
```python
# Quick power analysis
power_results = run_power_analysis(
    historical_data=data,
    n_sims=100  # Reduce from default 500
)
```
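
Fewer simulations make the power estimate noisier. If each simulation is treated as an independent detect/miss draw (an assumption about how the simulations behave, not something taken from the GeoStep internals), the standard error of an estimated power `p` from `n` simulations is `sqrt(p * (1 - p) / n)`. The sketch below uses a hypothetical helper, `power_standard_error`, to show the trade-off.

```python
import math

def power_standard_error(power: float, n_sims: int) -> float:
    """Binomial standard error of a simulated power estimate."""
    return math.sqrt(power * (1 - power) / n_sims)

# Compare the precision of a power estimate of 0.80 at two simulation counts
for n_sims in (100, 500):
    se = power_standard_error(0.8, n_sims)
    print(f"n_sims={n_sims}: power 0.80 +/- {1.96 * se:.3f} (approx. 95% interval)")
```

At a power of 0.80, 100 simulations give an interval of roughly ±0.08, while 500 give roughly ±0.035. Use the low count to narrow down candidate designs, then rerun the shortlisted ones with the default.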

### **Analysis Issues**

#### **Problem: Non-significant results despite expected effect**
**Diagnostic steps**:
1. **Check data quality**:
```python
# Look for outliers
df.boxplot(column='sales', by='assignment')

# Check for missing data
print(df.isnull().sum())
```

2. **Verify test execution**:
```python
# List the geos assigned to treatment, then cross-check them against campaign delivery records
treatment_geos = df[df['assignment'] == 1]['geo_id'].unique()
print(f"Treatment applied to: {len(treatment_geos)} geos")
```

3. **Check for contamination**:
```python
# Look for spillover effects in control geos
control_performance = df[df['assignment'] == 0].groupby('date')['sales'].mean()
control_performance.plot(title='Control Group Performance Over Time')
```

#### **Problem: Unrealistic lift estimates**
**Solution**: Check for data issues:
```python
# Look for extreme values
lift_stats = analysis_results['lift_index'].describe()
print(lift_stats)

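# Flag outliers in both tails with the 1.5 * IQR rule
lift = analysis_results['lift_index']
Q1 = lift.quantile(0.25)
Q3 = lift.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = lift[~lift.between(lower, upper)]
print(f"{len(outliers)} outlier values; inspect them before removing any")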
```

### **Visualization Issues**

#### **Problem: Plots not displaying**
**Solution**: Ensure proper matplotlib backend:
```python
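import matplotlib
matplotlib.use('Agg')  # Non-interactive backend for headless environments
import matplotlib.pyplot as plt

# With Agg, plt.show() displays nothing; save figures with plt.savefig('plot.png') instead

# In Jupyter, skip matplotlib.use('Agg') and run this magic on its own line:
# %matplotlib inline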
```

#### **Problem: Poor plot quality**
**Solution**: Customize plot settings:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Increase DPI for sharper output
plt.rcParams['figure.dpi'] = 300   # On-screen rendering
plt.rcParams['savefig.dpi'] = 300  # Files written by plt.savefig()
```

## 🔧 **Performance Optimization**

### **Large Dataset Handling**
```python
# For datasets > 1M rows
import pandas as pd

# Use efficient data types
df['geo_id'] = df['geo_id'].astype('category')
df['date'] = pd.to_datetime(df['date'])

# Sample for power analysis if needed
sample_data = df.sample(n=100000, random_state=42)
```
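
Sampling rows at random leaves gaps in each geo's time series, which can distort an analysis that relies on complete pre-period histories. One alternative is to sample whole geos so that every retained geo keeps all of its dates. This is a minimal sketch: `sample_geos` is a hypothetical helper, not a GeoStep function.

```python
import numpy as np
import pandas as pd

def sample_geos(df, n_geos, geo_col='geo_id', seed=42):
    """Keep every row for a random subset of geos."""
    rng = np.random.default_rng(seed)
    geos = df[geo_col].drop_duplicates().to_numpy()
    n_keep = min(n_geos, len(geos))  # never request more geos than exist
    chosen = rng.choice(geos, size=n_keep, replace=False)
    return df[df[geo_col].isin(chosen)]

# Illustrative panel: 4 geos with 3 weekly observations each
example = pd.DataFrame({
    'geo_id': np.repeat(['A', 'B', 'C', 'D'], 3),
    'date': list(pd.date_range('2024-01-01', periods=3, freq='7D')) * 4,
    'sales': np.arange(12.0),
})
sample = sample_geos(example, n_geos=2)
print(sample['geo_id'].nunique(), "geos,", len(sample), "rows")
```

Power estimated on a subset of geos reflects a smaller experiment than the full one, so treat it as a rough guide rather than a final number.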

### **Parallel Processing**
```python
# Power analysis uses all CPU cores by default
# To limit CPU usage, set this before importing GeoStep
# (Numba reads the variable when it is first imported):
import os
os.environ['NUMBA_NUM_THREADS'] = '4'  # Limit to 4 threads
```

## 📊 **Data Quality Checklist**

Before running any analysis, verify:

- [ ] **Date column** is in datetime format
- [ ] **Geographic identifiers** are unique and consistent
- [ ] **KPI column** contains numeric values only
- [ ] **No missing values** in key columns during test period
- [ ] **Sufficient historical data** (minimum 8 weeks recommended)
- [ ] **Balanced group sizes** (minimum 10 geos per group)
- [ ] **Consistent data frequency** (weekly/daily throughout)
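
Several of these checks can be automated with plain pandas. The sketch below covers the datetime, numeric, missing-value, and duplicate checks only; `run_quality_checks` is a hypothetical helper, and the default column names are taken from the examples in this guide.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def run_quality_checks(df, geo_col='geo_id', date_col='date', kpi_col='sales'):
    """Return a dict mapping each check name to True (passed) or False (failed)."""
    key_cols = [geo_col, date_col, kpi_col]
    return {
        'date_is_datetime': is_datetime64_any_dtype(df[date_col]),
        'kpi_is_numeric': is_numeric_dtype(df[kpi_col]),
        'no_missing_values': not df[key_cols].isnull().any().any(),
        'no_duplicate_geo_dates': not df.duplicated(subset=[geo_col, date_col]).any(),
    }

# Illustrative data with one missing KPI value
example = pd.DataFrame({
    'geo_id': ['A', 'A', 'B'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-08', '2024-01-01']),
    'sales': [100.0, None, 90.0],
})
results = run_quality_checks(example)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

History length, group sizes, and data frequency still need to be checked separately.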

## 🆘 **Getting Help**

If you encounter issues not covered here:

1. **Check data structure**: Use `df.info()` and `df.describe()`
2. **Validate inputs**: Run the built-in validation functions, such as `validate_data` from `geostep.data_handler`
3. **Start simple**: Test with smaller datasets first
4. **Review methodology**: Ensure experimental design is appropriate
5. **Contact support**: Provide error messages and data structure details

## 📈 **Best Practices for Success**

1. **Plan ahead**: Run power analysis before launching experiments
2. **Monitor closely**: Check balance and data quality regularly
3. **Document everything**: Keep detailed records of experimental setup
4. **Test incrementally**: Start with pilot experiments
5. **Validate results**: Use multiple analysis methods for robustness