🐍 Python Stats Cheatsheet


Copy-paste ready Python examples for every statistical test

Your Python Stats Survival Kit 🎉

Every example below is 100% copy-paste ready. Just plug in your data and run! Each code block includes all imports, calculations, and interpretations.

No PhD required. No complicated setup. Just practical Python code for real marketing analysis.

⚙️

One-Time Setup

Install these packages once (skip if already installed):

pip install numpy scipy pandas statsmodels matplotlib

✓ All examples below use these packages

📊

T-Test

"Comparing two groups (like A/B testing)"

When to use:

✓ Comparing averages of 2 groups
✓ A/B testing conversion rates, email open rates, etc.
✓ Before/after comparisons

Copy-Paste Example: Independent T-Test

import numpy as np
from scipy import stats

# Your data: conversion rates (%) for two campaigns
campaign_a = np.array([5.2, 6.1, 5.8, 6.3, 5.5, 6.0, 5.7])
campaign_b = np.array([4.8, 5.2, 5.0, 5.5, 4.9, 5.1, 5.3])

# Run the test
t_stat, p_value = stats.ttest_ind(campaign_a, campaign_b)

# Calculate Cohen's d (effect size)
mean_diff = np.mean(campaign_a) - np.mean(campaign_b)
pooled_std = np.sqrt((np.var(campaign_a, ddof=1) + np.var(campaign_b, ddof=1)) / 2)
cohens_d = mean_diff / pooled_std

# Results
print(f"Campaign A mean: {np.mean(campaign_a):.2f}%")
print(f"Campaign B mean: {np.mean(campaign_b):.2f}%")
print(f"P-value: {p_value:.4f}")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

How to interpret:

  • P-value < 0.05? Campaigns are significantly different
  • Cohen's d: 0.2=small, 0.5=medium, 0.8=large effect
  • Means: Which campaign performed better?
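The example above covers two independent groups. For the before/after case in the checklist, the paired version `stats.ttest_rel()` is the right tool, since each "after" value is matched to its own "before" value. A minimal sketch with hypothetical before/after open rates:

```python
import numpy as np
from scipy import stats

# Hypothetical data: open rates (%) for the same 7 lists, before and after a redesign
before = np.array([4.1, 5.0, 4.6, 5.2, 4.4, 4.9, 4.7])
after = np.array([4.8, 5.6, 5.1, 5.9, 4.9, 5.5, 5.2])

# Paired t-test: tests whether the mean of the per-list differences is zero
t_stat, p_value = stats.ttest_rel(before, after)

print(f"Mean change: {np.mean(after - before):+.2f} pp")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")
```

Pairing removes list-to-list variation from the comparison, so it usually has more power than `ttest_ind()` on the same numbers.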
📈

ANOVA

"Comparing 3+ groups at once"

When to use:

✓ Testing 3+ variations simultaneously
✓ Comparing multiple email subject lines
✓ Analyzing performance across regions

Copy-Paste Example: One-Way ANOVA

import numpy as np
from scipy import stats

# Your data: open rates (%) for three subject line types
questions = np.array([12.5, 15.3, 13.8, 14.2, 16.1])
urgency = np.array([18.2, 19.5, 17.8, 20.1, 18.9])
personalized = np.array([14.5, 15.8, 13.9, 16.2, 14.7])

# Run ANOVA
f_stat, p_value = stats.f_oneway(questions, urgency, personalized)

# Calculate eta-squared (effect size)
all_data = np.concatenate([questions, urgency, personalized])
grand_mean = np.mean(all_data)
ssb = len(questions) * (np.mean(questions) - grand_mean)**2 + \
      len(urgency) * (np.mean(urgency) - grand_mean)**2 + \
      len(personalized) * (np.mean(personalized) - grand_mean)**2
sst = np.sum((all_data - grand_mean)**2)
eta_squared = ssb / sst

# Results
print(f"Questions: {np.mean(questions):.1f}%")
print(f"Urgency: {np.mean(urgency):.1f}%")
print(f"Personalized: {np.mean(personalized):.1f}%")
print(f"\nF-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Eta-squared: {eta_squared:.3f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

How to interpret:

  • P-value < 0.05? At least one group differs (but doesn't say which!)
  • Eta-squared: % of variance explained by groups
  • Look at means: Which group is highest/lowest?
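Classic one-way ANOVA also assumes the groups have roughly equal variances. A quick pre-flight check with Levene's test, sketched on the same subject-line data:

```python
import numpy as np
from scipy import stats

# Same hypothetical open-rate data as the ANOVA example
questions = np.array([12.5, 15.3, 13.8, 14.2, 16.1])
urgency = np.array([18.2, 19.5, 17.8, 20.1, 18.9])
personalized = np.array([14.5, 15.8, 13.9, 16.2, 14.7])

# Levene's test: H0 = all groups have equal variances
stat, p_value = stats.levene(questions, urgency, personalized)

print(f"Levene p-value: {p_value:.4f}")
if p_value < 0.05:
    print("⚠ Unequal variances - consider Kruskal-Wallis instead")
else:
    print("✓ Equal-variance assumption looks OK")
```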
🎲

Chi-Square

"Testing if categories are related"

When to use:

✓ Testing if age affects product preference
✓ Checking if day of week affects purchases
✓ Analyzing survey responses by demographics

Copy-Paste Example: Test of Independence

import numpy as np
from scipy import stats

# Your data: Age group vs Product preference
# Rows: 18-29, 30-44, 45-60  |  Columns: Product A, B, C
observed = np.array([
    [45, 30, 25],  # 18-29
    [35, 50, 30],  # 30-44
    [20, 40, 55]   # 45-60
])

# Run Chi-Square test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Calculate Cramér's V (effect size)
n = np.sum(observed)
min_dim = min(observed.shape[0] - 1, observed.shape[1] - 1)
cramers_v = np.sqrt(chi2 / (n * min_dim))

# Results
print("Observed frequencies:")
print(observed)
print(f"\nChi-square: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Cramér's V: {cramers_v:.3f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

# Interpret effect size
if cramers_v < 0.1:
    strength = "weak"
elif cramers_v < 0.3:
    strength = "moderate"
else:
    strength = "strong"
print(f"Association: {strength}")

How to interpret:

  • P-value < 0.05? Variables are related
  • Cramér's V: 0=no relationship, 1=perfect relationship
  • Check table: Which cells are way higher/lower than expected?
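`chi2_contingency` already returns the expected table, so the "which cells?" question can be answered with Pearson residuals. A sketch on the same data; as a rule of thumb, cells with |residual| above about 2 are the ones driving the association:

```python
import numpy as np
from scipy import stats

# Same hypothetical age-by-product table as above
observed = np.array([
    [45, 30, 25],  # 18-29
    [35, 50, 30],  # 30-44
    [20, 40, 55]   # 45-60
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Pearson residuals: (observed - expected) / sqrt(expected)
residuals = (observed - expected) / np.sqrt(expected)

print("Pearson residuals (|value| > 2 = notable):")
print(np.round(residuals, 2))
```

Positive residuals mean the cell is over-represented relative to independence; negative means under-represented.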
📈

Correlation

"How two things move together"

When to use:

✓ Is ad spend related to sales?
✓ Does email length affect open rate?
✓ Measuring any linear relationship

Copy-Paste Example: Pearson Correlation

import numpy as np
from scipy import stats

# Your data
ad_spend = np.array([10, 15, 12, 18, 20, 14, 16, 22, 19, 17])  # $1000s
sales = np.array([50, 75, 60, 90, 100, 70, 80, 110, 95, 85])   # $1000s

# Calculate correlation
r, p_value = stats.pearsonr(ad_spend, sales)
r_squared = r ** 2

# Results
print(f"Correlation (r): {r:.3f}")
print(f"R-squared: {r_squared:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

# Interpret strength
if abs(r) < 0.3:
    strength = "weak"
elif abs(r) < 0.7:
    strength = "moderate"
else:
    strength = "strong"
direction = "positive" if r > 0 else "negative"
print(f"\n{strength.capitalize()} {direction} correlation")

How to interpret:

  • r = 1: Perfect positive relationship
  • r = 0: No relationship
  • r = -1: Perfect negative relationship
  • Remember: Correlation ≠ Causation!
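Pearson's r only measures linear relationships. If the data has outliers or a curved-but-monotonic trend, Spearman's rank correlation is a drop-in alternative, sketched on the same numbers (this particular sample happens to be perfectly monotonic, so rho comes out at exactly 1):

```python
import numpy as np
from scipy import stats

# Same hypothetical data as the Pearson example
ad_spend = np.array([10, 15, 12, 18, 20, 14, 16, 22, 19, 17])  # $1000s
sales = np.array([50, 75, 60, 90, 100, 70, 80, 110, 95, 85])   # $1000s

# Spearman: correlation of ranks - robust to outliers and curved trends
rho, p_value = stats.spearmanr(ad_spend, sales)

print(f"Spearman rho: {rho:.3f}")  # 1.000 for this perfectly monotonic sample
print(f"P-value: {p_value:.4f}")
```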
📉

Regression

"Predicting one variable from another"

When to use:

✓ Predicting sales from ad spend
✓ Forecasting clicks from email length
✓ Getting a prediction equation

Copy-Paste Example: Simple Linear Regression

import numpy as np
from scipy import stats

# Your data
ad_spend = np.array([10, 15, 12, 18, 20, 14, 16, 22, 19, 17])
sales = np.array([50, 75, 60, 90, 100, 70, 80, 110, 95, 85])

# Run regression
slope, intercept, r_value, p_value, std_err = stats.linregress(ad_spend, sales)

# Results
print(f"Equation: Sales = {intercept:.2f} + {slope:.2f} × Ad Spend")
print(f"R-squared: {r_value**2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

# Make predictions
new_ad_spend = 25
predicted_sales = slope * new_ad_spend + intercept
print(f"\nPrediction: ${new_ad_spend}k ad → ${predicted_sales:.2f}k sales")

How to interpret:

  • Slope: For every $1k more in ad spend, sales increase by $[slope]k
  • Intercept: Predicted sales when ad spend is zero
  • R-squared: How well the line fits
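`linregress` also returns the slope's standard error, which gives a confidence interval almost for free. A sketch using the same numbers with a little noise added (the original sample is perfectly linear, so its standard error would be zero):

```python
import numpy as np
from scipy import stats

# Hypothetical data: same x values, slightly noisy y values
ad_spend = np.array([10, 15, 12, 18, 20, 14, 16, 22, 19, 17])
sales = np.array([52, 74, 61, 89, 103, 68, 81, 108, 94, 86])

slope, intercept, r_value, p_value, std_err = stats.linregress(ad_spend, sales)

# 95% CI for the slope: slope ± t_crit * standard error, with df = n - 2
n = len(ad_spend)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (slope - t_crit * std_err, slope + t_crit * std_err)

print(f"Slope: {slope:.2f} (95% CI: {ci[0]:.2f} to {ci[1]:.2f})")
```

If the interval excludes zero, the slope is significant at the 5% level, which matches the p-value check.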
🎯

Mann-Whitney

"T-test's outlier-proof cousin"

When to use:

✓ Comparing 2 groups with skewed data
✓ Data has outliers
✓ Small sample sizes

Copy-Paste Example: Mann-Whitney U Test

import numpy as np
from scipy import stats

# Your data: session duration (minutes) - note the outlier!
mobile = np.array([2.3, 3.1, 2.8, 3.5, 2.5, 3.0, 15.5])  # outlier!
desktop = np.array([4.5, 5.2, 4.8, 5.0, 4.3, 5.5, 4.7])

# Run Mann-Whitney test
u_stat, p_value = stats.mannwhitneyu(mobile, desktop, alternative='two-sided')

# Results (use medians, not means!)
print(f"Mobile median: {np.median(mobile):.2f} min")
print(f"Desktop median: {np.median(desktop):.2f} min")
print(f"U-statistic: {u_stat:.1f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

How to interpret:

  • P-value < 0.05? Groups differ
  • Compare medians: Which group is typically higher?
  • Why use this? Outliers don't mess up the results
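Mann-Whitney handles two independent groups. For paired measurements with outliers, the matching tool is the Wilcoxon signed-rank test. A sketch with hypothetical before/after task times:

```python
import numpy as np
from scipy import stats

# Hypothetical paired data: task time (min) before/after a redesign - note the outlier!
before = np.array([5.1, 6.3, 4.8, 7.2, 5.5, 6.0, 19.4])
after = np.array([4.6, 5.6, 3.9, 6.1, 4.2, 4.5, 15.0])

# Wilcoxon signed-rank: paired, outlier-resistant counterpart of the t-test
w_stat, p_value = stats.wilcoxon(before, after)

print(f"Median before: {np.median(before):.1f} min")
print(f"Median after: {np.median(after):.1f} min")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")
```

All seven differences favor "after" here, so the test comes out significant even with only 7 pairs.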
🎲

Proportions (Z-Test)

"Comparing percentages and rates"

When to use:

✓ Comparing CTR, conversion rates
✓ Success rates between groups
✓ Large samples (n > 30)

Copy-Paste Example: Compare Two Proportions

import numpy as np
from scipy import stats

# Your data
n1 = 1000         # Landing page A visitors
conversions1 = 52 # Conversions from A
p1 = conversions1 / n1

n2 = 950          # Landing page B visitors
conversions2 = 68 # Conversions from B
p2 = conversions2 / n2

# Calculate pooled proportion
pooled_p = (conversions1 + conversions2) / (n1 + n2)

# Standard error and z-statistic
se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
z_stat = (p1 - p2) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

# Results
print(f"Page A: {p1*100:.2f}% conversion")
print(f"Page B: {p2*100:.2f}% conversion")
print(f"Difference: {(p1-p2)*100:+.2f} pp")
print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")

How to interpret:

  • P-value < 0.05? Conversion rates differ
  • Difference: How many percentage points better?
  • Watch out: With huge samples, even tiny differences can be "significant"
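The manual calculation above can be cross-checked with statsmodels, which wraps the same pooled two-proportion z-test in a single call:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Same numbers as the manual example: conversions and visitors per page
count = np.array([52, 68])    # conversions (A, B)
nobs = np.array([1000, 950])  # visitors (A, B)

# Pooled two-proportion z-test, two-sided by default
z_stat, p_value = proportions_ztest(count, nobs)

print(f"Z-statistic: {z_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant? {'YES ✓' if p_value < 0.05 else 'No ✗'}")
```

The z-statistic and p-value should match the manual version to rounding.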
🎰

Bayesian A/B Testing

"Probability that B beats A"

When to use:

✓ Want to know "how likely is B better?"
✓ A/B testing with continuous monitoring
✓ More intuitive than p-values

Copy-Paste Example: Bayesian A/B Test

import numpy as np
from scipy import stats

# Your data
n_a = 1000
conversions_a = 52
n_b = 950
conversions_b = 68

# Bayesian update: Beta(1,1) prior + data
alpha_a = 1 + conversions_a
beta_a = 1 + (n_a - conversions_a)
alpha_b = 1 + conversions_b
beta_b = 1 + (n_b - conversions_b)

# Posterior means
mean_a = alpha_a / (alpha_a + beta_a)
mean_b = alpha_b / (alpha_b + beta_b)

# 95% credible intervals
ci_a = (stats.beta.ppf(0.025, alpha_a, beta_a),
        stats.beta.ppf(0.975, alpha_a, beta_a))
ci_b = (stats.beta.ppf(0.025, alpha_b, beta_b),
        stats.beta.ppf(0.975, alpha_b, beta_b))

# Monte Carlo: P(B > A)
np.random.seed(42)
samples_a = stats.beta.rvs(alpha_a, beta_a, size=100000)
samples_b = stats.beta.rvs(alpha_b, beta_b, size=100000)
prob_b_better = np.mean(samples_b > samples_a)

# Results
print(f"A: {mean_a*100:.2f}% (95% CI: {ci_a[0]*100:.2f}%-{ci_a[1]*100:.2f}%)")
print(f"B: {mean_b*100:.2f}% (95% CI: {ci_b[0]*100:.2f}%-{ci_b[1]*100:.2f}%)")
print(f"\nP(B > A) = {prob_b_better*100:.1f}%")
print(f"P(A > B) = {(1-prob_b_better)*100:.1f}%")

# Decision
if prob_b_better > 0.95:
    print("\n✓ Strong evidence for B!")
elif prob_b_better < 0.05:
    print("\n✓ Strong evidence for A!")
else:
    print("\n⚠ Inconclusive - collect more data")

How to interpret:

  • P(B > A) > 95%? Very confident B is better
  • Credible interval: Range where true rate likely lies
  • Advantage: Get direct probability of which is better
📊

Multiple Regression

Predict an outcome using multiple predictor variables at once

💡

Predict monthly sales from ad spend, seasonality, and price

# === MULTIPLE REGRESSION ===
# Predict sales from ad_spend, season, and price

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Sample data: Monthly sales data
data = pd.DataFrame({
    'sales': [120, 145, 132, 158, 125, 169, 140, 175, 138, 162],
    'ad_spend': [10, 15, 11, 18, 12, 20, 14, 22, 13, 19],  # $1000s
    'season': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # 0=off, 1=peak
    'price': [50, 48, 52, 47, 51, 46, 49, 45, 50, 47]  # in dollars
})

# Prepare predictors (X) and outcome (y)
X = data[['ad_spend', 'season', 'price']]
y = data['sales']

# === METHOD 1: sklearn (quick) ===
model = LinearRegression()
model.fit(X, y)

print("=== REGRESSION RESULTS ===")
print(f"R-squared: {model.score(X, y):.3f}")
print(f"\nCoefficients:")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef:.2f}")
print(f"  intercept: {model.intercept_:.2f}")

# === METHOD 2: statsmodels (detailed stats) ===
X_with_intercept = sm.add_constant(X)  # Add intercept
model_sm = sm.OLS(y, X_with_intercept).fit()

print("\n=== DETAILED STATISTICS ===")
print(model_sm.summary())

# === Check for multicollinearity (VIF) ===
print("\n=== MULTICOLLINEARITY CHECK (VIF) ===")
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
print("\n⚠ VIF > 10 indicates multicollinearity problem")

# === Make predictions ===
new_data = pd.DataFrame({
    'ad_spend': [16],
    'season': [1],
    'price': [48]
})
prediction = model.predict(new_data)
print(f"\nPredicted sales: ${prediction[0]:.0f}k")

How to interpret:

  • Coefficients: Change in outcome per unit increase (holding others constant)
  • RΒ²: % of variance explained (higher = better fit)
  • P-values: Which predictors are statistically significant
  • VIF < 5: Good, VIF > 10: Multicollinearity problem
🔍

Post-Hoc Tests

Find which specific groups differ after significant ANOVA

💡

After ANOVA shows email types differ, find which ones specifically

# === POST-HOC TESTS (Tukey HSD & Bonferroni) ===
# After ANOVA, find which groups differ

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison

# Sample data: Email open rates for 4 subject line types
questions = [12.5, 14.2, 13.8, 15.1, 14.5]  # %
urgency = [18.3, 19.5, 18.9, 19.8, 20.2]
personalized = [14.8, 15.5, 14.9, 15.8, 15.2]
emoji = [16.9, 17.5, 17.2, 18.1, 16.8]

# Combine into long format for post-hoc tests
data = pd.DataFrame({
    'open_rate': questions + urgency + personalized + emoji,
    'subject_type': ['Questions']*5 + ['Urgency']*5 + ['Personalized']*5 + ['Emoji']*5
})

# === First, run ANOVA (prerequisite) ===
groups = [questions, urgency, personalized, emoji]
f_stat, p_value = stats.f_oneway(*groups)

print("=== ANOVA (Prerequisite) ===")
print(f"F-statistic: {f_stat:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("✓ ANOVA significant - proceed with post-hoc tests\n")

    # === Tukey HSD (Honestly Significant Difference) ===
    print("=== TUKEY HSD POST-HOC TEST ===")
    tukey = pairwise_tukeyhsd(
        endog=data['open_rate'],
        groups=data['subject_type'],
        alpha=0.05
    )
    print(tukey)

    # === Bonferroni correction ===
    print("\n=== BONFERRONI CORRECTION ===")
    mc = MultiComparison(data['open_rate'], data['subject_type'])
    bonferroni_result = mc.allpairtest(stats.ttest_ind, method='bonf')
    print(bonferroni_result[0])

else:
    print("✗ ANOVA not significant - don't run post-hoc tests")

# === Group summaries ===
print("\n=== GROUP SUMMARIES ===")
summary = data.groupby('subject_type')['open_rate'].agg(['mean', 'std', 'count'])
print(summary)

How to interpret:

  • Tukey output: "reject=True" means those groups significantly differ
  • Bonferroni: More conservative, divides α by number of comparisons
  • Only run if ANOVA is significant! (p < 0.05)
🎯

Logistic Regression

Predict binary outcomes (yes/no, 0/1, convert/don't convert)

💡

Predict whether a customer will purchase based on age, income, and website visits

# === LOGISTIC REGRESSION ===
# Predict binary outcome (0/1, yes/no)

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import statsmodels.api as sm

# Sample data: Customer purchases
data = pd.DataFrame({
    'purchased': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    'age': [35, 28, 42, 50, 22, 38, 25, 45, 55, 30, 40, 26, 48, 24, 52],
    'income': [65, 45, 75, 85, 35, 70, 40, 80, 95, 50, 72, 42, 88, 38, 90],  # $1000s
    'visits': [5, 2, 8, 10, 1, 6, 3, 9, 12, 4, 7, 2, 11, 1, 10]
})

# Prepare data
X = data[['age', 'income', 'visits']]
y = data['purchased']

# === METHOD 1: sklearn (quick predictions) ===
model = LogisticRegression()
model.fit(X, y)

# Predictions (in-sample: evaluated on the training data, fine for a quick sketch)
y_pred = model.predict(X)
y_prob = model.predict_proba(X)[:, 1]  # Probability of class 1

print("=== LOGISTIC REGRESSION RESULTS ===")
print(f"Coefficients:")
for name, coef in zip(X.columns, model.coef_[0]):
    odds_ratio = np.exp(coef)
    print(f"  {name}: β={coef:.3f}, OR={odds_ratio:.2f}")

# === Confusion Matrix ===
print("\n=== CONFUSION MATRIX ===")
cm = confusion_matrix(y, y_pred)
print(cm)
print(f"\nTrue Negatives: {cm[0,0]}, False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}, True Positives: {cm[1,1]}")

# === Classification Metrics ===
print("\n=== CLASSIFICATION REPORT ===")
print(classification_report(y, y_pred, target_names=['No Purchase', 'Purchase']))

# Accuracy
accuracy = (y == y_pred).mean()
print(f"\nAccuracy: {accuracy:.2%}")

# AUC-ROC
try:
    auc = roc_auc_score(y, y_prob)
    print(f"AUC-ROC: {auc:.3f}")
except ValueError:
    print("AUC-ROC: Need more variation in data")

# === METHOD 2: statsmodels (detailed stats & p-values) ===
print("\n=== DETAILED STATISTICS ===")
X_with_intercept = sm.add_constant(X)
logit_model = sm.Logit(y, X_with_intercept).fit()
print(logit_model.summary())

# === Make prediction for new customer ===
new_customer = pd.DataFrame({
    'age': [40],
    'income': [75],
    'visits': [8]
})
prob = model.predict_proba(new_customer)[0, 1]
print(f"\n=== PREDICTION ===")
print(f"Probability of purchase: {prob:.1%}")

How to interpret:

  • Odds Ratio (OR): OR=2.0 means doubling the odds per unit increase
  • Confusion Matrix: Shows correct/incorrect predictions
  • Precision: Of predicted "Yes", how many are really "Yes"
  • Recall: Of actual "Yes", how many did we catch
  • AUC-ROC: 0.5=random, 1.0=perfect
🔵

Cluster Analysis (K-Means)

Find natural groups/segments in your data without predefined labels

💡

Segment customers into groups based on age, income, and purchase behavior

# === K-MEANS CLUSTERING ===
# Find natural customer segments

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Sample data: Customer information
data = pd.DataFrame({
    'customer_id': range(1, 21),
    'age': [25, 45, 32, 52, 28, 48, 35, 55, 22, 50, 30, 46, 27, 53, 33, 49, 24, 51, 29, 47],
    'income': [35, 75, 42, 85, 38, 78, 45, 90, 32, 80, 40, 72, 36, 88, 44, 76, 34, 82, 39, 74],  # $1000s
    'purchase_freq': [2, 12, 3, 15, 2, 13, 4, 18, 1, 14, 3, 11, 2, 16, 4, 12, 2, 15, 3, 13]  # per year
})

# Select features for clustering
features = ['age', 'income', 'purchase_freq']
X = data[features]

# === IMPORTANT: Standardize features (required for K-means!) ===
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# === Find optimal number of clusters (Elbow Method) ===
wcss = []  # Within-Cluster Sum of Squares
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

print("=== ELBOW PLOT DATA ===")
for k, score in zip(K_range, wcss):
    print(f"k={k}: WCSS={score:.2f}")

# Plot elbow curve (look for the "elbow")
# plt.plot(K_range, wcss, 'bo-')
# plt.xlabel('Number of Clusters (k)')
# plt.ylabel('WCSS')
# plt.title('Elbow Method')
# plt.show()

# === Run K-means with chosen k (let's say k=3) ===
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(X_scaled)

print(f"\n=== K-MEANS WITH k={k} ===")

# === Cluster centroids (in original scale) ===
centroids_scaled = kmeans.cluster_centers_
centroids = scaler.inverse_transform(centroids_scaled)

print("\nCluster Centroids (average values):")
centroid_df = pd.DataFrame(centroids, columns=features)
centroid_df.index.name = 'Cluster'
print(centroid_df)

# === Cluster sizes ===
print("\n=== CLUSTER SIZES ===")
cluster_counts = data['cluster'].value_counts().sort_index()
print(cluster_counts)

# === Cluster summaries ===
print("\n=== CLUSTER SUMMARIES ===")
for cluster_id in range(k):
    cluster_data = data[data['cluster'] == cluster_id]
    print(f"\nCluster {cluster_id} ({len(cluster_data)} customers):")
    print(cluster_data[features].describe())

# === Interpret clusters ===
print("\n=== CLUSTER INTERPRETATION ===")
for idx, row in centroid_df.iterrows():
    age = "Young" if row['age'] < 35 else "Middle-aged" if row['age'] < 50 else "Mature"
    income = "low-income" if row['income'] < 45 else "mid-income" if row['income'] < 70 else "high-income"
    freq = "infrequent" if row['purchase_freq'] < 5 else "regular" if row['purchase_freq'] < 12 else "frequent"
    print(f"Cluster {idx}: {age}, {income}, {freq} buyers")

# === Assign new customer to cluster ===
new_customer = pd.DataFrame({
    'age': [40],
    'income': [60],
    'purchase_freq': [8]
})
new_customer_scaled = scaler.transform(new_customer)
cluster_assignment = kmeans.predict(new_customer_scaled)[0]
print(f"\n=== NEW CUSTOMER ASSIGNMENT ===")
print(f"New customer assigned to Cluster {cluster_assignment}")

How to interpret:

  • Elbow plot: Look for "elbow" where adding clusters stops helping much
  • Centroids: Average values for each cluster (the "typical" customer)
  • WCSS: Lower is tighter clusters (but watch for overfitting)
  • ⚠ Always standardize! Features with larger scales will dominate
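The elbow plot is one heuristic; the silhouette score (from sklearn, which the clustering example already uses) gives a complementary, scale-free check on how well-separated the clusters are. A sketch on the same standardized customer features:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Same customer features as in the clustering example above
X = pd.DataFrame({
    'age': [25, 45, 32, 52, 28, 48, 35, 55, 22, 50, 30, 46, 27, 53, 33, 49, 24, 51, 29, 47],
    'income': [35, 75, 42, 85, 38, 78, 45, 90, 32, 80, 40, 72, 36, 88, 44, 76, 34, 82, 39, 74],
    'purchase_freq': [2, 12, 3, 15, 2, 13, 4, 18, 1, 14, 3, 11, 2, 16, 4, 12, 2, 15, 3, 13]
})
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score: -1 to 1, higher = tighter, better-separated clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

When the elbow is ambiguous, pick the k with the highest silhouette score.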
📋

Quick Reference Table

Test          Use When                 Sample Size   Python Function
T-Test        2 groups, continuous     Small-Medium  stats.ttest_ind()
ANOVA         3+ groups, continuous    Small-Medium  stats.f_oneway()
Chi-Square    Categories related?      Any           stats.chi2_contingency()
Correlation   Linear relationship      Medium-Large  stats.pearsonr()
Regression    Predict one from other   Medium-Large  stats.linregress()
Mann-Whitney  2 groups, skewed data    Small-Medium  stats.mannwhitneyu()
Z-Test        Compare %s, rates        Large (>30)   Manual calc
Bayesian      A/B test probability     Any           stats.beta

💡 Pro Tips

Before Running Tests:

  • ✓ Plot your data first
  • ✓ Check for outliers
  • ✓ Consider sample size
  • ✓ Think about effect size

After Getting Results:

  • ✓ Report effect size, not just p-value
  • ✓ Include confidence intervals
  • ✓ Think about practical significance
  • ✓ Be honest about limitations

⚠️ Common Pitfalls:

  • ❌ P-hacking: Running tests until you find p<0.05
  • ❌ Multiple comparisons without correction
  • ❌ Assuming causation from correlation
  • ❌ Using wrong test for your data type
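For the multiple-comparisons pitfall above, statsmodels can adjust a whole batch of p-values in one call. A sketch with made-up p-values; Holm's method is a uniformly more powerful drop-in for plain Bonferroni:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from five separate A/B comparisons
p_values = [0.012, 0.034, 0.041, 0.20, 0.003]

# Holm step-down correction at overall alpha = 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f} -> adjusted p={p_adj:.3f} {'✓' if sig else '✗'}")
```

Note how two of the three "significant" raw p-values survive the correction here.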

Made with ❤️ for marketing students and data analysts

Last Updated: January 2025 • MIT License