Comparing Python packages for A/B test analysis: tea-tasting, Pingouin, statsmodels, and SciPy
Tags: a/b testing, statistics, tea-tasting, python
Disclosure: I am also the author of tea-tasting.
This article compares four Python packages that are relevant to A/B test analysis: tea-tasting, Pingouin, statsmodels, and SciPy. It does not try to pick a universal winner. Instead, it clarifies what each package does well for common experimentation tasks and how much manual work is needed to produce production-style A/B test outputs.
It assumes familiarity with A/B testing basics, including randomization, p-values, and confidence intervals.
A/B test setting and analysis requirements #
A/B tests in a nutshell #
An A/B test compares two (or more) variants of a product change by randomly assigning experimental units to variants and measuring outcomes. In online experiments, the randomization unit is usually the user, and the standard assumption is that units are independent.
A typical workflow is:
- Design the experiment: choose the randomization unit (usually users), define the target population, and estimate sample size and duration with power analysis.
- Run the experiment: ship the treatment, randomize traffic, and collect data.
- Analyze and interpret results: compute control and treatment metric values, estimate effects with confidence intervals, and report p-values.
Good references for this mindset include books and papers by Ron Kohavi and Alex Deng, especially on trustworthy experimentation, delta-method metrics, and CUPED.
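The sample-size part of the design step can be sketched with the standard normal-approximation formula for a two-sample test of means. This is a minimal illustration, not taken from any of the compared packages; the helper name and the alpha/power defaults are illustrative assumptions.

```python
from scipy import stats


def sample_size_per_variant(sigma, mde, alpha=0.05, power=0.8):
    """Approximate users per variant for a two-sample test of means.

    sigma: standard deviation of the metric.
    mde: minimum detectable absolute effect.
    Uses n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / mde^2.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 * sigma**2 / mde**2


# Detecting an absolute lift of 0.05 in a metric with standard
# deviation 1.0 requires roughly 6,280 users per variant.
n = sample_size_per_variant(sigma=1.0, mde=0.05)
```

Power analysis tooling in the packages below automates variations of this calculation; the formula is shown only to make the design step concrete.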
Typical metric types and tests #
The table below summarizes common A/B test metric families and the tests usually applied.

| Metric type | Examples | Typical test |
|---|---|---|
| Average | Average revenue per user, average orders per user | Welch's t-test (unequal-variance two-sample t-test) |
| Ratio of averages | Average revenue per order, average orders per session | Welch's t-test with variance from the delta method |
| Proportion | Proportion of users with at least one order | Asymptotic tests (Z-test, G-test, Pearson's chi-squared) or exact tests (Boschloo, Barnard, Fisher) |
Typical per-metric analysis output #
For each metric, analysts usually want the same core fields:
- Metric value in control.
- Metric value in treatment.
- Effect size estimate and confidence interval (absolute and/or relative).
- P-value.
This output format is what makes A/B test analysis convenient for repeated use across many experiments.
A/B testing specifics #
Some details matter a lot in real experimentation workflows.
- Relative effect size is often easier to interpret than absolute effect size. A common mistake is to divide an absolute confidence interval by the control mean. That does not generally produce a valid confidence interval for a relative effect. Use the delta method or Fieller's theorem instead (both are standard approaches for ratio-style uncertainty).
- Variance reduction (especially CUPED, which uses pre-experiment covariates) is widely used to increase power. CUPED for ratio metrics is more subtle because it combines variance reduction with ratio estimators and delta-method variance.
- Multiple hypothesis testing correction becomes important when you track many metrics or compare multiple variants. In practice, teams usually need a clear FWER/FDR policy and adjusted p-values (or adjusted significance thresholds) in experiment reports.
- Efficiency matters. Teams often analyze many metrics across many experiments. For many tests, only aggregated statistics are required (count, mean, variance, covariance). In those cases, it is usually more efficient to compute aggregates in the data backend and send only summary statistics to Python. A convenience layer for aggregate fetching can reduce both code and latency.
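To make the relative-effect point concrete, here is a minimal delta-method sketch for the confidence interval of a relative effect between two independent means. The summary statistics in the usage example are illustrative, and the helper name is hypothetical; this is a sketch of the approach, not a validated implementation from any of the compared packages.

```python
import math


def relative_effect_ci(mean_t, var_t, n_t, mean_c, var_c, n_c, z=1.959964):
    """Delta-method CI for the relative effect mean_t / mean_c - 1.

    Assumes independent treatment and control samples; var_t and var_c
    are sample variances of the underlying per-unit observations.
    """
    ratio = mean_t / mean_c
    # Delta-method variance of the ratio of two independent sample means.
    var_ratio = var_t / n_t / mean_c**2 + mean_t**2 * var_c / n_c / mean_c**4
    half_width = z * math.sqrt(var_ratio)
    return ratio - 1 - half_width, ratio - 1 + half_width


# Illustrative numbers loosely mirroring a revenue-per-user metric.
low, high = relative_effect_ci(5.64, 160.0, 2_500, 5.06, 150.0, 2_500)
```

Note that this interval is generally not the same as the absolute-effect interval divided by the control mean, which is exactly the mistake described above.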
Example data setup #
With that context, the examples below use the same synthetic experiment dataset generated with tea-tasting. They intentionally do not use variance reduction (no CUPED), so the package comparisons stay focused on baseline analysis.
import tea_tasting as tt

data = tt.make_users_data(return_type="pandas", rng=42, n_users=5_000)
data["has_order"] = (data["orders"] > 0).astype(int)
control = data[data["variant"] == 0]
treatment = data[data["variant"] == 1]
print(data.head(3).to_string(index=False))
# user variant sessions orders revenue has_order
# 0 1 1 1 5.77 1
# 1 0 3 0 0.00 0
# 2 1 1 0 0.00 0

The metrics used in the examples are:

- `orders_per_user`: average orders per user.
- `users_with_orders`: proportion of users with at least one order.
- `revenue_per_user`: average revenue per user.
- `revenue_per_order`: average revenue per order (ratio of averages).
All package examples below reuse data, control, and treatment from this setup.
Package-by-package comparison #
tea-tasting #
tea-tasting is a package specifically designed for A/B test analysis. It targets experimentation workflows directly, with metrics, relative effects, CUPED, power analysis, and concise experiment-style outputs.
Best for: teams that want an A/B-testing-first workflow with minimal glue code.
This is the most compact example among the four packages because tea-tasting provides A/B-specific metric classes and a high-level Experiment API.
experiment = tt.Experiment(
    orders_per_user=tt.Mean("orders"),
    users_with_orders=tt.Proportion("has_order", correction=False),
    revenue_per_user=tt.Mean("revenue"),
    revenue_per_order=tt.RatioOfMeans("revenue", "orders"),
)
result = experiment.analyze(data)
print(result)
# metric control treatment rel_effect_size rel_effect_size_ci pvalue
# orders_per_user 0.511 0.556 8.8% [-0.74%, 19%] 0.0718
# users_with_orders 0.334 0.352 5.4% [-2.4%, 14%] 0.181
# revenue_per_user 5.06 5.64 11% [0.22%, 24%] 0.0455
# revenue_per_order 9.91 10.2 2.5% [-3.0%, 8.3%] 0.389

A/B testing specifics:

- Power analysis: built-in (`Experiment.solve_power` and metric parameters for effect size, relative effect size, and sample size).
- Relative effect and confidence interval: first-class output (including the relative CI in the printed result).
- CUPED: built-in for averages and ratio metrics.
- Multiple hypothesis testing correction: built-in (`tt.adjust_fdr` and `tt.adjust_fwer`) for experiment results, including FDR and FWER procedures.
- Aggregated statistics workflow: built-in support via metrics that expose required aggregates and integration with data backends (e.g., through Ibis-supported engines).
Pingouin #
Pingouin is a user-friendly statistical package focused on convenient inferential statistics in pandas-centric workflows. It is strong for common tests and effect sizes, but it is not an A/B-specific framework.
Best for: quick pandas-based analyses of standard statistical tests.
Pingouin has a convenient t-test interface and a contingency-table chi-squared helper. It does not provide a built-in ratio-of-averages test with delta-method variance, so the revenue_per_order example is omitted.
import pingouin as pg
orders_test = pg.ttest(
    treatment["orders"],
    control["orders"],
    correction=True,
).iloc[0]
print(
    "orders_per_user: "
    f"control={control['orders'].mean():.3f} "
    f"treatment={treatment['orders'].mean():.3f} "
    f"effect_size="
    f"{treatment['orders'].mean() - control['orders'].mean():.3f} "
    f"effect_size_ci={orders_test['CI95%']} "
    f"pvalue={orders_test['p-val']:.4f}"
)
# orders_per_user: control=0.511 treatment=0.556 effect_size=0.045 effect_size_ci=[-0. 0.09] pvalue=0.0718
_, _, tests = pg.chi2_independence(
    data,
    x="variant",
    y="has_order",
    correction=False,
)
pearson = tests.loc[tests["test"] == "pearson"].iloc[0]
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"effect_size="
    f"{treatment['has_order'].mean() - control['has_order'].mean():.3f} "
    f"pvalue={pearson['pval']:.4f}"
)
# users_with_orders: control=0.334 treatment=0.352 effect_size=0.018 pvalue=0.1811

Notes:

- `revenue_per_user` uses the same `pg.ttest(...)` pattern as `orders_per_user`.
- `revenue_per_order` (ratio of averages) requires manual derivation if you want a statistically correct delta-method analysis.
- The proportion example above shows a p-value, but not a built-in A/B-style effect CI.
A/B testing specifics:
- Power analysis: built-in for standard cases (`power_ttest`, `power_ttest2n`, `power_chi2`), but not an A/B-specific multi-metric workflow.
- Relative effect and confidence interval: mostly manual for A/B-style relative lift CIs.
- CUPED: no built-in CUPED abstraction. You would implement variance reduction manually (for example, with regression).
- Multiple hypothesis testing correction: built-in p-value adjustment via `pg.multicomp` (for example, Bonferroni, Holm, and FDR methods), but integration into an A/B reporting workflow is manual.
- Aggregated statistics workflow: mostly expects granular arrays/DataFrames; no built-in A/B aggregate interface.
statsmodels #
statsmodels is a broad statistical modeling library with strong hypothesis testing, power analysis, and confidence interval utilities. It is less opinionated than an experimentation-specific package, which is a strength if you want building blocks and explicit control.
Best for: analysts who want mature statistical building blocks and are comfortable assembling a workflow.
The example below uses Welch-style t-tests for two average metrics and a risk-ratio test/CI for the proportion metric. revenue_per_order is omitted because there is no built-in A/B-style ratio-of-averages delta-method helper.
from statsmodels.stats.proportion import (
confint_proportions_2indep,
test_proportions_2indep,
)
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW
def welch_summary(treatment_series, control_series):
    cm = CompareMeans(
        DescrStatsW(treatment_series),
        DescrStatsW(control_series),
    )
    _, pvalue, _ = cm.ttest_ind(usevar="unequal")
    ci_low, ci_high = cm.tconfint_diff(usevar="unequal")
    return (
        control_series.mean(),
        treatment_series.mean(),
        treatment_series.mean() - control_series.mean(),
        ci_low,
        ci_high,
        pvalue,
    )

for metric in ["orders", "revenue"]:
    ctrl, trt, effect, ci_low, ci_high, pvalue = welch_summary(
        treatment[metric],
        control[metric],
    )
print(
f"{metric}_per_user: "
f"control={ctrl:.3f} treatment={trt:.3f} effect_size={effect:.3f} "
f"effect_size_ci=[{ci_low:.3f}, {ci_high:.3f}] pvalue={pvalue:.4f}"
)
# orders_per_user: control=0.511 treatment=0.556 effect_size=0.045 effect_size_ci=[-0.004, 0.094] pvalue=0.0718
# revenue_per_user: control=5.062 treatment=5.641 effect_size=0.579 effect_size_ci=[0.012, 1.146] pvalue=0.0455
count1 = int(treatment["has_order"].sum())
nobs1 = len(treatment)
count0 = int(control["has_order"].sum())
nobs0 = len(control)
prop_test = test_proportions_2indep(
    count1=count1,
    nobs1=nobs1,
    count2=count0,
    nobs2=nobs0,
    compare="ratio",
    method="log",
)
prop_ci = confint_proportions_2indep(
    count1=count1,
    nobs1=nobs1,
    count2=count0,
    nobs2=nobs0,
    compare="ratio",
    method="log",
)
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"rel_effect_size={prop_test.ratio - 1:.3f} "
    f"rel_effect_size_ci=[{prop_ci[0] - 1:.3f}, {prop_ci[1] - 1:.3f}] "
    f"pvalue={prop_test.pvalue:.4f}"
)
# users_with_orders: control=0.334 treatment=0.352 rel_effect_size=0.054 rel_effect_size_ci=[-0.024, 0.138] pvalue=0.1811

A/B testing specifics:

- Power analysis: strong built-in support (`TTestIndPower`, `NormalIndPower`, `GofChisquarePower`, and more).
- Relative effect and confidence interval: partial. There are built-in options for some cases (for example, risk ratios for proportions), but no general A/B-style relative lift CI interface across metric types.
- CUPED: no built-in one-call CUPED API, but it is practical to implement manually with regression tooling.
- Multiple hypothesis testing correction: strong built-in support (`statsmodels.stats.multitest`, including `multipletests` and `fdrcorrection`).
- Aggregated statistics workflow: partial. Proportion tests work directly from counts/sample sizes; other workflows often still require granular arrays or more manual setup.
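As a short illustration of assembling the multiple-testing step with statsmodels, the sketch below adjusts the four example p-values from this article with Holm's FWER-controlling procedure. The choice of method and alpha here is an illustrative assumption.

```python
from statsmodels.stats.multitest import multipletests

# P-values from the four example metrics, in order:
# orders_per_user, users_with_orders, revenue_per_user, revenue_per_order.
pvalues = [0.0718, 0.181, 0.0455, 0.389]

# Holm's step-down procedure controls the family-wise error rate
# without independence assumptions.
reject, pvals_adjusted, _, _ = multipletests(pvalues, alpha=0.05, method="holm")
```

After adjustment, none of the four metrics stays significant at the 0.05 level, including revenue_per_user, whose unadjusted p-value was below 0.05.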
SciPy #
SciPy is a foundational scientific computing and statistics package used directly or indirectly by many higher-level libraries, including the others in this comparison. It provides robust hypothesis tests and exact tests, but it does not provide a high-level A/B testing workflow.
Best for: low-level building blocks and custom A/B analysis code.
This snippet shows a Welch t-test for orders_per_user and a Pearson chi-squared test for users_with_orders. Exact tests such as Fisher, Barnard, and Boschloo are also available in SciPy.
import numpy as np
from scipy import stats
orders_test = stats.ttest_ind(
    treatment["orders"],
    control["orders"],
    equal_var=False,
)
orders_ci = orders_test.confidence_interval()
print(
    "orders_per_user: "
    f"control={control['orders'].mean():.3f} "
    f"treatment={treatment['orders'].mean():.3f} "
    f"effect_size="
    f"{treatment['orders'].mean() - control['orders'].mean():.3f} "
    f"effect_size_ci=[{orders_ci.low:.3f}, {orders_ci.high:.3f}] "
    f"pvalue={orders_test.pvalue:.4f}"
)
# orders_per_user: control=0.511 treatment=0.556 effect_size=0.045 effect_size_ci=[-0.004, 0.094] pvalue=0.0718
contingency = np.array(
    [
        [(control["has_order"] == 0).sum(), (control["has_order"] == 1).sum()],
        [
            (treatment["has_order"] == 0).sum(),
            (treatment["has_order"] == 1).sum(),
        ],
    ]
)
chi2_res = stats.contingency.chi2_contingency(contingency, correction=False)
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"effect_size="
    f"{treatment['has_order'].mean() - control['has_order'].mean():.3f} "
    f"pvalue={chi2_res.pvalue:.4f}"
)
# users_with_orders: control=0.334 treatment=0.352 effect_size=0.018 pvalue=0.1811

Notes:

- `revenue_per_user` uses the same `stats.ttest_ind(..., equal_var=False)` pattern as `orders_per_user`.
- `revenue_per_order` (ratio of averages) requires a manual delta-method implementation.
- Relative effect confidence intervals are also manual.
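A manual delta-method analysis of a ratio-of-averages metric can be sketched as follows. This is an illustrative z-test sketch under the delta-method variance formula for a ratio of two sample means, not a production implementation; the tiny arrays in the usage example exist only to exercise the arithmetic.

```python
import math

from scipy import stats


def ratio_of_means_test(x_t, y_t, x_c, y_c):
    """Two-sample z-test for a ratio-of-averages metric, e.g.
    revenue per order = mean(revenue) / mean(orders) per variant.

    x_*, y_*: per-user numerator and denominator values for each variant.
    """
    def ratio_and_var(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        vx = sum((v - mx) ** 2 for v in x) / (n - 1)
        vy = sum((v - my) ** 2 for v in y) / (n - 1)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
        # Delta-method variance of mean(x) / mean(y).
        var = (vx / my**2 - 2 * mx * cov / my**3 + mx**2 * vy / my**4) / n
        return mx / my, var

    r_t, v_t = ratio_and_var(x_t, y_t)
    r_c, v_c = ratio_and_var(x_c, y_c)
    z = (r_t - r_c) / math.sqrt(v_t + v_c)
    pvalue = 2 * stats.norm.sf(abs(z))
    return r_t - r_c, pvalue


effect, pvalue = ratio_of_means_test(
    [2, 4, 6, 9], [1, 2, 3, 4],  # treatment: revenue, orders per user
    [1, 2, 3, 5], [1, 2, 3, 4],  # control: revenue, orders per user
)
```

The covariance term is what distinguishes this from treating numerator and denominator as independent: for metrics like revenue per order, per-user revenue and order counts are strongly correlated.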
A/B testing specifics:
- Power analysis: mostly manual in SciPy (or delegated to custom code / another package).
- Relative effect and confidence interval: manual.
- CUPED: manual.
- Multiple hypothesis testing correction: partial. SciPy provides `scipy.stats.false_discovery_control` for BH/BY FDR adjustment, but broader multiple-comparison correction workflows are more limited than in statsmodels.
- Aggregated statistics workflow: partial. SciPy supports some summary-statistics and contingency-table tests (for example, `ttest_ind_from_stats` and chi-squared/exact tests on contingency tables), but not an A/B-specific aggregate workflow.
Feature comparison for A/B testing #
SciPy underpins much of the Python statistics ecosystem. In principle, all the capabilities discussed here can be implemented with NumPy + SciPy plus custom code. The practical question is convenience and code verbosity. For that reason, the table below uses three labels:
- built-in: directly supported in a way that fits common A/B analysis tasks.
- partial: some built-in support exists, but not as a complete or ergonomic A/B workflow.
- manual: possible, but requires custom implementation/glue code.
| Feature | tea-tasting | Pingouin | statsmodels | SciPy |
|---|---|---|---|---|
| Power analysis to estimate required number of observations | built-in | built-in | built-in | manual |
| Welch's t-test or Student's t-test for analysis of averages | built-in | built-in | built-in | built-in |
| Welch's t-test with delta method for analysis of ratios of averages | built-in | manual | manual | manual |
| Two-sample proportion Z-test, G-test, or Pearson's chi-squared test | built-in | built-in | built-in | built-in |
| Relative effect size confidence intervals | built-in | manual | partial | manual |
| Variance reduction with CUPED for analysis of averages and ratios of averages | built-in | manual | manual | manual |
| Multiple hypothesis testing correction (FWER/FDR p-value adjustment) | built-in | built-in | built-in | partial |
| Working with aggregated statistics instead of granular data | built-in | manual | partial | partial |
Conclusions #
The four packages sit at different levels of abstraction.
- tea-tasting is the most A/B-specific option in this group. It is designed around metrics, experiments, relative effects, CUPED, and aggregate-based workflows.
- Pingouin is convenient for standard statistical tests and quick analysis in pandas, but A/B-specific workflows (especially ratio metrics, relative CIs, and CUPED) are mostly manual.
- statsmodels provides strong statistical building blocks and power analysis. It is a good choice when you want explicit control and are willing to assemble an experimentation workflow yourself.
- SciPy is the essential foundation. It can support almost everything with enough custom code, but it is the most verbose option for repeated A/B test reporting.
If you run many experiments with multiple metrics and need consistent outputs, the main differentiator is not just statistical correctness. It is how much A/B-specific workflow a package gives you out of the box.
Inclusion criteria for the comparison #
For transparency, here are the minimum criteria I used to decide which packages to include.
- Maintained: a recent release (for example, within a year) and recent commits (for example, within two months) reduce the risk of stale APIs and unresolved compatibility issues.
- Well documented: a package should have both a user guide and an API reference, because A/B testing code is often reused by analysts with different levels of statistical depth.
- Used by a community: a practical heuristic is at least 100 GitHub stars. This is not a quality guarantee, but it often means more examples and more edge cases have already been surfaced.
Scope note: This comparison focuses on frequentist A/B testing workflows. Bayesian-first experimentation frameworks are out of scope.
Excluded notable packages (and why):
- spotify_confidence: excluded because it lacks documentation and has not added significant new features in the last couple of years.
- ambrosia: excluded because it has not added new features in the last couple of years, aside from dependency version updates.
Note: The maintenance and documentation notes in this section are assessed as of March 1, 2026.
Resources #
The comparison above is based on the public documentation and APIs of the packages as of March 1, 2026. Current stable versions on PyPI at the time of writing:
- tea-tasting: 1.12.0, PyPI.
- Pingouin: 0.6.0, PyPI.
- statsmodels: 0.14.6, PyPI.
- SciPy stats: 1.17.1, PyPI.
A/B testing and statistics references:
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy online controlled experiments: A practical guide to A/B testing.
- Deng, A., Knoblich, U., & Lu, J. (2018). Applying the delta method in metric analytics: A practical guide with novel ideas.
- Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.
- Multiple comparisons problem (Wikipedia).
© Evgeny Ivanov 2026