Comparing Python packages for A/B test analysis: tea-tasting, Pingouin, statsmodels, and SciPy
Tags: a/b testing, statistics, tea-tasting, python
Disclosure: I am also the author of tea-tasting.
This article compares four Python packages that are relevant to A/B test analysis: tea-tasting, Pingouin, statsmodels, and SciPy. It does not try to pick a universal winner. Instead, it clarifies what each package does well for common experimentation tasks and how much manual work is needed to produce production-style A/B test outputs.
It assumes familiarity with A/B testing basics, including randomization, p-values, and confidence intervals.
A/B test setting and analysis requirements #
A/B tests in a nutshell #
An A/B test compares two (or more) variants of a product change by randomly assigning experimental units to variants and measuring outcomes. In online experiments, the randomization unit is usually the user, and the standard assumption is that units are independent.
A typical workflow is:
- Design the experiment: choose the randomization unit (usually users), define the target population, and estimate sample size and duration with power analysis.
- Run the experiment: ship the treatment, randomize traffic, and collect data.
- Analyze and interpret results: compute control and treatment metric values, estimate effects with confidence intervals, and report p-values.
Good references for this mindset include books and papers by Ron Kohavi and Alex Deng, especially on trustworthy experimentation, delta-method metrics, and CUPED.
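The sample-size part of the design step can be sketched with the standard normal-approximation formula for a two-sample test of means. This is a minimal illustration, not taken from any of the compared packages; the helper name and the alpha/power defaults are illustrative assumptions.

```python
from scipy import stats


def sample_size_per_variant(sigma, mde, alpha=0.05, power=0.8):
    """Approximate users per variant for a two-sample test of means.

    sigma: standard deviation of the metric.
    mde: minimum detectable absolute effect.
    Uses n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / mde^2.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 * sigma**2 / mde**2


# Detecting an absolute lift of 0.05 in a metric with standard
# deviation 1.0 requires roughly 6,280 users per variant.
n = sample_size_per_variant(sigma=1.0, mde=0.05)
```

Power analysis tooling in the packages below automates variations of this calculation; the formula is shown only to make the design step concrete.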
Typical metric types and tests #
The table below summarizes common A/B test metric families and the tests usually applied.

| Metric type | Examples | Typical test |
|---|---|---|
| Average | Average revenue per user, average orders per user | Welch's t-test (unequal-variance two-sample t-test) |
| Ratio of averages | Average revenue per order, average orders per session | Welch's t-test with variance from the delta method |
| Proportion | Proportion of users with at least one order | Asymptotic tests (Z-test, G-test, Pearson's chi-squared) or exact tests (Boschloo, Barnard, Fisher) |
Typical per-metric analysis output #
For each metric, analysts usually want the same core fields:
- Metric value in control.
- Metric value in treatment.
- Effect size estimate and confidence interval (absolute and/or relative).
- P-value.
This output format is what makes A/B test analysis convenient for repeated use across many experiments.
A/B testing specifics #
Some details matter a lot in real experimentation workflows.
- Relative effect size is often easier to interpret than absolute effect size. A common mistake is to divide an absolute confidence interval by the control mean. That does not generally produce a valid confidence interval for a relative effect. Use the delta method or Fieller's theorem instead (both are standard approaches for ratio-style uncertainty).
- Variance reduction (especially CUPED, which uses pre-experiment covariates) is widely used to increase power. CUPED for ratio metrics is more subtle because it combines variance reduction with ratio estimators and delta-method variance.
- Multiple hypothesis testing correction becomes important when you track many metrics or compare multiple variants. In practice, teams usually need a clear FWER/FDR policy and adjusted p-values (or adjusted significance thresholds) in experiment reports.
- Efficiency matters. Teams often analyze many metrics across many experiments. For many tests, only aggregated statistics are required (count, mean, variance, covariance). In those cases, it is usually more efficient to compute aggregates in the data backend and send only summary statistics to Python. A convenience layer for aggregate fetching can reduce both code and latency.
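To make the relative-effect point concrete, here is a minimal delta-method sketch for the confidence interval of a relative effect between two independent means. The summary statistics in the usage example are illustrative, and the helper name is hypothetical; this is a sketch of the approach, not a validated implementation from any of the compared packages.

```python
import math


def relative_effect_ci(mean_t, var_t, n_t, mean_c, var_c, n_c, z=1.959964):
    """Delta-method CI for the relative effect mean_t / mean_c - 1.

    Assumes independent treatment and control samples; var_t and var_c
    are sample variances of the underlying per-unit observations.
    """
    ratio = mean_t / mean_c
    # Delta-method variance of the ratio of two independent sample means.
    var_ratio = var_t / n_t / mean_c**2 + mean_t**2 * var_c / n_c / mean_c**4
    half_width = z * math.sqrt(var_ratio)
    return ratio - 1 - half_width, ratio - 1 + half_width


# Illustrative numbers loosely mirroring a revenue-per-user metric.
low, high = relative_effect_ci(5.64, 160.0, 2_500, 5.06, 150.0, 2_500)
```

Note that this interval is generally not the same as the absolute-effect interval divided by the control mean, which is exactly the mistake described above.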
Example data setup #
With that context, the examples below use the same synthetic experiment dataset generated with tea-tasting. They intentionally do not use variance reduction (no CUPED), so the package comparisons stay focused on baseline analysis.
import tea_tasting as tt

data = tt.make_users_data(return_type="pandas", rng=42, n_users=5_000)
data["has_order"] = (data["orders"] > 0).astype(int)
control = data[data["variant"] == 0]
treatment = data[data["variant"] == 1]
print(data.head(3).to_string(index=False))
# user variant sessions orders revenue has_order
# 0 1 1 1 5.77 1
# 1 0 3 0 0.00 0
# 2 1 1 0 0.00 0

The metrics used in the examples are:

- `orders_per_user`: average orders per user.
- `users_with_orders`: proportion of users with at least one order.
- `revenue_per_user`: average revenue per user.
- `revenue_per_order`: average revenue per order (ratio of averages).
All package examples below reuse data, control, and treatment from this setup.
Package-by-package comparison #
tea-tasting #
tea-tasting is a package specifically designed for A/B test analysis. It targets experimentation workflows directly, with metrics, relative effects, CUPED, power analysis, and concise experiment-style outputs.
Best for: teams that want an A/B-testing-first workflow with minimal glue code.
This is the most compact example among the four packages because tea-tasting provides A/B-specific metric classes and a high-level Experiment API.
experiment = tt.Experiment(
    orders_per_user=tt.Mean("orders"),
    users_with_orders=tt.Proportion("has_order", correction=False),
    revenue_per_user=tt.Mean("revenue"),
    revenue_per_order=tt.RatioOfMeans("revenue", "orders"),
)
result = experiment.analyze(data)
print(result)
# metric control treatment rel_effect_size rel_effect_size_ci pvalue
# orders_per_user 0.511 0.556 8.8% [-0.74%, 19%] 0.0718
# users_with_orders 0.334 0.352 5.4% [-2.4%, 14%] 0.181
# revenue_per_user 5.06 5.64 11% [0.22%, 24%] 0.0455
# revenue_per_order 9.91 10.2 2.5% [-3.0%, 8.3%] 0.389

A/B testing specifics:

- Power analysis: built-in (`Experiment.solve_power` and metric parameters for effect size, relative effect size, and sample size).
- Relative effect and confidence interval: first-class output (including the relative CI in the printed result).
- CUPED: built-in for averages and ratio metrics.
- Multiple hypothesis testing correction: built-in (`tt.adjust_fdr` and `tt.adjust_fwer`) for experiment results, including FDR and FWER procedures.
- Aggregated statistics workflow: built-in support via metrics that expose required aggregates and integration with data backends (e.g., through Ibis-supported engines).
Pingouin #
Pingouin is a user-friendly statistical package focused on convenient inferential statistics in pandas-centric workflows. It is strong for common tests and effect sizes, but it is not an A/B-specific framework.
Best for: quick pandas-based analyses of standard statistical tests.
Pingouin has a convenient t-test interface and a contingency-table chi-squared helper. It does not provide a built-in ratio-of-averages test with delta-method variance, so the revenue_per_order example is omitted.
import pingouin as pg
orders_test = pg.ttest(
    treatment["orders"],
    control["orders"],
    correction=True,
).iloc[0]
print(
    "orders_per_user: "
    f"control={control['orders'].mean():.3f} "
    f"treatment={treatment['orders'].mean():.3f} "
    f"effect_size="
    f"{treatment['orders'].mean() - control['orders'].mean():.3f} "
    f"effect_size_ci={orders_test['CI95%']} "
    f"pvalue={orders_test['p-val']:.4f}"
)
# orders_per_user: control=0.511 treatment=0.556 effect_size=0.045 effect_size_ci=[-0. 0.09] pvalue=0.0718
_, _, tests = pg.chi2_independence(
    data,
    x="variant",
    y="has_order",
    correction=False,
)
pearson = tests.loc[tests["test"] == "pearson"].iloc[0]
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"effect_size="
    f"{treatment['has_order'].mean() - control['has_order'].mean():.3f} "
    f"pvalue={pearson['pval']:.4f}"
)
# users_with_orders: control=0.334 treatment=0.352 effect_size=0.018 pvalue=0.1811

Notes:

- `revenue_per_user` uses the same `pg.ttest(...)` pattern as `orders_per_user`.
- `revenue_per_order` (ratio of averages) requires manual derivation if you want a statistically correct delta-method analysis.
- The proportion example above shows a p-value, but not a built-in A/B-style effect CI.
A/B testing specifics:
- Power analysis: built-in for standard cases (`power_ttest`, `power_ttest2n`, `power_chi2`), but not an A/B-specific multi-metric workflow.
- Relative effect and confidence interval: mostly manual for A/B-style relative lift CIs.
- CUPED: no built-in CUPED abstraction. You would implement variance reduction manually (for example, with regression).
- Multiple hypothesis testing correction: built-in p-value adjustment via `pg.multicomp` (for example, Bonferroni, Holm, and FDR methods), but integration into an A/B reporting workflow is manual.
- Aggregated statistics workflow: mostly expects granular arrays/DataFrames; no built-in A/B aggregate interface.
statsmodels #
statsmodels is a broad statistical modeling library with strong hypothesis testing, power analysis, and confidence interval utilities. It is less opinionated than an experimentation-specific package, which is a strength if you want building blocks and explicit control.
Best for: analysts who want mature statistical building blocks and are comfortable assembling a workflow.
The example below uses Welch-style t-tests for two average metrics and a risk-ratio test/CI for the proportion metric. revenue_per_order is omitted because there is no built-in A/B-style ratio-of-averages delta-method helper.
from statsmodels.stats.proportion import (
confint_proportions_2indep,
test_proportions_2indep,
)
from statsmodels.stats.weightstats import CompareMeans, DescrStatsW
def welch_summary(treatment_series, control_series):
    cm = CompareMeans(
        DescrStatsW(treatment_series),
        DescrStatsW(control_series),
    )
    _, pvalue, _ = cm.ttest_ind(usevar="unequal")
    ci_low, ci_high = cm.tconfint_diff(usevar="unequal")
    return (
        control_series.mean(),
        treatment_series.mean(),
        treatment_series.mean() - control_series.mean(),
        ci_low,
        ci_high,
        pvalue,
    )

for metric in ["orders", "revenue"]:
    ctrl, trt, effect, ci_low, ci_high, pvalue = welch_summary(
        treatment[metric],
        control[metric],
    )
print(
f"{metric}_per_user: "
f"control={ctrl:.3f} treatment={trt:.3f} effect_size={effect:.3f} "
f"effect_size_ci=[{ci_low:.3f}, {ci_high:.3f}] pvalue={pvalue:.4f}"
)
# orders_per_user: control=0.511 treatment=0.556 effect_size=0.045 effect_size_ci=[-0.004, 0.094] pvalue=0.0718
# revenue_per_user: control=5.062 treatment=5.641 effect_size=0.579 effect_size_ci=[0.012, 1.146] pvalue=0.0455
count1 = int(treatment["has_order"].sum())
nobs1 = len(treatment)
count0 = int(control["has_order"].sum())
nobs0 = len(control)
prop_test = test_proportions_2indep(
    count1=count1,
    nobs1=nobs1,
    count2=count0,
    nobs2=nobs0,
    compare="ratio",
    method="log",
)
prop_ci = confint_proportions_2indep(
    count1=count1,
    nobs1=nobs1,
    count2=count0,
    nobs2=nobs0,
    compare="ratio",
    method="log",
)
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"rel_effect_size={prop_test.ratio - 1:.3f} "
    f"rel_effect_size_ci=[{prop_ci[0] - 1:.3f}, {prop_ci[1] - 1:.3f}] "
    f"pvalue={prop_test.pvalue:.4f}"
)
# users_with_orders: control=0.334 treatment=0.352 rel_effect_size=0.054 rel_effect_size_ci=[-0.024, 0.138] pvalue=0.1811

A/B testing specifics:

- Power analysis: strong built-in support (`TTestIndPower`, `NormalIndPower`, `GofChisquarePower`, and more).
- Relative effect and confidence interval: partial. There are built-in options for some cases (for example, risk ratios for proportions), but no general A/B-style relative lift CI interface across metric types.
- CUPED: no built-in one-call CUPED API, but it is practical to implement manually with regression tooling.
- Multiple hypothesis testing correction: strong built-in support (`statsmodels.stats.multitest`, including `multipletests` and `fdrcorrection`).
- Aggregated statistics workflow: partial. Proportion tests work directly from counts/sample sizes; other workflows often still require granular arrays or more manual setup.
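As a short illustration of assembling the multiple-testing step with statsmodels, the sketch below adjusts the four example p-values from this article with Holm's FWER-controlling procedure. The choice of method and alpha here is an illustrative assumption.

```python
from statsmodels.stats.multitest import multipletests

# P-values from the four example metrics, in order:
# orders_per_user, users_with_orders, revenue_per_user, revenue_per_order.
pvalues = [0.0718, 0.181, 0.0455, 0.389]

# Holm's step-down procedure controls the family-wise error rate
# without independence assumptions.
reject, pvals_adjusted, _, _ = multipletests(pvalues, alpha=0.05, method="holm")
```

After adjustment, none of the four metrics stays significant at the 0.05 level, including revenue_per_user, whose unadjusted p-value was below 0.05.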
SciPy #
SciPy is a foundational scientific computing and statistics package used directly or indirectly by many higher-level libraries, including the others in this comparison. It provides robust hypothesis tests and exact tests, but it does not provide a high-level A/B testing workflow.
Best for: low-level building blocks and custom A/B analysis code.
This snippet shows a Welch t-test for orders_per_user and a Pearson chi-squared test for users_with_orders. Exact tests such as Fisher, Barnard, and Boschloo are also available in SciPy.
import numpy as np
from scipy import stats
orders_test = stats.ttest_ind(
    treatment["orders"],
    control["orders"],
    equal_var=False,
)
orders_ci = orders_test.confidence_interval()
print(
    "orders_per_user: "
    f"control={control['orders'].mean():.3f} "
    f"treatment={treatment['orders'].mean():.3f} "
    f"effect_size="
    f"{treatment['orders'].mean() - control['orders'].mean():.3f} "
    f"effect_size_ci=[{orders_ci.low:.3f}, {orders_ci.high:.3f}] "
    f"pvalue={orders_test.pvalue:.4f}"
)
# orders_per_user: control=0.511 treatment=0.556 effect_size=0.045 effect_size_ci=[-0.004, 0.094] pvalue=0.0718
contingency = np.array(
    [
        [(control["has_order"] == 0).sum(), (control["has_order"] == 1).sum()],
        [
            (treatment["has_order"] == 0).sum(),
            (treatment["has_order"] == 1).sum(),
        ],
    ]
)
chi2_res = stats.contingency.chi2_contingency(contingency, correction=False)
print(
    "users_with_orders: "
    f"control={control['has_order'].mean():.3f} "
    f"treatment={treatment['has_order'].mean():.3f} "
    f"effect_size="
    f"{treatment['has_order'].mean() - control['has_order'].mean():.3f} "
    f"pvalue={chi2_res.pvalue:.4f}"
)
# users_with_orders: control=0.334 treatment=0.352 effect_size=0.018 pvalue=0.1811

Notes:

- `revenue_per_user` uses the same `stats.ttest_ind(..., equal_var=False)` pattern as `orders_per_user`.
- `revenue_per_order` (ratio of averages) requires a manual delta-method implementation.
- Relative effect confidence intervals are also manual.
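A manual delta-method analysis of a ratio-of-averages metric can be sketched as follows. This is an illustrative z-test sketch under the delta-method variance formula for a ratio of two sample means, not a production implementation; the tiny arrays in the usage example exist only to exercise the arithmetic.

```python
import math

from scipy import stats


def ratio_of_means_test(x_t, y_t, x_c, y_c):
    """Two-sample z-test for a ratio-of-averages metric, e.g.
    revenue per order = mean(revenue) / mean(orders) per variant.

    x_*, y_*: per-user numerator and denominator values for each variant.
    """
    def ratio_and_var(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        vx = sum((v - mx) ** 2 for v in x) / (n - 1)
        vy = sum((v - my) ** 2 for v in y) / (n - 1)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
        # Delta-method variance of mean(x) / mean(y).
        var = (vx / my**2 - 2 * mx * cov / my**3 + mx**2 * vy / my**4) / n
        return mx / my, var

    r_t, v_t = ratio_and_var(x_t, y_t)
    r_c, v_c = ratio_and_var(x_c, y_c)
    z = (r_t - r_c) / math.sqrt(v_t + v_c)
    pvalue = 2 * stats.norm.sf(abs(z))
    return r_t - r_c, pvalue


effect, pvalue = ratio_of_means_test(
    [2, 4, 6, 9], [1, 2, 3, 4],  # treatment: revenue, orders per user
    [1, 2, 3, 5], [1, 2, 3, 4],  # control: revenue, orders per user
)
```

The covariance term is what distinguishes this from treating numerator and denominator as independent: for metrics like revenue per order, per-user revenue and order counts are strongly correlated.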
A/B testing specifics:
- Power analysis: mostly manual in SciPy (or delegated to custom code / another package).
- Relative effect and confidence interval: manual.
- CUPED: manual.
- Multiple hypothesis testing correction: partial. SciPy provides `scipy.stats.false_discovery_control` for BH/BY FDR adjustment, but broader multiple-comparison correction workflows are more limited than in statsmodels.
- Aggregated statistics workflow: partial. SciPy supports some summary-statistics and contingency-table tests (for example, `ttest_ind_from_stats` and chi-squared/exact tests on contingency tables), but not an A/B-specific aggregate workflow.
Feature comparison for A/B testing #
SciPy underpins much of the Python statistics ecosystem. In principle, all the capabilities discussed here can be implemented with NumPy + SciPy plus custom code. The practical question is convenience and code verbosity. For that reason, the table below uses three labels:
- built-in: directly supported in a way that fits common A/B analysis tasks.
- partial: some built-in support exists, but not as a complete or ergonomic A/B workflow.
- manual: possible, but requires custom implementation/glue code.
| Feature | tea-tasting | Pingouin | statsmodels | SciPy |
|---|---|---|---|---|
| Power analysis to estimate required number of observations | built-in | built-in | built-in | manual |
| Welch's t-test or Student's t-test for analysis of averages | built-in | built-in | built-in | built-in |
| Welch's t-test with delta method for analysis of ratios of averages | built-in | manual | manual | manual |
| Two-sample proportion Z-test, G-test, or Pearson's chi-squared test | built-in | built-in | built-in | built-in |
| Relative effect size confidence intervals | built-in | manual | partial | manual |
| Variance reduction with CUPED for analysis of averages and ratios of averages | built-in | manual | manual | manual |
| Multiple hypothesis testing correction (FWER/FDR p-value adjustment) | built-in | built-in | built-in | partial |
| Working with aggregated statistics instead of granular data | built-in | manual | partial | partial |
Conclusions #
The four packages sit at different levels of abstraction.
- tea-tasting is the most A/B-specific option in this group. It is designed around metrics, experiments, relative effects, CUPED, and aggregate-based workflows.
- Pingouin is convenient for standard statistical tests and quick analysis in pandas, but A/B-specific workflows (especially ratio metrics, relative CIs, and CUPED) are mostly manual.
- statsmodels provides strong statistical building blocks and power analysis. It is a good choice when you want explicit control and are willing to assemble an experimentation workflow yourself.
- SciPy is the essential foundation. It can support almost everything with enough custom code, but it is the most verbose option for repeated A/B test reporting.
If you run many experiments with multiple metrics and need consistent outputs, the main differentiator is not just statistical correctness. It is how much A/B-specific workflow a package gives you out of the box.
Inclusion criteria for the comparison #
For transparency, here are the minimum criteria I used to decide which packages to include.
- Maintained: a recent release (for example, within a year) and recent commits (for example, within two months) reduce the risk of stale APIs and unresolved compatibility issues.
- Well documented: a package should have both a user guide and an API reference, because A/B testing code is often reused by analysts with different levels of statistical depth.
- Used by a community: a practical heuristic is at least 100 GitHub stars. This is not a quality guarantee, but it often means more examples and more edge cases have already been surfaced.
Scope note: This comparison focuses on frequentist A/B testing workflows. Bayesian-first experimentation frameworks are out of scope.
Excluded notable packages (and why):
- spotify_confidence: excluded because it lacks documentation and has not added significant new features in the last couple of years.
- ambrosia: excluded because it has not added new features in the last couple of years, aside from dependency version updates.
Note: The maintenance and documentation notes in this section are assessed as of March 1, 2026.
Resources #
The comparison above is based on the public documentation and APIs of the packages as of March 1, 2026. Current stable versions on PyPI at the time of writing:
- tea-tasting: 1.12.0, PyPI.
- Pingouin: 0.6.0, PyPI.
- statsmodels: 0.14.6, PyPI.
- SciPy stats: 1.17.1, PyPI.
A/B testing and statistics references:
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy online controlled experiments: A practical guide to A/B testing.
- Deng, A., Knoblich, U., & Lu, J. (2018). Applying the delta method in metric analytics: A practical guide with novel ideas.
- Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.
- Multiple comparisons problem (Wikipedia).
© Evgeny Ivanov 2026