tea-tasting: a Python package for the statistical analysis of A/B tests
a/b testing statistics tea-tasting python
Note: this post was updated on March 7, 2026. AI chatbots and agents often refer to it without recognizing that some of the information could be outdated, so I decided to update it.
I developed tea-tasting, a Python package for the statistical analysis of A/B tests featuring:
- Welch's t-test, Student's t-test, z-test, Bootstrap, variance reduction with CUPED, delta method for ratio metrics, power analysis, and other A/B-test-specific statistical methods and approaches out of the box.
- Support for a wide range of data backends, such as BigQuery, ClickHouse, PostgreSQL, Snowflake, Trino, and other backends supported by Ibis. tea-tasting also accepts dataframes supported by Narwhals: cuDF, Daft, Dask, DuckDB, Modin, pandas, Polars, PyArrow, PySpark.
- Convenient API for reducing manual work, and a framework for minimizing errors. Extensible API: define custom metrics and use statistical tests of your choice.
- Pretty representation of analysis results: rounding to significant digits, rendering in terminals, Jupyter/IPython, and marimo notebooks, serialization to Markdown, and conversion to pandas and Polars DataFrames.
- Detailed documentation.
In this blog post, I explore each of these advantages of using tea-tasting in the analysis of experiments.
If you are eager to try it, check the user guide.
Statistical methods #
tea-tasting includes statistical methods and techniques that cover most of what you might need in the analysis of experiments.
Analyze metric averages with t-test or z-test, proportions with asymptotic or exact tests, ranks with the Mann-Whitney U test, or use Bootstrap to analyze any other statistic of your choice. There is also a predefined method for analyzing quantiles with Bootstrap. tea-tasting detects mismatches in the sample ratios of different variants of an A/B test. You can also define a custom metric with a statistical test of your choice.
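tea-tasting runs the sample-ratio check for you; as an illustration of the underlying idea (a SciPy sketch on made-up counts, not tea-tasting's implementation), under a 50/50 split the observed group sizes follow a binomial distribution, so a mismatch can be detected with a binomial test:

```python
from scipy import stats

# Hypothetical user counts per variant; with a 50/50 split, the number of
# users in variant A should follow Binomial(n, 0.5).
users_a, users_b = 5_071, 4_929
n = users_a + users_b
pvalue = stats.binomtest(users_a, n, p=0.5).pvalue
print(f"SRM p-value: {pvalue:.3f}")  # a small p-value signals a mismatch
```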
tea-tasting applies the delta method to analyze ratios of averages, such as the average number of orders divided by the average number of sessions, assuming that sessions are not the randomization units.
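As an illustration of the underlying math (a NumPy sketch on synthetic data, not tea-tasting's code), the delta-method standard error of a ratio of two per-user means requires only variances and a covariance:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
sessions = rng.poisson(2, size=n) + 1  # per-user sessions (synthetic)
orders = rng.binomial(sessions, 0.25)  # per-user orders (synthetic)

# Ratio metric: mean(orders) / mean(sessions); users are the randomization units.
ratio = orders.mean() / sessions.mean()

# Delta-method variance of a ratio of two sample means:
# var(X/Y) ~ (var(X) - 2*r*cov(X, Y) + r^2*var(Y)) / (mean(Y)^2 * n)
cov = np.cov(orders, sessions)
var_ratio = (
    cov[0, 0] - 2 * ratio * cov[0, 1] + ratio**2 * cov[1, 1]
) / (sessions.mean() ** 2 * n)
se = np.sqrt(var_ratio)
print(ratio, se)
```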
Use pre-experiment data, metric forecasts, or other covariates to reduce variance and increase the sensitivity of an experiment. This approach is also known as CUPED or CUPAC. In tea-tasting, it can be easily combined with the delta method for analyzing ratio metrics as well.
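The core of CUPED fits in a few lines. Here is a minimal NumPy sketch on synthetic data (illustrative only, not tea-tasting's implementation): the covariate's contribution is estimated with an OLS-style slope and subtracted from the metric, which lowers the variance without shifting the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
pre = rng.gamma(2.0, 2.5, size=n)               # pre-experiment revenue (covariate)
post = 0.6 * pre + rng.gamma(2.0, 1.0, size=n)  # in-experiment revenue

# CUPED: subtract the part of the metric explained by the covariate.
theta = np.cov(post, pre)[0, 1] / pre.var(ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(post.var(ddof=1), post_cuped.var(ddof=1))  # variance drops, mean is unchanged
```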
The calculation of confidence intervals for percentage change in t-test and z-test can be tricky. Just taking the confidence interval for absolute change and dividing it by the control average will produce a biased result. tea-tasting applies the delta method to calculate the correct interval.
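For intuition, here is one textbook way to build such an interval with the delta method on synthetic data (a sketch of the general technique, not necessarily the exact formula tea-tasting uses):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.exponential(5.0, size=8_000)    # synthetic per-user revenue
treatment = rng.exponential(5.4, size=8_000)

mc, mt = control.mean(), treatment.mean()
var_mc = control.var(ddof=1) / control.size    # variance of the control mean
var_mt = treatment.var(ddof=1) / treatment.size  # variance of the treatment mean

# Delta method for the relative change mt/mc - 1 (independent samples):
# var(mt/mc) ~ var(mt)/mc^2 + mt^2 * var(mc)/mc^4
rel = mt / mc - 1
se = np.sqrt(var_mt / mc**2 + mt**2 * var_mc / mc**4)
z = stats.norm.ppf(0.975)
print(f"{rel:+.1%} [{rel - z*se:+.1%}, {rel + z*se:+.1%}]")
```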
Analyze statistical power for t-test and z-test. There are three possible options:
- Calculate the effect size, given statistical power and the total number of observations.
- Calculate the total number of observations, given statistical power and the effect size.
- Calculate statistical power, given the effect size and the total number of observations.
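These three quantities are tied together by the standard relations for a two-sided z-test. A self-contained sketch (textbook formulas, assuming a 50/50 split and a standardized effect size; not tea-tasting's internals):

```python
import numpy as np
from scipy import stats

alpha = 0.05
z_alpha = stats.norm.ppf(1 - alpha / 2)

# Two-sided z-test with a 50/50 split; d is the standardized effect size
# (absolute effect divided by the pooled standard deviation).

def power(d, n_total):
    """Power, given the effect size and the total number of observations."""
    return stats.norm.sf(z_alpha - abs(d) * np.sqrt(n_total / 4))

def n_total(d, pwr):
    """Total number of observations, given the effect size and power."""
    return 4 * ((z_alpha + stats.norm.ppf(pwr)) / abs(d)) ** 2

def effect_size(n, pwr):
    """Minimal detectable effect, given power and the total number of observations."""
    return (z_alpha + stats.norm.ppf(pwr)) * np.sqrt(4 / n)

n = n_total(0.1, 0.8)
print(round(n))       # observations needed to detect d = 0.1 at 80% power
print(power(0.1, n))  # ~0.8: the three functions are consistent with each other
```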
Other A/B-test-specific features of tea-tasting:
- Procedures for controlling the multiple hypothesis testing problem:
- False discovery rate (FDR): Benjamini-Hochberg and Benjamini-Yekutieli procedures.
- Family-wise error rate (FWER): Hochberg's step-up and Holm's step-down procedures.
- Simulations:
- A/A tests are useful for identifying potential issues before conducting the actual A/B test.
- Treatment simulations are great for power analysis, especially when you need a specific uplift distribution or when an analytical formula doesn't exist.
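As an illustration of what an FDR procedure does (a standalone NumPy sketch, not tea-tasting's API), Benjamini-Hochberg adjusted p-values can be computed like this:

```python
import numpy as np

def adjust_fdr(pvalues):
    """Benjamini-Hochberg adjusted p-values (classic step-up procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value down, cap at 1.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1)
    out = np.empty(m)
    out[order] = adjusted
    return out

print(adjust_fdr([0.001, 0.008, 0.039, 0.041, 0.6]))
```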
Learn more in the detailed user guide.
Data backends #
There are many different databases and engines for storing and processing experimental data. And in most cases it's not efficient to pull the detailed experimental data into a Python environment. Many statistical tests, such as t-test or z-test, require only aggregated data for analysis.
For example, if the raw experimental data are stored in ClickHouse, it's faster and more efficient to calculate counts, averages, variances, and covariances directly in ClickHouse rather than fetching granular data and performing aggregations in a Python environment.
Querying all the required statistics manually can be a daunting and error-prone task. For example, analysis of ratio metrics and variance reduction with CUPED require not only the number of rows and variance, but also covariances. But don't worry: tea-tasting does all this work for you.
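To make this concrete: for Welch's t-test, six numbers per comparison are enough. A sketch with made-up aggregates, as a SQL backend could return them (SciPy's ttest_ind_from_stats performs the same computation):

```python
from scipy import stats

# Hypothetical per-variant aggregates: user count, mean, and sample
# variance of revenue per user, computed in the database.
n_a, mean_a, var_a = 5_000, 5.24, 54.2
n_b, mean_b, var_b = 5_000, 5.73, 61.5

# Welch's t-test needs nothing but these six numbers.
se2 = var_a / n_a + var_b / n_b
t = (mean_b - mean_a) / se2**0.5
df = se2**2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
pvalue = 2 * stats.t.sf(abs(t), df)

# SciPy exposes the same computation from aggregates directly:
res = stats.ttest_ind_from_stats(
    mean_b, var_b**0.5, n_b, mean_a, var_a**0.5, n_a, equal_var=False
)
print(t, pvalue, res.pvalue)
```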
tea-tasting supports Ibis Tables as input data. Ibis is a Python package which serves as a DataFrame API to various data backends. It supports many backends including BigQuery, ClickHouse, PostgreSQL, Snowflake, and Trino. You can write an SQL query, wrap it as an Ibis Table, and pass it to tea-tasting.
Keep in mind that tea-tasting assumes that:
- Data is grouped by randomization units, such as individual users.
- There is a column indicating the variant of the A/B test (typically labeled as A, B, etc.).
- All necessary columns for metric calculations (like the number of orders, revenue, etc.) are included in the table.
Some statistical methods, like Bootstrap, require granular data for the analysis. In this case, tea-tasting fetches the detailed data as well.
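For intuition, here is a minimal percentile-bootstrap sketch on synthetic granular data (illustrative only, not tea-tasting's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
revenue = rng.lognormal(1.0, 1.2, size=1_000)  # granular per-user data (synthetic)

# Percentile bootstrap CI for the median: resample users with replacement
# and recompute the statistic on every resample.
resamples = rng.choice(revenue, size=(4_000, revenue.size), replace=True)
medians = np.median(resamples, axis=1)
ci = np.percentile(medians, [2.5, 97.5])
print(np.median(revenue), ci)
```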
tea-tasting also accepts dataframes supported by Narwhals: cuDF, Daft, Dask, DuckDB, Modin, pandas, Polars, PyArrow, PySpark.
Learn more in the guide on data backends.
Convenient API #
You can perform most of the statistical calculations above using just SciPy. In fact, tea-tasting uses it under the hood. What tea-tasting offers on top is a convenient higher-level API.
It's easier to show than to describe. Here is the basic example:
import tea_tasting as tt
data = tt.make_users_data(rng=42)
experiment = tt.Experiment(
    sessions_per_user=tt.Mean("sessions"),
    orders_per_session=tt.RatioOfMeans("orders", "sessions"),
    orders_per_user=tt.Mean("orders"),
    revenue_per_user=tt.Mean("revenue"),
)
result = experiment.analyze(data)
result
# metric control treatment rel_effect_size rel_effect_size_ci pvalue
# sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
# orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
# orders_per_user 0.530 0.573 8.0% [-2.0%, 19%] 0.118
# revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123

The two-stage approach, with separate parametrization and inference, is common in statistical modeling. This separation makes the code more modular and easier to understand.
tea-tasting performs calculations that can be tricky and error-prone:
- Analysis of ratio metrics with delta method.
- Variance reduction with CUPED/CUPAC (also in combination with the delta method for ratio metrics).
- Calculation of confidence intervals for both absolute and percentage change.
- Analysis of statistical power.
It also provides a framework for representing experimental data to avoid errors. Grouping the data by randomization units and including all units in the dataset is important for correct analysis.
Check out the comparison post to see how tea-tasting differs from general-purpose statistical packages like Pingouin, statsmodels, and SciPy.
Pretty representation and formatting #
tea-tasting reduces the work of formatting and presenting analysis results. It rounds numbers to significant digits and renders the result table automatically based on where you display it:
- In a terminal or Python console, the result is shown as a plain-text table (as in the example above).
- In IPython and Jupyter, the result is rendered as an HTML table.
- In marimo notebooks, the result is rendered as a table widget.
In addition, tea-tasting provides methods to serialize the experiment result. Notable examples:
- to_pandas: Convert the result to a pandas DataFrame.
- to_polars: Convert the result to a Polars DataFrame.
- to_markdown: Convert the result to a Markdown table.
- to_string: Convert the result to a string (with parameters for output customization).
Documentation #
Last but not least: documentation. I believe that good documentation is crucial for tool adoption. That's why I wrote several user guides and an API reference.
I recommend starting with the example of basic usage in the user guide. Then you can explore specific topics in the dedicated guides.
Use the API reference to explore all parameters and detailed information about the functions, classes, and methods available in tea-tasting.
The tea-tasting repository includes examples as copies of the guides in the marimo notebook format. See the corresponding README to learn how to run them.
Conclusions #
There are a variety of statistical methods that can be applied in the analysis of an experiment. But only a handful of them are actually used in most cases.
On the other hand, there are methods specific to the analysis of A/B tests that are not included in general-purpose statistical packages like SciPy.
tea-tasting's functionality includes the most important statistical tests, as well as methods specific to the analysis of A/B tests.
tea-tasting provides a convenient API that helps to reduce the time spent on analysis and minimize the probability of error.
In addition, tea-tasting optimizes computational efficiency by calculating the statistics in the data backend of your choice, where the data are stored.
With the detailed documentation, you can quickly learn how to use tea-tasting for the analysis of your experiments.
P.S. Package name #
The package name "tea-tasting" is a play on words that refers to two subjects:
- Lady tasting tea is a famous experiment devised by Ronald Fisher. In this experiment, Fisher developed the null hypothesis significance testing framework to analyze a lady's claim that she could discern whether the tea or the milk was added first to the cup.
- "tea-tasting" phonetically resembles "t-testing" or Student's t-test, a statistical test developed by William Gosset.
© Evgeny Ivanov 2024