What is a Statistically Significant Result?

An Introduction to the Statistics Behind Statistical Significance in Online Experiments

Businesses and organizations rely on online experimentation to make data-driven decisions in the digital space: changes to their websites, emails, social media, and more are rolled out based on significant experiment results. While decision-makers know they need significant results to implement a change, many people do not know what statistical significance actually is or how (statistically) it is obtained.

Understanding the meaning behind the numbers you're looking at matters even if you are not a data scientist or analyst, because non-technical people often find themselves in a position to make a decision or present to a client, and it is very hard to do either without understanding the concepts yourself.

In short, a statistically significant result is one that is very unlikely to have occurred by random chance alone, suggesting the observed change was driven by a particular cause: the change that you are testing.

This post gives a high-level intro to some of the core concepts surrounding statistical experiments and more specifically understanding how statistically significant results are achieved.

When running an online experiment, you're testing the impact of changing some element. For simplicity's sake, let's say it's the text on a button on a webpage and you want to see if the new version of the text leads to more clicks. From a design perspective you would actually be testing two things: the new text (the treatment group) as well as the original text, which serves as a control or baseline. This is called an A/B design. You serve the two variants of the webpage to two different groups of users, and after the experiment is done, you compare the number of clicks in the treatment group to the baseline under the assumption that there is no difference between them.
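As a rough illustration of how users might be split between the two variants, here is a minimal sketch of deterministic 50/50 bucketing. The experiment name and hashing scheme are hypothetical choices for the example, not a prescribed implementation; in practice an experimentation platform usually handles assignment for you.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "button-text-test") -> str:
    """Deterministically bucket a user into control or treatment (50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # The low bit of the hash decides the bucket, so the same user always
    # sees the same variant for this (hypothetical) experiment name.
    return "treatment" if int(digest, 16) % 2 else "control"
```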

Depending on how extreme the difference between the two is, you can then determine whether the effect of the new text is statistically significant, meaning whether you can claim the difference was due to the implemented change.

Let’s dive into each of the factors that are important to understanding statistical significance:

  • Characterization of the Metric
  • Null and Alternative Hypotheses
  • Statistical Power and Sensitivity (Variance and Error Types)
  • Power Analysis
  • P-value

Characterization of the Metric

When running an online experiment to test out a new feature, you need a way to measure the impact of implementing said feature. The metric is simply the measurement of interest that you want to see changed (or increased) as a result of your experiment.

For example, if you change the text on a call-to-action button, you might want to look at how the number of clicks changes with the treatment. However, since the number of users viewing the webpage may differ per variant of your experiment, simply summing clicks per variant is not an apples-to-apples comparison. It is better to use a normalized version of your metric based on sample size that can be compared across variants, e.g., click rate rather than the raw number of clicks. Other common metrics include conversion rate, revenue-per-user, average time spent on a page, etc.
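As a minimal sketch with made-up counts, here is why the normalized click rate is comparable across variants while raw click totals are not:

```python
# Hypothetical results: raw clicks and unique visitors per variant.
clicks = {"control": 1450, "treatment": 1630}
visitors = {"control": 50_000, "treatment": 51_200}

# Raw totals are not comparable because each variant saw a different number
# of visitors; the per-visitor click rate is.
click_rate = {variant: clicks[variant] / visitors[variant] for variant in clicks}
print(click_rate)  # {'control': 0.029, 'treatment': ~0.0318}
```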

For experimentation purposes, the metric is described by its baseline mean (or some other summary statistic) and its standard error, which captures how variable the estimate of the metric will be. The variability of the metric is important for calculating accurate sample sizes needed for the experiment as well as the statistical significance during analysis.

Note: Since you have multiple samples of data (the control versus your treatment), you need to calculate metrics for each sample. You will calculate the sample mean, which is the mean of each variant, and the standard error, which estimates how far the sample mean tends to fall from the population mean (where the population is the audience that your experiment will make conclusions about).
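Here is a small sketch of those two quantities, assuming the metric is a per-user click indicator; the data are simulated, not real traffic:

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated per-user click indicators for one variant (1 = clicked, 0 = did not).
sample = rng.binomial(1, p=0.03, size=50_000)

sample_mean = sample.mean()  # estimate of the variant's click rate
# Standard error: how far the sample mean tends to fall from the population mean.
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
print(sample_mean, standard_error)
```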

Null and Alternative Hypotheses

In online experiments, we use hypothesis testing, where we begin with the null hypothesis that there is no difference between the variant's metric and the baseline metric. We then calculate the probability of observing the difference between the variants' metrics, given that both samples come from the same population, that is, given that the null hypothesis of no difference holds.

In general, a hypothesis is your prediction of the outcome of an experiment. In this case, we start from the assumption that the metric will be the same for both the control and treatment variants, and we test whether the treatment metric differs enough to reject that null hypothesis of no effect.

  • Null Hypothesis: There is no effect in the population.
  • Alternative Hypothesis: There is an effect in the population.

When we find a difference that is highly unlikely to occur under the null hypothesis, we can reject the null hypothesis that the variant and baseline metrics are equal and accept the alternative hypothesis: the difference in the treatment metric is statistically significant.

When we find a difference that is likely to occur given the null hypothesis, we fail to reject the null hypothesis since there is not sufficient evidence to support the alternative hypothesis that there is a significant difference.
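To make the "same population" assumption concrete, here is a minimal sketch of a two-proportion z-test computed by hand on hypothetical counts; under the null hypothesis the two variants are pooled to estimate a single shared click rate, and the observed difference is measured against that assumption.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical observed clicks and visitors for each variant.
clicks_control, n_control = 1_500, 50_000
clicks_treatment, n_treatment = 1_560, 50_000

# H0: both variants share the same click rate; H1: the click rates differ.
p_control = clicks_control / n_control
p_treatment = clicks_treatment / n_treatment

# Under H0, pool both samples to estimate the shared rate and its standard error.
p_pooled = (clicks_control + clicks_treatment) / (n_control + n_treatment)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))

z = (p_treatment - p_control) / se  # how extreme the observed difference is
p_value = 2 * norm.sf(abs(z))       # two-sided probability under H0
```

A very small p-value here is exactly the "highly unlikely under the null" situation described above.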

Statistical Power (Sensitivity)

Statistical power, or sensitivity, is the probability of detecting a meaningful difference where there really is one.

The smaller the variability of the metric – the tighter the spread of data around the metric – the better sensitivity there will be to detect a significant outcome. Therefore, the power (sensitivity) of an experiment can be improved by reducing the variance of the metric.

Variance

What is variance? Like the standard error of the mean, variance is a measure of variability. While the standard error of the mean describes how far the sample mean tends to deviate from the true population mean, variance describes the spread of the data itself.

Before reducing variance to improve sensitivity, you have to be able to calculate the variance of the metric correctly:

  • First compute the variance of the sample: var(Y) = sum((Yi - Ymean)^2) / (n - 1)
  • Then compute the variance of the average metric, which is just the sample variance divided by n: var(Y) / n
  • Here Yi denotes each individual value, Ymean is the mean of all values, and n is the number of values in the data (see the code sketch after this list).
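A minimal sketch of those two calculations in code; the metric values are placeholders:

```python
import numpy as np

def metric_variances(y):
    """Return (sample variance, variance of the sample mean) for metric values y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    var_y = np.sum((y - y.mean()) ** 2) / (n - 1)  # var(Y) = sum((Yi - Ymean)^2) / (n - 1)
    return var_y, var_y / n                        # var of the average metric = var(Y) / n

# Placeholder per-user metric values (e.g., clicked or not).
var_y, var_mean = metric_variances([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
```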

Calculating the variance correctly is important because an incorrect variance makes your p-value incorrect: an overestimate of the variance leads to false negatives, and an underestimate leads to false positives (we will discuss the p-value and error types in just a second).

Ways to reduce variance include: using an evaluation metric with a smaller variance (e.g., number of searchers rather than number of searches), triggered analysis to remove noise contributed by users not affected by the treatment, randomizing the data at a more granular level, conducting paired experiments, and more.
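As one illustration of the first idea, a binary "did the user search at all" metric typically has much smaller variance than the raw count of searches per user. The simulated, heavy-tailed distribution below is only an assumption for demonstration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated searches per user, heavy-tailed: a few users search a lot.
searches_per_user = rng.poisson(lam=rng.exponential(scale=2.0, size=100_000))

searcher = (searches_per_user > 0).astype(float)  # binary "searcher" metric

print(searches_per_user.var(ddof=1))  # large variance, inflated by heavy users
print(searcher.var(ddof=1))           # bounded by 0.25, typically much smaller
```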

Error Types

Another part of understanding statistical power, the probability of correctly detecting a real effect, is understanding the types of error that can occur.

  • Type I Error: Concluding there is a significant difference when there was not.
  • Type II Error: Concluding there is not a significant difference when there is one.

Since statistical power is the probability of detecting a difference when there really is one, statistical power is just 1 minus the Type II error rate (the probability of not detecting a significant result when there is one).

Statistical power is generally set to 80%, which means that if a real effect exists, a test with this power will detect it in roughly 80 out of 100 repetitions of the experiment. Higher power means a smaller chance of making a Type II error, so why not make power 100%? Pushing power higher requires ever-larger sample sizes and makes your experiment so sensitive that it may flag differences that are statistically significant but not actually useful.

Power Analysis

Power analysis is the technique used to determine the sample size needed for an experiment: the smallest sample that gives the required statistical power to detect an effect that is relevant to the business. Typically, what's considered relevant to the business is called the minimum detectable effect, the smallest lift in your metric of interest that you care about detecting. The minimum detectable effect should be determined by the business based on historical data and what practically makes sense.

The last factor in power analysis for sample size is the significance level or alpha value, which is the maximum risk you will allow in rejecting a true null hypothesis. This value is the Type I Error probability and is typically set to 0.05, which means that the results have to have less than a 5% chance of occurring assuming the null hypothesis is true, for the results to be considered statistically significant.

Power and significance are related: for a fixed sample size, a higher significance level leads to higher power, while a lower significance level leads to less sensitivity to true effects. Balancing statistical power (the complement of the Type II error rate) and the significance level (the Type I error rate) is key; setting these to 80% and 5% respectively is standard practice, but feel free to adjust both according to your experiment's needs.

In short, power analysis helps us determine the sample size that needs to be collected in an experiment in order to get a result that is both statistically and practically significant to the business. Note: collecting your planned sample size does not guarantee a statistically significant result; it just means you have gathered enough data to detect the minimum detectable effect, at the chosen power and significance level, if it exists.

For example, if you want to detect a 3% lift in click rate, the sample size that needs to be captured depends on the statistical power (80%), the significance level (5%), the minimum detectable effect (3%), and the baseline value of the metric. If you collect the pre-calculated sample size and get a statistically significant result, you can trust that result at the chosen error rates.
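One way to run that calculation is with statsmodels. In this sketch the 3% baseline click rate is an assumed value, and the 3% minimum detectable effect is treated as a relative lift on top of it:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.03                 # assumed baseline click rate
mde_relative = 0.03                  # minimum detectable effect: a 3% relative lift
treatment_rate = baseline_rate * (1 + mde_relative)

effect_size = proportion_effectsize(treatment_rate, baseline_rate)  # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # significance level
    power=0.80,            # statistical power
    ratio=1.0,             # equal-sized control and treatment groups
    alternative="two-sided",
)
print(round(n_per_variant))  # users needed in each variant
```

Small lifts on a small baseline rate require very large samples, which is why the minimum detectable effect has such a big impact on how long an experiment must run.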

P-value

P-value stands for probability value and is simply the probability of observing the change in the metric that you actually observed, or something more extreme, assuming the null hypothesis is true. To determine whether this change in metric is statistically significant, we compare the p-value to a threshold: the significance level (yes, the same significance level used in the power analysis).

If the p-value is less than the significance level, we say that the difference in the treatment group's metric is statistically significant; if it is greater, we say the difference is not statistically significant. Why does this work? Because the significance level is the probability threshold below which an outcome is considered too unlikely to be explained by chance alone, and it caps how often we will reject a null hypothesis that is actually true.

For example, let's say the probability of seeing a 3% (or larger) increase in click rate, assuming click rates are the same between the treatment and the baseline (the null hypothesis is true), is 0.02. An increase that unlikely still happened, and that is what makes the result statistically significant. If the p-value were 0.34, however, the observed increase would be fairly likely under the null hypothesis, so it would not be a significant result.
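In practice you rarely compute the p-value by hand. Here is a sketch of the same decision using statsmodels' two-proportion z-test on the same hypothetical counts used earlier:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results collected after reaching the planned sample size.
clicks = [1_500, 1_560]       # control, treatment
visitors = [50_000, 50_000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")
```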

In Summary

Before running an experiment, you first need to characterize the metric of interest, which should have as little variance as possible so that the experiment has enough power/sensitivity to detect a statistically significant result.

You then need to determine the amount of data you need to collect. Use power analysis to determine the sample size required to have statistically significant results based on the minimum detectable effect for your organization.

After running the test, calculate the p-value for the difference in the observed metric between the variant and baseline and compare it to the significance level to reject or fail to reject the null hypothesis.

Assuming the null hypothesis is true, the significance level is the threshold probability: anything that occurs with a lower probability is a significant outcome. Since the chance of that outcome occurring is so low, the fact that it happened anyway is what makes it a statistically significant result.

Hopefully, this post was able to introduce what statistical significance is and the core concepts surrounding it.

