How Many Samples Do I Need? Determining Sample Size for Statistically Significant Results
Anyone who has designed anything -- whether that be a new medicine, a design method, or even a new recipe -- has faced the question: “Is this better than what I had before?” If you’re just deciding whether or not you like a new recipe, getting an answer is straightforward. But if you are in an academic or industrial setting, you must also answer an even more important question: “Can I prove that this is better?”
Whether this process involves interviews or experiments, designers often need to show with statistical significance that their new design is an improvement over the old. Statistical significance requires a sample size large enough to show a meaningful difference between two groups. Picking a sample size is an exercise in walking a fine line: you want the smallest, least costly sample that doesn’t risk inconclusive or statistically insignificant results that waste the experiments. The sample size you choose needs to be in the “Goldilocks zone”: not too large, not too small, but just right.
Thankfully, picking a sample size doesn’t have to be a guessing game. The statistical tool of power analysis determines how large a sample size is needed based on a few statistical parameters. In this article, we explore how to perform a power analysis to pick a sample size, and what can be done to increase statistical power when the available sample sizes are limited.
Many online calculators exist for doing power analysis calculations automatically. You can check out two of my favorites here [1] and here [2]. While you will likely use a calculator, it’s important to understand what’s going into the calculator.
Power Analysis Statistics
The basis of power analysis is that if two sample means are close, it’s likely the underlying population distributions have a lot of overlap. This means the samples probably came from the same source, or from two sources that aren’t significantly different. If the two sample means are far apart, it’s likely that there isn’t much overlap between the population distributions, and that the samples came from two significantly different sources.
Figure 1 illustrates two hypothetical distributions H0 and H1, which could represent two different materials, two design processes, or results from a control group versus a treatment group. Here we see that the means are spread quite far apart, but due to the standard deviations of the sample distributions, there is a fair amount of overlap between the two. Power analysis is a statistical tool for determining whether the difference in performance between the two distributions is more likely due to real differences or to randomness, and it helps us predict how big of a sample size we will need to show the two distributions are actually different. Power analysis calculations take three main inputs: statistical power, a significance threshold, and an effect size, which is a summary variable describing the overlap.
Statistical power is the likelihood that an experiment correctly rejects the null hypothesis (the assumption that two sources aren’t different) and determines that the two sources are different. If you can show that two sources are different, and one has a higher value than the other, then you can state that one source is better than the other. A typical value for statistical power is 0.8, or 80%. Some sample size calculators instead use beta, which is equal to 1 - power and represents the probability of a false negative. Alpha is the threshold for significance and is equal to the probability of getting the same results purely by chance (a false positive); alpha is typically set to 0.05. In Figure 1, power is the area under the H1 curve to the right of the critical t-score or z-score. This critical t- or z-score is the value beyond which the area under the H0 distribution is equal to alpha.
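To make these quantities concrete, here is a minimal sketch in Python (using the statsmodels library, an assumption on my part; any power-analysis package will do) that computes the power and beta of a two-sample t-test for a chosen effect size, per-group sample size, and alpha:

```python
from statsmodels.stats.power import TTestIndPower

# Power of a two-sided, two-sample t-test given:
#   effect_size -- standardized difference between the two means
#   nobs1       -- observations in each group
#   alpha       -- significance threshold (false-positive rate)
analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=64, alpha=0.05)
beta = 1 - power  # probability of a false negative

print(f"power = {power:.3f}, beta = {beta:.3f}")  # roughly 0.80 and 0.20
```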
Effect size is a way to summarize the difference between two distributions. It is equal to the difference in the means divided by the pooled standard deviation (this is Cohen’s d [3]). While effect size is hard to visualize, one way to think about it is in terms of the overlap between distributions: if the overlap is small, the effect size is large; if the overlap is large, the effect size is small. If you have data from previous experiments, you can easily plug in the sample means and standard deviations to get your sample size, as in the sketch below. If you don’t, there are several ways of estimating the effect size.
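If you do have summary statistics from prior experiments, the calculation is mechanical. The sketch below (the sample statistics are made up purely for illustration) computes Cohen’s d from two groups’ means and standard deviations using the pooled standard deviation, then solves for the required sample size per group:

```python
import math

from statsmodels.stats.power import TTestIndPower

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Effect size: difference in means over the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical pilot data comparing a new design against the old one
d = cohens_d(mean1=7.2, sd1=1.5, n1=10, mean2=6.1, sd2=1.4, n2=10)

# Per-group sample size for 80% power at alpha = 0.05
n = TTestIndPower().solve_power(effect_size=d, power=0.8, alpha=0.05)
print(f"d = {d:.2f}, sample size = {math.ceil(n)} per group")
```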
What to do When You Don’t Have Enough Data
Running a preliminary experiment is the ideal choice: it provides actual information about the effect size, which greatly reduces the risk associated with estimating a sample size. However, performing these experiments may be costly or impossible, such as when a very large preliminary sample would be needed to determine the effect size, or when a sample size must be specified before experiments for funding or experiment approval.
Thankfully, there are some reliable options for estimating required sample sizes. Many fields use standard effect size estimates based on whether they believe the difference between two things is small (0.2), medium (0.5), or large (0.8) [3]. Small effect sizes require large samples to prove significance, while large effect sizes need only small samples. Effect size values of 0.2, 0.5, and 0.8 originated in the behavioral sciences but have wide application. These effect sizes correspond to required sample sizes of 394, 64, and 26 per group, respectively, as the sketch below reproduces. Other fields have their own effect size standards.
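As a sanity check, these standard sample sizes can be reproduced in a few lines (a sketch assuming a two-sided, two-sample t-test at 80% power and an alpha of 0.05):

```python
import math

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    # Solve for the per-group sample size that reaches 80% power
    n = analysis.solve_power(effect_size=d, power=0.8, alpha=0.05)
    print(f"{label} effect (d = {d}): n = {math.ceil(n)} per group")
# Prints 394, 64, and 26 -- matching the standard values above
```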
When standards aren’t available, we can typically make a pretty good guess about the difference between the means of two distributions and about the variability. Some areas, such as slightly improved medical treatments, have high variability and small effect sizes. Design methods typically have high variability due to human factors, which can make it difficult to show that a difference between two methods isn’t due to chance. For this article I researched the effect sizes in Design Theory & Methodology papers and found that many design methodologies have effect sizes much larger than the typical estimates of 0.2, 0.5, and 0.8 (see Figure 2).
Illustrating the Effects of Distance Between Means and Variability on Sample Size
The greater the difference in performance between two design methods, the smaller the sample size needed to show a difference. If you are developing a method, product, or treatment, improvements in your design not only lead to better results, they also make it easier to prove that one method is better than another. Another factor to consider is how you measure the difference between groups. If you measure performance on a 1-4 rating scale, you are unlikely to show a strong difference because the measurement doesn’t allow for large differences. So if you are having a hard time showing statistical significance, a change in how you measure performance might be what you need.
The other factor in effect size is variability, which is measured by the pooled standard deviation. The greater the variation within a group, the larger the sample size needed to show a difference. This variability always exists and can come from a variety of sources. Manufactured parts typically have very little variability, while experiments involving humans typically have much higher variability because people are so varied. While some level of variability will always exist, it can be reduced by controlling for confounding factors and by running experiments more carefully.
Conclusion
Hopefully this article has helped take the guesswork out of picking a sample size. Answering the two big questions, “Is this better than what I had before?” and “Can I prove that this is better?”, is much easier when you have the statistical measures to support your findings.
References
[1] Power and Sample Size Calculator, https://www.gigacalculator.com/calculators/power-sample-size-calculator.php
[2] Inference for Means: Comparing Two Independent Samples, https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html
[3] Cohen, Jacob. Statistical Power Analysis for the Behavioral Sciences. Academic Press, 2013.