Understanding Sample Sizes and the Word “Significant”
By Megan Sumeracki
When we run an experiment (for a review of different types of research methods, see this blog), we are rarely (if ever) able to collect data from the entire population that we are interested in. Instead we try to draw a “sample” that represents that population. The desirable sample size depends, in part, on the population you are interested in.
This Fall I am teaching a graduate seminar in cognition. In this class of Master’s students, some have experience with cognitive psychology and cognitive research, while others have not had much exposure to cognition in the lab or the classroom. This diversity has led to some very interesting conversations about the importance of sample sizes in research, and subsequently what we mean when we say an effect is “significant.” It seems to me that most tend to think larger sample sizes are always better. Shouldn’t we trust a study that had 1000 people in it more than one that had 100 people in it? As with most things in life, I argue that it depends.
There are definitely situations where large sample sizes are absolutely required. Certainly, if your goal is to determine differences between a number of subgroups and generalize (i.e., be able to apply) your findings to the population of an entire country or the world, then yes, you need a very large sample size. Someone who is interested in understanding differences across 20 different cultures is absolutely going to need more than a handful of participants; 100 people isn’t going to cut it. Instead, you would need a much larger sample that is extremely diverse.
One big issue related to sample size requires us to talk about what the word significance means in a scientific context. In “the real world,” significant means noteworthy, or worthy of attention. However, this is not what scientists typically mean when they say significant. Often, we are talking about statistical significance, and this is a totally different thing. When we say a finding is statistically significant, what we typically mean is that two groups (or more) were found to be different, and we’re willing to say that the difference is unlikely to be due to chance.
Here’s a completely made-up concrete example: Imagine we want to see whether an extra 30 minutes in college classes improves students’ grades in those classes. One team of researchers randomly assigns 200 students to stay an extra 30 minutes in class, and another 200 students to leave at the normal time. Imagine the researchers find a small difference between the two groups, but it is not statistically significant. They conclude that there is no reason to believe that additional time in class improves students’ grades. Now, imagine another team of researchers conducts the exact same study, only this time they randomly assign 2,000 students to stay an extra 30 minutes and 2,000 students to leave at the normal time. Imagine this team of researchers does find that the group that stays in class for an extra 30 minutes earns (statistically) significantly higher grades than the group that leaves at the normal time. This means that the difference between these two groups is not likely due to chance. If the extra time truly made no difference, the probability of seeing a gap this large just by accident would be very low. So low, in fact, that scientists are willing to say the finding is “significant.”
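To make the made-up example a bit more concrete, here is a minimal sketch in Python. It assumes both teams observe the same small standardized difference between groups (0.1 standard deviations, a number chosen purely for illustration and not taken from any real study) and uses a simple normal approximation to compute the two-sided p-value. The function name and the assumed effect size are hypothetical.

```python
import math

def two_sided_p(d, n_per_group):
    """Two-sided p-value for an observed standardized difference d
    between two equal-sized groups, using a normal approximation."""
    # With equal group sizes, the z statistic is d * sqrt(n / 2).
    z = d * math.sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) is the two-tailed area under the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical effect size for illustration only (not from the post).
ASSUMED_D = 0.1

for n in (200, 2000):
    print(f"n per group = {n}: p = {two_sided_p(ASSUMED_D, n):.4f}")
```

Under these assumptions, the same observed difference gives a p-value of roughly 0.32 with 200 students per group and roughly 0.002 with 2,000 per group. The difference itself has not changed; only the amount of data has.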
The issue here is that statistical significance does not signify a large or meaningful effect. In the fictitious example above, the first team of researchers may have missed the effect because the effect size is very small, and there weren’t enough participants in their study to detect it. All things being equal, the smaller the effect, the greater the sample size we need to find it. But there does come a point where, at least for applied research, an effect is so small that it is not meaningful. If 30 minutes extra in class is enough to increase students’ grades by 2%, is it worth the extra 30 minutes? What about 1%? Even less? Are there other things we could do in the classroom that would take less time and improve grades even more? The greater the sample size, the more likely we are to find a statistically significant difference between groups, but that doesn’t mean the effect we find is meaningful. With large enough sample sizes, we can find statistically significant differences between basically any two groups, however tiny the true difference. (For more on this, see this article.)
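The relationship between effect size and sample size can be sketched with a standard textbook approximation for comparing two group means. The sketch below assumes a two-sided test at the conventional 0.05 level and 80% power; those settings, the function name, and the effect sizes in the loop are all illustrative assumptions rather than anything from a particular study.

```python
import math
from statistics import NormalDist

def required_n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a
    standardized effect size d (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # Common approximation: n = 2 * ((z_alpha + z_power) / d)^2, rounded up.
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Hypothetical effect sizes, from moderate to very small.
for d in (0.5, 0.2, 0.1, 0.05):
    print(f"effect size {d}: about {required_n_per_group(d)} per group")
```

Because the required sample grows with the inverse square of the effect size, halving the effect roughly quadruples the number of participants needed: about 63 per group for an effect of 0.5, but on the order of 1,570 per group for an effect of 0.1. Whether an effect that small is worth acting on is a separate question that the statistics cannot answer.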
Another thing to keep in mind while evaluating research findings is that a study with a large sample is not necessarily a study that is more generalizable, or applies to a more diverse group of people. This is because simply increasing the sample size does not necessarily mean that the study will have a diverse sample. For example, imagine a study that is conducted at an elite private high school with only girls aged 15-16. In this case increasing the sample size from 100 to 1000 is not going to allow the researchers to generalize much past girls aged 15 to 16 at an elite private school. This is not to say this study would not be valuable; if this is the population of interest and the research question is important, then so long as the study is designed and executed well the results should be informative. But in this case, a sample of 1000 girls is not necessarily better than a sample of 100 girls.
In addition to the points already mentioned, there are tons of other factors that need to be considered when evaluating and interpreting research to see if a result is meaningful and whether it should be applied in a given setting. How homogeneous (similar) is our sample and the population of interest? How much error or random variation is inherent in what we’re measuring (for example, test performance) and how we measure it (for example, in a multiple-choice test)? How many trials or repetitions are there (for example, questions on a test or ways we are assessing something)? What is the design of the study – are all participants doing all conditions, or are different groups of participants doing each one? The factors seem endless. Researchers also have to consider what types of statistics we are using, whether we are paying close attention to effect sizes, and the precision with which we can measure those effect sizes. Are there going to be replications, and are we presenting all of the data, even those results that don’t show an effect? What about meta-analysis procedures? These are all issues perhaps to be discussed in another blog!
We often have this idea that more is always better, and when we learn about basic research methods in high school or college, often this rule of thumb is taught in place of the extremely nuanced reality. However, as with many things, it really depends!