A Statistics Primer for the Overwhelmed Educator

By Cindy Nebel

On Twitter this week, our good friend Blake Harvard (@effortfuleduktr) mentioned that he didn’t know what was meant by a repeated measures ANOVA when he read about it in papers.

Image from Twitter

I’ve spent the last ten years teaching behavioral science statistics, so I thought it might be worthwhile to write up a short description of some common statistical terms that appear in psychology papers to help overwhelmed educators who might want to interpret results but don’t have time to take a course in stats!

First, some basics: most statistical analyses are based on one very important question – is the effect bigger than what we might expect by chance? For example, suppose you had two groups of people with equal intelligence (let’s say an average IQ of 105 for both groups). If you randomly grabbed ten people from each group and compared their averages, they likely wouldn’t be EXACTLY equal, but they would be close (say 104 and 106). That difference is completely meaningless; it only exists because you happened to grab two slightly different sets of people. Statistics is a mathematical way of defining how big a difference would have to exist before we conclude that these groups are, in fact, different. That threshold depends on the actual difference between the averages (in other words, the mean difference) as well as how much variability we see in the population/sample. Put differently, does pretty much everyone have an IQ of 105 or are they all over the place? Less variability means that we wouldn’t expect a big difference between our groups just by chance. We can also ask if the relationship between two variables is stronger than we would expect by chance, but the same basic logic applies.
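
To see how big a "chance" difference can get, here is a minimal Python sketch that repeats the grab-ten-from-each-group exercise many times. The population standard deviation of 15, the 10,000 repetitions, and the random seed are my assumptions for illustration; only the mean of 105 and the group size of ten come from the example above.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN = 105    # from the example above
POP_SD = 15       # assumption: a typical IQ-style spread
GROUP_SIZE = 10   # from the example above


def chance_difference():
    """Draw two groups from the SAME population and return the gap between their means."""
    group_a = [random.gauss(POP_MEAN, POP_SD) for _ in range(GROUP_SIZE)]
    group_b = [random.gauss(POP_MEAN, POP_SD) for _ in range(GROUP_SIZE)]
    return statistics.mean(group_a) - statistics.mean(group_b)


differences = [chance_difference() for _ in range(10_000)]

# Average size of the gap that shows up purely by chance.
typical_gap = statistics.mean(abs(d) for d in differences)
# Share of runs where chance alone produced a gap of 10 points or more.
share_large = sum(abs(d) >= 10 for d in differences) / len(differences)

print(f"Typical chance gap: {typical_gap:.1f} IQ points")
print(f"Runs with a gap of 10+ points: {share_large:.1%}")
```

Try shrinking POP_SD: with less variability in the population, large gaps between the two groups become much rarer.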

And speaking of logic, statistical logic looks similar to the logic used in the American legal system – innocent until proven guilty. This exists because of a simple point of logic: it’s much easier to prove something false than to prove something true. Here’s an example: If you wanted to prove that I’m the shortest living psychologist, how many people would you need? To prove it true, you would have to find every single living psychologist to guarantee none were shorter. If you wanted to prove it false, you would only need one psychologist shorter than me (good luck!). So, this is why we have a null hypothesis: this is the assumption that there is no effect, no relationship, no difference and our goal is to find one piece of evidence to prove it wrong.

And then we get into the very close relationship between research design and the type of test we conduct. You can see a primer on research design here: https://www.learningscientists.org/blog/2018/3/8-1

Results sections in papers are full of letters. So let’s walk through some of these letters with the most basic explanation I can provide for each test.

z: A z-test is used to compare a sample mean to a specific value, usually a population mean. For example, let’s say I know that the average IQ is 100 and I want to see if I can increase it, so I provide an intervention and then measure IQ and get a sample mean of 107. I then use a z-test to compare 107 to 100 to see if the difference is bigger than what I would expect just by chance.
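
Here is a rough sketch of that calculation in Python. The means of 100 and 107 come from the example above; the population standard deviation of 15 and the sample size of 25 are assumptions I added so that there is something to compute.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 100  # from the example above
sample_mean = 107      # from the example above
population_sd = 15     # assumption: a z-test needs a known population SD
n = 25                 # assumption: number of people who got the intervention

# How far sample means typically wander from the population mean by chance.
standard_error = population_sd / sqrt(n)

# How many standard errors the sample mean sits from the population mean.
z = (sample_mean - population_mean) / standard_error

# Two-tailed p-value from the standard normal distribution.
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p = {p_two_tailed:.3f}")
```

With a bigger sample the standard error shrinks, so the same 7-point difference would produce a larger z.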

single-sample t: A single-sample t-test asks the same question as a z-test, except we don’t know the population variability, so we estimate it from the sample. We’re still asking if the sample mean differs from some specific value.

Independent-samples t: This test is used to compare the means of two different groups of people.

Repeated-measures or paired-samples t: This test is used to compare two means that come from the same group of people. Examples include a pre-test post-test design (not recommended, just used for example here) or a within-subjects design where, say, you’re comparing how all of your students perform on multiple choice questions compared to short answer questions on the final exam.
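
The three t-tests above can be sketched in a few lines of Python. This is only an illustration: the exam scores are made up, the independent-samples version assumes the two groups have similar variability (the pooled-variance form), and the code stops at the t-value because turning it into a p-value requires the t distribution, which statistical software handles for you.

```python
from math import sqrt
from statistics import mean, stdev, variance


def single_sample_t(sample, reference_value):
    """Compare one sample mean to a fixed value, estimating variability from the sample."""
    standard_error = stdev(sample) / sqrt(len(sample))
    return (mean(sample) - reference_value) / standard_error


def independent_samples_t(group_a, group_b):
    """Compare the means of two different groups of people (pooled-variance form)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error


def paired_samples_t(first_scores, second_scores):
    """Compare two scores from the same people by testing their differences against 0."""
    differences = [a - b for a, b in zip(first_scores, second_scores)]
    return single_sample_t(differences, 0)


# Made-up final exam scores for six students (same order in both lists).
multiple_choice = [82, 75, 90, 68, 88, 79]
short_answer = [78, 70, 85, 66, 80, 77]

t_paired = paired_samples_t(multiple_choice, short_answer)
t_independent = independent_samples_t(multiple_choice, short_answer)

print(f"Treated as the same students (paired):   t = {t_paired:.2f}")
print(f"Treated as two separate groups:          t = {t_independent:.2f}")
```

Running both tests on the same numbers is only for comparison; in a real study the design decides which test applies. The paired t comes out much larger here because each student serves as their own comparison, which removes the person-to-person variability.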

ANOVA: In general, an ANOVA is used to compare two or more means by asking whether there is more variability between the sample means than between the people within each group. It’s still, essentially, asking if the effect we see is bigger than what we expect by chance. Note that an ANOVA is one test that tells us if any effect exists, but when there are more than two means it has to be followed up with additional “post-hoc” tests to determine which of the means are actually significantly different from each other.

There are then multiple types of ANOVAs that map onto the types of t-tests. One-way ANOVAs are similar to independent-samples t-tests: they are used to compare the means of three or more different groups of people. Repeated-measures ANOVAs are used to compare three or more means from the same group of people.
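
As a rough sketch, here is how the F-value for a one-way ANOVA (different people in each group) could be computed by hand in Python. The scores are made up, and the function name is mine. A repeated-measures ANOVA follows the same idea but also removes the variability that comes from individual people, which this sketch does not do.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return F: variability between group means divided by variability within groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # How far each group mean sits from the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # How far each person sits from their own group's mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up exam scores for three different groups of five students.
multiple_choice = [80, 85, 78, 90, 82]
short_answer = [75, 70, 78, 72, 80]
essay = [68, 74, 65, 70, 73]

f_value = one_way_anova_f(multiple_choice, short_answer, essay)
print(f"F(2, 12) = {f_value:.2f}")
```

An F near 1 means the group means differ about as much as chance would predict; the larger the F, the harder the differences are to explain by chance alone.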

Image from Pixabay

Factorial ANOVAs are a bit more complicated. A factorial ANOVA means there is more than one factor being examined. Maybe we’re interested in both multiple choice vs. short answer questions as well as whether the questions are factual or applied. There are two factors that require a factorial ANOVA. There can be more factors and they can have more levels (maybe you want to compare multiple-choice to short-answer to essay – that’s 3 levels). Factorial ANOVAs allow us to examine interactions between our factors. For example, Roediger and Karpicke (2006) found that at a short delay, participants performed better when they restudied, but at a longer delay they performed better when they were tested via retrieval practice (1). That is an interaction between two variables: delay (short vs. long) and type of study (restudy vs. retrieval practice).
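
One way to picture an interaction is as a "difference of differences." The Python sketch below uses hypothetical cell means that I made up to mimic the pattern described above; they are NOT the values reported by Roediger and Karpicke (2006).

```python
# Hypothetical proportion of material recalled in each cell of a 2 x 2 design.
# These numbers are invented for illustration only.
cell_means = {
    ("short delay", "restudy"): 0.80,
    ("short delay", "retrieval practice"): 0.75,
    ("long delay", "restudy"): 0.40,
    ("long delay", "retrieval practice"): 0.55,
}


def retrieval_advantage(delay):
    """Effect of study type (retrieval practice minus restudy) at one delay."""
    return cell_means[(delay, "retrieval practice")] - cell_means[(delay, "restudy")]


short_effect = retrieval_advantage("short delay")
long_effect = retrieval_advantage("long delay")

# If the effect of study type were the same at both delays, this would be 0.
interaction_contrast = long_effect - short_effect

print(f"Retrieval advantage at short delay: {short_effect:+.2f}")
print(f"Retrieval advantage at long delay:  {long_effect:+.2f}")
print(f"Difference of differences:          {interaction_contrast:+.2f}")
```

The factorial ANOVA then asks whether that difference of differences is bigger than we would expect by chance.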

r: Pearson’s r is used to ask how strong of a (straight-line) relationship exists between two variables. The value of r can range from -1 to 1. An r-value near 0 means that there is little or no linear relationship between the variables, and a value of exactly 1 or -1 means there is a perfect (maximum strength) relationship; the closer r gets to 1 or -1, the stronger the relationship. A positive r-value means that as one variable goes up, so does the other. An example here would be the relationship between height and weight; as height goes up, weight also tends to go up. A negative r-value means that as one variable goes up, the other goes down. An example here would be the relationship between absences and course grades; as absences go up, course grades tend to go down.
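
For the curious, here is a small Python sketch of how r is calculated. The absences and grades are made-up numbers chosen to show a strong negative relationship, and the function name is my own.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's r: how closely two variables move together, from -1 to 1."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Do the two variables tend to be above/below their means at the same time?
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)


# Made-up data for six students.
absences = [0, 1, 2, 4, 6, 8]
course_grades = [95, 90, 88, 80, 72, 65]

r = pearson_r(absences, course_grades)
print(f"r = {r:.2f}")  # negative: more absences go with lower grades
```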

Regression: There are many different types of regression and discussing all of them is outside the scope of this blog post, but in general, a regression is a mathematical way of plotting the relationship described by a correlation. The big advantage of using regression is that you can make predictions about what would happen outside the range of data. At the time of this writing, regression lines are becoming very popular as people are looking at linear and logistic regressions to plot the trend line for coronavirus cases and deaths and to make predictions for the coming weeks and months.
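
As a minimal sketch of the simplest case, the Python below fits a straight line (a simple linear regression using least squares) and uses it to predict a value outside the observed range. The absences and grades are made up, and the function name is mine.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data for six students; observed absences only go up to 8.
absences = [0, 1, 2, 4, 6, 8]
course_grades = [95, 90, 88, 80, 72, 65]

slope, intercept = fit_line(absences, course_grades)

# Predict the grade for 10 absences, which is beyond anything we observed.
predicted_grade = intercept + slope * 10

print(f"Each extra absence is associated with {slope:.1f} grade points")
print(f"Predicted grade with 10 absences: {predicted_grade:.1f}")
```

Keep in mind that a prediction outside the range of the data assumes the same trend continues, which the data themselves cannot confirm.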

p: When statistics are reported, most typically you will see a z, t, F, or r-value referring to the statistical tests above, and then you will see a p-value. The p-value is the exact probability of obtaining results that are this extreme or more if only chance were at work (that is, if the null hypothesis were true). We are therefore looking for very low values as strong evidence that there is something more going on than just random variability. Traditionally, .05 or 5% has been used as a cut-off value for something being considered “statistically significant” (although this method has come under quite a bit of scrutiny in recent years). So, again, statistically significant simply means that there is sufficient evidence to indicate that the results we see are unlikely to be due to chance alone.
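
One way to get a feel for a p-value is to simulate "chance alone" directly. The Python sketch below uses a permutation (shuffling) test, which is a different technique from the z, t, and F tests described above but rests on the same logic. The scores, the seed, and the number of shuffles are all made up for illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is repeatable

# Made-up scores for two groups of eight people.
control = [98, 102, 95, 101, 99, 104, 97, 100]
intervention = [105, 110, 103, 108, 101, 112, 106, 104]

observed_gap = mean(intervention) - mean(control)


def shuffled_gap(scores, group_size):
    """Randomly relabel who was in which group and return the resulting mean gap."""
    shuffled = random.sample(scores, len(scores))
    return mean(shuffled[:group_size]) - mean(shuffled[group_size:])


all_scores = control + intervention
n_shuffles = 10_000

# Count how often random relabeling produces a gap at least as extreme as ours.
n_as_extreme = sum(
    abs(shuffled_gap(all_scores, len(intervention))) >= abs(observed_gap)
    for _ in range(n_shuffles)
)
p_value = n_as_extreme / n_shuffles

print(f"Observed gap: {observed_gap:.2f} points")
print(f"Approximate p-value: {p_value:.4f}")
```

If the group labels truly did not matter, shuffling them should often produce gaps as big as the one observed; when it almost never does, the p-value is small.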

Effect size: After the p-value, you will often see one more letter (d, ηp², etc.). This is the effect size, which is exactly what it sounds like. It is the magnitude of the effect and there are different rules of thumb for what constitutes a small, medium, and large effect. This is arguably more important than statistical significance because a very small effect can still be statistically significant without being terribly important. As an example, you could have a weight loss program where you find a statistically significant amount of weight loss, demonstrating that it is real – the weight loss is likely due to the weight loss program, but the magnitude of the loss might only be 1 lb… that small of an effect might not be worth spending money on.
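
Here is a short Python sketch of one common effect size, Cohen’s d, which expresses the difference between two means in standard deviation units. The weight changes are made up so that the average difference is exactly 1 lb, in the spirit of the example above; the function name is mine.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: the mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = sqrt(
        ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b))
        / (n_a + n_b - 2)
    )
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Made-up weight change in pounds (negative = weight lost).
program = [-11, 9, -6, 4, -16, 14, -1, -1]
no_program = [-10, 10, -5, 5, -15, 15, 0, 0]

d = cohens_d(program, no_program)
print(f"Mean difference: {mean(program) - mean(no_program):.1f} lb")
print(f"Cohen's d = {d:.2f}")
```

By Cohen’s widely used rules of thumb (roughly 0.2 for small, 0.5 for medium, 0.8 for large), a d of this size falls below even a "small" effect, however statistically significant it might be in a large enough sample.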

There is much more that could be covered here, but hopefully this is enough to help decipher the results sections of empirical articles. One final recommendation is to read the first paragraph or two of the discussion section before going back to the results, as that section usually begins with a summary of the main findings.


References:

(1) Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249-255.