Effect Size (Cohen's d): Measuring the Practical Importance of Results
Why P-Values Are Not the Whole Story
When researchers run a statistical test, they typically get a p-value that tells them whether a result is "statistically significant." But statistical significance and practical importance are not the same thing. A study with thousands of participants might find a statistically significant difference between two groups that is, in real-world terms, tiny and meaningless. Conversely, a small study might miss a genuinely important difference simply because it did not have enough participants. A p-value speaks to whether an observed difference is likely to be more than chance; it does not tell you whether that difference is large enough to matter.
This is where effect size comes in. Effect size is a way of quantifying how big a difference or relationship actually is, regardless of sample size. One of the most commonly used measures of effect size is Cohen's d, which is designed for comparing the means of two groups.
What Is Cohen's d?
Cohen's d expresses the difference between two group means in terms of standard deviations. The standard deviation is a measure of how spread out the scores in a group are. By dividing the difference between the means by the standard deviation, you get a number that is free from the original units of measurement. This makes it possible to compare effect sizes across completely different studies and different types of measurements.
The basic formula is: d = (Mean of Group 1 − Mean of Group 2) / Pooled Standard Deviation. The pooled standard deviation combines the variability within the two groups into a single figure, giving more weight to the larger group. A larger value of d means the two groups are further apart relative to their variability, while a value close to zero means the groups are very similar.
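To make the formula concrete, here is a minimal Python sketch that computes d from each group's mean, standard deviation, and sample size. The function name cohens_d is just illustrative, not a standard library routine.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d from the two groups' means, standard deviations, and sample sizes."""
    # Pooled standard deviation: combines the two groups' spreads,
    # weighting each by its degrees of freedom (n - 1).
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd
```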
A Concrete Example
Imagine a researcher studying whether a new teaching method improves exam scores. She randomly assigns 50 students to a traditional lecture group and 50 students to the new method group. After the course, the lecture group has a mean exam score of 72 with a standard deviation of 10, and the new method group has a mean of 78 with a standard deviation of 10. The difference in means is 6 points, and the pooled standard deviation is 10. So Cohen's d = 6 / 10 = 0.6. This tells you that the new method group scored, on average, 0.6 standard deviations higher than the lecture group.
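Plugging the example's summary statistics into the sketch above reproduces the result (here the pooled standard deviation is exactly 10 because both groups have the same spread):

```python
# New-method group: mean 78, SD 10, n = 50; lecture group: mean 72, SD 10, n = 50
d = cohens_d(78, 10, 50, 72, 10, 50)
print(round(d, 2))  # 0.6
```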
Interpreting Cohen's d: The Benchmarks
Jacob Cohen, the statistician who developed this measure, proposed a set of rough guidelines for interpreting d values:
- Small effect (d = 0.2): The difference exists but is subtle. You probably would not notice it without careful measurement. Think of the height difference between 15-year-old and 16-year-old girls.
- Medium effect (d = 0.5): The difference is noticeable and likely meaningful in practice. This is roughly the kind of difference you can start to see with the naked eye when comparing two groups.
- Large effect (d = 0.8): The difference is substantial and obvious. The two groups clearly differ in a way that has real practical consequences.
In our teaching example, d = 0.6 falls between medium and large, suggesting the new method produces a meaningful improvement that would be worth considering in practice.
Thinking About Overlapping Distributions
Another way to understand Cohen's d is to think about how much the two groups overlap. When d is zero, the two groups are sitting right on top of each other — their score distributions overlap completely. As d increases, the distributions pull apart. With a small effect (d = 0.2), there is still about 85% overlap between the groups. With a medium effect (d = 0.5), overlap drops to around 67%. With a large effect (d = 0.8), the overlap is roughly 53%. Even with a large effect size, the groups still overlap substantially, which is a helpful reminder that group-level differences do not mean every individual in one group outperforms every individual in the other.
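These overlap figures can be reproduced for two normal distributions with equal variability. The sketch below assumes the quoted percentages refer to Cohen's U1-based measure of overlap; other definitions of overlap give somewhat different numbers.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def percent_overlap(d):
    """Overlap between two equal-variance normal groups, as 1 minus Cohen's U1 non-overlap."""
    p = normal_cdf(abs(d) / 2)
    u1 = (2 * p - 1) / p  # proportion of the combined area unique to one group
    return 100 * (1 - u1)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {percent_overlap(d):.0f}% overlap")
# prints about 85%, 67%, and 53%
```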
Reporting Effect Size
Most major research organizations, including the American Psychological Association, now recommend that researchers report effect sizes alongside p-values. This is because effect sizes help readers judge whether a finding has practical significance, allow comparisons across studies, and are essential for a technique called meta-analysis, where researchers combine results from many studies to reach broader conclusions. When you report Cohen's d, it is good practice to include a confidence interval as well, so readers can see the range of plausible values for the true effect size.
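As a rough illustration of what such an interval looks like, the sketch below uses a common large-sample approximation to the standard error of d; exact methods based on the noncentral t distribution are more precise but more involved. The numbers are taken from the teaching example above.

```python
import math

def cohens_d_ci(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for Cohen's d (large-sample normal approximation)."""
    # Common approximation to the standard error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

low, high = cohens_d_ci(0.6, 50, 50)
print(f"d = 0.60, 95% CI [{low:.2f}, {high:.2f}]")  # roughly [0.20, 1.00]
```

A wide interval like this is typical for moderate sample sizes: the point estimate of 0.6 is the best guess, but the data are also consistent with effects anywhere from small to large.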
Key Things to Remember
Cohen's d works best when comparing two groups on a continuous measure (like test scores, reaction times, or blood pressure). It assumes that the scores in both groups are roughly normally distributed and that the two groups have similar variability. The benchmarks of 0.2, 0.5, and 0.8 are useful starting points, but Cohen himself cautioned that what counts as a "small" or "large" effect depends on the context. In some fields, even a small effect size can have enormous practical importance — for instance, a medication that reduces heart attack risk by a small amount could save thousands of lives when applied across an entire population.

