Tukey HSD Post-Hoc Test
What Is the Tukey HSD Test?
The Tukey HSD (Honestly Significant Difference) test is a follow-up procedure used after an ANOVA has found a statistically significant result. While an ANOVA tells you that at least one group mean differs from the others, it does not tell you which groups differ. That is where the Tukey HSD comes in — it compares every possible pair of group means and tells you exactly where the significant differences lie.
Why Do We Need a Post-Hoc Test?
Imagine a nutritionist who tests three different diets (low-carb, Mediterranean, and high-protein) to see which leads to the most weight loss over 12 weeks. She runs an ANOVA and gets a significant result (p < 0.05). Great — she knows the diets are not all equally effective. But which diet is best? Is low-carb better than both of the others, or is the difference only between Mediterranean and high-protein?
You might think the simplest solution is to run separate t-tests for each pair: low-carb vs. Mediterranean, low-carb vs. high-protein, and Mediterranean vs. high-protein. The problem is that each t-test carries its own risk of a false positive (typically 5%), and when you run multiple tests those risks compound. The resulting quantity is known as the familywise error rate: the probability of making at least one false positive across the entire set of comparisons. With three comparisons at the 5% level, that probability would be about 14% if the tests were independent (1 − 0.95³ ≈ 0.143), nearly three times the nominal rate.
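The growth of the familywise error rate can be sketched in a few lines of Python. This is only an approximation: it assumes the pairwise tests are independent, which is not strictly true when they share the same data, and the function name is made up for illustration.

```python
from math import comb

def familywise_error_rate(num_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests.

    Simplifying assumption: the tests are treated as independent.
    """
    num_comparisons = comb(num_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** num_comparisons

for k in (3, 4, 5):
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
# 3 groups ->  3 comparisons -> 0.143
# 4 groups ->  6 comparisons -> 0.265
# 5 groups -> 10 comparisons -> 0.401
```

Even under this rough model, the risk climbs quickly as groups are added, which is why uncorrected pairwise t-tests are discouraged.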
The Tukey HSD test solves this problem. It is specifically designed to make all possible pairwise comparisons while keeping the overall familywise error rate at your chosen significance level (usually 0.05). In other words, it protects you from claiming a difference exists when it really does not, even when you are making many comparisons at once.
How Does It Work?
The Tukey HSD test calculates a value called the Q statistic for each pair of groups. Q equals the absolute difference between the two group means divided by a standard error. With equal group sizes, that standard error is the square root of the mean square error (MSE) from the ANOVA divided by n, the number of observations in each group.
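As a minimal sketch of that calculation, the function below computes Q for one pair of groups using only the Python standard library. It assumes equal group sizes, and the function name and the tiny data set are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance

def q_statistic(group_a, group_b, all_groups):
    """Q for one pair of groups, assuming every group has the same size."""
    n = len(group_a)                      # observations per group
    k = len(all_groups)                   # number of groups
    df_error = sum(len(g) for g in all_groups) - k
    # Mean square error: the pooled within-group variance from the ANOVA
    ss_error = sum((len(g) - 1) * variance(g) for g in all_groups)
    mse = ss_error / df_error
    standard_error = sqrt(mse / n)
    return abs(mean(group_a) - mean(group_b)) / standard_error

# Made-up numbers, chosen so the arithmetic is easy to follow by hand
groups = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
q = q_statistic(groups[0], groups[2], groups)
print(round(q, 3))  # MSE = 1, SE = sqrt(1/3), so Q = 3 / sqrt(1/3) ≈ 5.196
```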
This Q value is then compared to a critical value from the Studentized Range Distribution — a special probability distribution that accounts for the number of groups being compared. If the calculated Q exceeds the critical value, the difference between that pair of means is declared statistically significant.
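One way to look up the critical value in code, rather than in a printed table, is SciPy's `scipy.stats.studentized_range` distribution (available in SciPy 1.7 and later). The helper names below are hypothetical, and the sketch assumes a standard one-way design where the error degrees of freedom equal the total sample size minus the number of groups.

```python
from scipy.stats import studentized_range

def q_critical(num_groups: int, df_error: int, alpha: float = 0.05) -> float:
    """Critical value of the studentized range for the given design."""
    return studentized_range.ppf(1 - alpha, num_groups, df_error)

def is_significant(q_value: float, num_groups: int, df_error: int,
                   alpha: float = 0.05) -> bool:
    """Declare a pair different when its Q exceeds the critical value."""
    return q_value > q_critical(num_groups, df_error, alpha)

# Example: 3 groups with 30 error degrees of freedom
print(q_critical(3, 30))
print(is_significant(5.2, 3, 30))
```

The critical value grows with the number of groups, so the same Q can be significant in a three-group study and not in a six-group one.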
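The key insight is that the critical value from the Studentized Range Distribution sets a higher bar than a simple t-test would. The two are on different scales (for a single pair, Q equals √2 times the absolute value of a t statistic computed from the same pooled error term), but once they are put on the same scale, the Tukey threshold is the larger one whenever there are three or more groups. This higher bar is what controls the familywise error rate. It makes each individual comparison more conservative so that the set of comparisons as a whole stays reliable.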
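To see the higher bar in numbers, the sketch below puts both thresholds on the t scale by dividing the studentized range critical value by √2. It relies on SciPy's `studentized_range` and `t` distributions; the function name is hypothetical.

```python
from math import sqrt
from scipy.stats import studentized_range, t

def per_comparison_thresholds(num_groups: int, df_error: int,
                              alpha: float = 0.05):
    """Return (unadjusted t threshold, Tukey threshold on the t scale)."""
    t_crit = t.ppf(1 - alpha / 2, df_error)             # two-sided t-test
    q_crit = studentized_range.ppf(1 - alpha, num_groups, df_error)
    return t_crit, q_crit / sqrt(2)                     # Q = sqrt(2) * |t|

for k in (2, 3, 5):
    t_crit, tukey_crit = per_comparison_thresholds(k, 30)
    print(k, round(t_crit, 3), round(tukey_crit, 3))
```

With only two groups the two thresholds coincide, and the gap widens as more groups are added.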
Interpreting the Results
The output of a Tukey HSD test typically shows each pair of groups, the difference in their means, the Q statistic, and a p-value. Some outputs also include confidence intervals for each pairwise difference. If the confidence interval does not contain zero, the difference is significant at the corresponding level (a 95% interval corresponds to a significance level of 0.05).
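As one possible illustration, SciPy (1.8 and later) provides `scipy.stats.tukey_hsd`, which returns pairwise p-values and confidence intervals. The weight-loss figures below are made up for the example and are not from any study. Note that in SciPy's result object the `statistic` array holds the difference in means for each pair rather than Q itself.

```python
from statistics import mean
from scipy.stats import tukey_hsd

# Made-up weight loss in kg after 12 weeks, for illustration only
diets = {
    "low-carb":      [6.1, 5.8, 6.5, 7.0, 6.3],
    "Mediterranean": [5.2, 5.6, 4.9, 5.4, 5.8],
    "high-protein":  [4.0, 4.4, 3.9, 4.6, 4.2],
}
names = list(diets)

result = tukey_hsd(*diets.values())
ci = result.confidence_interval(confidence_level=0.95)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        diff = result.statistic[i, j]          # mean of group i minus group j
        p = result.pvalue[i, j]
        low, high = ci.low[i, j], ci.high[i, j]
        excludes_zero = low > 0 or high < 0    # same verdict as p < 0.05
        print(f"{names[i]} vs {names[j]}: diff={diff:.2f}, p={p:.4f}, "
              f"95% CI=({low:.2f}, {high:.2f}), significant={excludes_zero}")
```

Reading the output pair by pair mirrors the interpretation rule above: a pair is flagged when its interval lies entirely on one side of zero.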
Returning to the diet example, the Tukey HSD might reveal that low-carb and high-protein diets differ significantly from each other, but neither differs significantly from the Mediterranean diet. This is much more informative than the ANOVA result alone, which simply told us that something was different somewhere.
Key Assumptions
The Tukey HSD test shares the assumptions of the ANOVA it follows, and its classic form adds a requirement on group sizes:
- Independence — Observations within and across groups should be independent of one another.
- Normality — The data within each group should be approximately normally distributed. The test is reasonably robust to mild violations of this when sample sizes are not too small.
- Equal variances — The variability within each group should be roughly the same. If this assumption is violated, alternative procedures such as the Games-Howell test may be more appropriate.
- Equal (or similar) sample sizes — The classic Tukey HSD assumes equal group sizes. When group sizes are unequal, a modified version called the Tukey-Kramer method is used, which adjusts the standard error calculation to accommodate the imbalance.
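The Tukey-Kramer adjustment mentioned in the last item can be sketched as follows. The standard error averages the reciprocals of the two group sizes, so it reduces to the classic formula when the sizes match. The function name and example numbers are hypothetical.

```python
from math import sqrt

def tukey_kramer_q(mean_i: float, mean_j: float,
                   n_i: int, n_j: int, mse: float) -> float:
    """Q statistic for one pair when group sizes may differ."""
    # With n_i == n_j == n this standard error equals sqrt(mse / n)
    standard_error = sqrt((mse / 2) * (1 / n_i + 1 / n_j))
    return abs(mean_i - mean_j) / standard_error

# Same mean difference and MSE, balanced vs. unbalanced group sizes
print(round(tukey_kramer_q(5.0, 2.0, 3, 3, 1.0), 3))  # 5.196, matches classic Q
print(round(tukey_kramer_q(5.0, 2.0, 3, 6, 1.0), 3))  # 6.0, SE = sqrt(0.25)
```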
When Should You Use the Tukey HSD?
Use the Tukey HSD when you have three or more groups, your ANOVA has produced a significant result, and you want to know which specific pairs of groups differ. It is one of the most widely used post-hoc tests in the social sciences because it strikes a good balance between statistical power (the ability to detect real differences) and control of the familywise error rate.
It is worth noting that the Tukey HSD is designed for situations where you want to compare all possible pairs. If you only planned to make a few specific comparisons before collecting data, other methods (such as planned contrasts) might be more appropriate. But when you are exploring all pairwise differences — which is common in practice — the Tukey HSD is an excellent choice.