Chi-Square Goodness of Fit Test
What Is the Chi-Square Goodness of Fit Test?
The Chi-Square goodness of fit test is a statistical procedure that tells you whether the observed distribution of a single categorical variable matches a distribution you expected or predicted. In simple terms, it answers the question: does my data fit the pattern I thought it would?
Imagine you roll a six-sided die 120 times. If the die is fair, you would expect each face to come up about 20 times. But suppose you observe that the six came up 35 times while the one only came up 12 times. Is the die loaded, or could this just be normal random variation? The Chi-Square goodness of fit test helps you decide.
Why Do We Need It?
Whenever you have categorical data — data that falls into distinct groups or categories — and a theory about how those categories should be distributed, this test lets you check reality against expectation. The applications are wide-ranging. A geneticist might want to verify that offspring appear in the ratios predicted by Mendelian genetics. A market researcher might check whether customer preferences are evenly split across four product designs. A political scientist might test whether voter turnout across districts matches the proportions predicted by a demographic model.
Without a formal test, you would be relying on your intuition to judge whether the observed numbers are “close enough” to the expected ones. The goodness of fit test replaces guesswork with a principled statistical answer.
How Does It Work?
The procedure compares observed frequencies (what you actually counted) with expected frequencies (what your theory predicts). For each category, you calculate the difference between observed and expected, square that difference, and divide by the expected frequency. Then you sum these values across all categories to get the Chi-Square statistic.
A Chi-Square statistic of zero would mean your observed data perfectly matches the expected distribution. Because every term in the sum is squared, the statistic can never be negative; the larger it grows, the worse the fit. You then compare this statistic to a Chi-Square distribution with the appropriate degrees of freedom, calculated as the number of categories minus one, to obtain a p-value.
In the dice example, there are six categories (one for each face), so the degrees of freedom would be 6 − 1 = 5. If the resulting p-value is less than your significance level (commonly 0.05), you conclude that the observed distribution does not fit the expected one — in this case, evidence that the die may not be fair.
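The calculation above is simple enough to do by hand or in a few lines of code. Here is a minimal sketch for the dice scenario; note that the text only specifies the counts for faces one (12) and six (35), so the counts for the other four faces below are illustrative values chosen to sum to 120 rolls.

```python
# Manual chi-square goodness of fit computation for the dice example.
# Counts for faces 2-5 are illustrative (only faces 1 and 6 are given
# in the text); together the six counts sum to 120 rolls.
observed = [12, 18, 19, 17, 19, 35]   # counts for faces 1..6
expected = [120 / 6] * 6              # fair die: 20 per face

# Sum of (observed - expected)^2 / expected across all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

print(f"chi-square = {chi_square:.2f} with {df} degrees of freedom")
# For alpha = 0.05 and 5 degrees of freedom, the critical value is
# about 11.07, so a statistic this large would suggest the die is
# not fair.
```

With these illustrative counts the statistic works out to 15.2, driven mostly by the excess of sixes.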
A Research Example
Suppose a university surveys 500 graduating students about their satisfaction with their degree program, offering four response options: very satisfied, somewhat satisfied, somewhat dissatisfied, and very dissatisfied. The university expects, based on national data, that the responses should split as 40%, 30%, 20%, and 10% respectively. The expected frequencies would therefore be 200, 150, 100, and 50.
After collecting the data, they find the actual counts are 170, 160, 110, and 60. A goodness of fit test would determine whether these deviations from the expected pattern are large enough to be statistically significant, or whether they fall within the range of normal sampling variation.
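This test can be run directly with SciPy's `scipy.stats.chisquare` function (assumed available here), which takes the observed and expected counts and returns the statistic and p-value:

```python
# Goodness of fit test for the satisfaction survey example.
# Expected counts are 40/30/20/10% of the 500 respondents.
from scipy.stats import chisquare

observed = [170, 160, 110, 60]   # actual survey counts
expected = [200, 150, 100, 50]   # counts predicted by national data

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

Note that `chisquare` requires the observed and expected counts to have the same total (here, 500), since both must describe the same sample size.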
How Is It Different from the Chi-Square Test of Independence?
It is easy to confuse the two Chi-Square tests, but they answer different questions. The test of independence involves two categorical variables and asks whether they are related to each other (for example, is there a relationship between gender and voting preference?). The goodness of fit test involves only one categorical variable and asks whether its observed distribution matches a specific expected pattern.
Think of it this way: the test of independence asks “are these two things connected?” while the goodness of fit asks “does this one thing look the way I predicted?”
Key Assumptions
For the test to produce reliable results, several conditions should be met:
- Categorical data — The variable must be categorical. Each observation falls into exactly one category.
- Independence of observations — Each observation should be independent of the others. One person’s response should not influence another’s.
- Sufficient expected frequencies — A widely used guideline is that all expected frequencies should be at least 5. When expected counts are very small, the Chi-Square approximation may not be accurate, and alternative approaches (such as combining categories or using exact tests) should be considered.
- Random sampling — The data should come from a random sample so that the findings can be generalized to the broader population.
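The expected-frequency guideline is easy to verify in code before running the test. A small helper like the following (a hypothetical utility, not part of any library) makes the check explicit:

```python
# Sanity check on expected counts before running the test, following
# the rule of thumb that all expected frequencies should be at least 5.
# This helper is illustrative, not a standard library function.
def expected_counts_ok(expected, minimum=5):
    """Return True if every expected frequency meets the minimum."""
    return all(e >= minimum for e in expected)

print(expected_counts_ok([200, 150, 100, 50]))  # survey example -> True
print(expected_counts_ok([30, 4, 11]))          # one count too small -> False
```

If the check fails, the text's suggested remedies apply: combine sparse categories or use an exact test instead.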
Interpreting Your Result
If the p-value is below your chosen significance level, you reject the null hypothesis that the observed data fits the expected distribution. This does not tell you which categories are responsible for the poor fit — only that the overall pattern differs from what was expected. To pinpoint where the discrepancies lie, you can examine the individual components of the Chi-Square statistic for each category. Categories with the largest contributions are the ones where observed and expected counts diverge the most.
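Inspecting the individual contributions is straightforward: each category's term in the sum is its contribution. Using the survey counts from the earlier example:

```python
# Per-category contributions to the chi-square statistic for the
# survey example; the largest contributions mark the categories
# where observed and expected counts diverge the most.
observed = [170, 160, 110, 60]
expected = [200, 150, 100, 50]
labels = ["very satisfied", "somewhat satisfied",
          "somewhat dissatisfied", "very dissatisfied"]

contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
for label, c in sorted(zip(labels, contributions), key=lambda t: -t[1]):
    print(f"{label}: {c:.2f}")
```

Here the "very satisfied" category contributes the most (4.50 of the total 8.17), showing that the shortfall of very satisfied responses is the main source of the poor fit.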
Conversely, a non-significant result means you do not have enough evidence to say the observed distribution differs from the expected one. It does not prove the fit is perfect — only that any deviations are small enough to be consistent with random chance.