Multiple Regression
What Is Multiple Regression?
Multiple regression is an extension of simple linear regression that allows you to predict an outcome variable (Y) using two or more predictor variables (X₁, X₂, X₃, and so on). In the real world, outcomes are rarely determined by a single factor. A student's exam performance depends not just on hours of study, but also on sleep quality, prior knowledge, and motivation. Multiple regression lets you build a model that accounts for several of these factors simultaneously, giving you a more complete and realistic picture of what drives the outcome.
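To make this concrete, here is a minimal sketch of fitting such a model with ordinary least squares using numpy. The data are invented for illustration (six students' exam scores, hours of study, and hours of sleep); a real analysis would use statistical software that also reports standard errors and p-values.

```python
import numpy as np

# Toy data (invented for illustration): exam score predicted from
# hours of study and hours of sleep for six students.
hours_study = np.array([2, 4, 5, 7, 8, 10], dtype=float)
hours_sleep = np.array([6, 7, 5, 8, 6, 7], dtype=float)
exam_score = np.array([55, 65, 62, 80, 78, 90], dtype=float)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(hours_study), hours_study, hours_sleep])

# Ordinary least squares: solve score ≈ b0 + b1*study + b2*sleep.
b, *_ = np.linalg.lstsq(X, exam_score, rcond=None)
intercept, b_study, b_sleep = b
```

The same least-squares machinery handles any number of predictors: each extra predictor is simply one more column in the design matrix.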
Why Do We Need It?
Imagine a researcher who finds that ice cream sales and drowning rates are positively correlated. Does ice cream cause drowning? Of course not — both are driven by a third variable: hot weather. This is the problem of confounding. Multiple regression helps address this by allowing you to include potential confounds as additional predictors. By adding temperature to the model, the researcher can see whether the relationship between ice cream sales and drowning rates persists after accounting for the effect of temperature. This ability to "control for" other variables is one of the most powerful features of multiple regression.
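The confounding story can be demonstrated with simulated data. In the sketch below (all numbers invented), temperature drives both ice cream sales and drowning rates, and sales have no effect of their own. Regressing drownings on sales alone yields a clearly positive coefficient; adding temperature to the model shrinks the sales coefficient towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(25, 5, n)                 # hot weather drives both
sales = 2.0 * temp + rng.normal(0, 2, n)    # ice cream sales
drown = 0.5 * temp + rng.normal(0, 1, n)    # drownings: no true effect of sales

# Naive model: drownings on sales alone (confounded).
X1 = np.column_stack([np.ones(n), sales])
b1, *_ = np.linalg.lstsq(X1, drown, rcond=None)

# Adjusted model: temperature included as a second predictor.
X2 = np.column_stack([np.ones(n), sales, temp])
b2, *_ = np.linalg.lstsq(X2, drown, rcond=None)

# b1[1] is positive (spurious); b2[1] is close to zero once
# temperature is controlled for.
```

This is exactly what "controlling for" a variable means in practice: the coefficient on sales now reflects only the variation in sales that is not shared with temperature.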
Interpreting the Coefficients
In a multiple regression model, each predictor variable gets its own coefficient (an unstandardised "b weight"; when the variables are standardised first, the coefficients are called "beta weights"). The crucial point is that each coefficient represents the effect of that predictor while holding all the other predictors constant. This is sometimes described as the "unique contribution" of each variable.
Imagine a researcher studying factors that predict job satisfaction. The model includes three predictors: salary (in thousands), commute time (in minutes), and team size. The resulting equation might be: Job Satisfaction = 50 + 0.8(Salary) − 0.3(Commute Time) + 0.1(Team Size). The coefficient of 0.8 for salary means that for every additional thousand pounds in salary, job satisfaction increases by 0.8 points on average, assuming commute time and team size stay the same. The negative coefficient for commute time (−0.3) means that each extra minute of commuting is associated with a 0.3-point decrease in satisfaction, all else being equal.
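The "holding everything else constant" interpretation can be verified directly from the worked equation. The function below encodes the hypothetical job-satisfaction model from the paragraph above; raising salary by one unit while fixing the other predictors moves the prediction by exactly the salary coefficient.

```python
def predicted_satisfaction(salary_k, commute_min, team_size):
    """Prediction from the worked example's (hypothetical) equation:
    50 + 0.8*Salary - 0.3*Commute + 0.1*TeamSize."""
    return 50 + 0.8 * salary_k - 0.3 * commute_min + 0.1 * team_size

# Hold commute (30 min) and team size (5) fixed, vary only salary.
base = predicted_satisfaction(40, 30, 5)    # 50 + 32 - 9 + 0.5 = 73.5
raised = predicted_satisfaction(41, 30, 5)  # one extra thousand in salary
# raised - base equals the salary coefficient, 0.8
```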
R-Squared and Adjusted R-Squared
As with simple regression, R-squared (R²) tells you the proportion of variance in the outcome that is explained by the predictors. However, there is a catch: adding more predictors to a model can never decrease R-squared, and will almost always increase it, even if the new predictors are not genuinely useful. This is because R-squared never penalises you for adding variables that contribute only a trivial amount of explanatory power.
This is where adjusted R-squared comes in. Adjusted R-squared accounts for the number of predictors in the model. It increases only if a new predictor improves the model more than would be expected by chance, and it decreases if a predictor adds noise without meaningful explanatory value. When comparing models with different numbers of predictors, adjusted R-squared is the more honest measure of fit.
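The standard adjustment is a simple formula: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k is the number of predictors. A short sketch shows the penalty at work, using illustrative numbers:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared: penalises R² for the number of
    predictors k, given a sample of n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same R² of 0.40 from n = 50 observations looks less
# impressive when it took 5 predictors rather than 1 to achieve it.
with_one = adjusted_r_squared(0.40, 50, 1)
with_five = adjusted_r_squared(0.40, 50, 5)
```

Note that adjusted R² is always below the raw R², and the gap widens as you add predictors relative to the sample size.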
Multicollinearity
Multicollinearity occurs when two or more predictor variables in the model are highly correlated with each other. For example, if you include both "years of education" and "highest degree obtained" as predictors, these two variables carry much of the same information. When multicollinearity is severe, the model struggles to separate the unique effect of each predictor, leading to unstable coefficients and inflated standard errors. A common diagnostic is the Variance Inflation Factor (VIF). As a rule of thumb, a VIF above 5 or 10 suggests problematic multicollinearity, and you may need to remove one of the correlated predictors or combine them into a single measure.
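The VIF for predictor j is 1/(1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on all the other predictors. The sketch below implements this directly with numpy, on simulated data (invented for illustration) where two predictors nearly duplicate each other:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X.
    VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j
    on all the other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ b
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out

# Years of education and a noisy copy of it (a stand-in for "highest
# degree obtained") carry nearly the same information; age does not.
rng = np.random.default_rng(1)
educ = rng.normal(14, 2, 200)
degree = educ + rng.normal(0, 0.2, 200)  # almost a duplicate
age = rng.normal(40, 10, 200)            # independent predictor
vifs = vif(np.column_stack([educ, degree, age]))
# vifs[0] and vifs[1] are far above 10; vifs[2] stays near 1
```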
Key Assumptions
Multiple regression shares the same core assumptions as simple linear regression, with a few additions:
- Linearity: The relationship between each predictor and the outcome should be approximately linear.
- Independence: Observations should be independent of each other.
- Homoscedasticity: The variance of the residuals should be roughly constant across all levels of the predictors.
- Normality of residuals: The residuals should be approximately normally distributed, particularly for smaller samples.
- No severe multicollinearity: The predictor variables should not be too highly correlated with one another.
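Several of these assumptions can be screened with quick residual diagnostics. The sketch below uses simulated data that meets the assumptions by design, then computes two rough numeric checks; in practice you would supplement these with residual plots and formal tests.

```python
import numpy as np

# Toy data (invented): a model that satisfies the assumptions.
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
fitted = X @ b

# Homoscedasticity check: the size of the residuals should not
# trend with the fitted values (correlation near zero).
spread_trend = np.corrcoef(np.abs(resid), fitted)[0, 1]

# Normality check (rough): residual skewness should be near zero.
skew = ((resid - resid.mean()) ** 3).mean() / resid.std() ** 3
```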
When to Use Multiple Regression
Multiple regression is appropriate whenever you want to understand or predict an outcome based on several factors. It is used extensively in psychology (predicting well-being from personality traits), education (predicting academic achievement from study habits, socio-economic background, and school resources), economics (predicting consumer spending from income, interest rates, and confidence indices), and many other fields. It is especially valuable when you suspect that the relationship between one predictor and the outcome might be confounded by other variables, because it allows you to estimate each predictor's effect while statistically accounting for the rest. Whenever you have one continuous outcome and two or more potential predictors, multiple regression is worth considering.

