Simple Linear Regression
What Is Simple Linear Regression?
Simple linear regression is a method for predicting the value of one variable based on the value of another. The variable you are trying to predict is called the dependent variable (or outcome variable, often labelled Y), and the variable you are using to make the prediction is called the independent variable (or predictor variable, often labelled X). The word "simple" tells us there is only one predictor, and "linear" tells us we are fitting a straight line to the data.
Why Do We Need It?
Correlation tells you that two variables are related, but regression goes a step further: it lets you make predictions. If you know that hours of study are positively correlated with exam performance, regression allows you to estimate how much a student's exam score is likely to increase for each additional hour of study. This ability to quantify predictions is what makes regression one of the most widely used tools in statistics.
The Line of Best Fit
Imagine plotting your data on a graph with X on the horizontal axis and Y on the vertical axis. Simple linear regression finds the straight line that comes closest to all the data points. This is called the line of best fit (or regression line). The method used to find this line is called ordinary least squares, which works by minimising the sum of the squared vertical distances between each data point and the line. In other words, the line is positioned so that the overall prediction errors, taken together, are as small as possible.
Slope and Intercept
The regression line is described by two numbers. The slope tells you how much Y changes, on average, for each one-unit increase in X. The intercept (sometimes called the constant) tells you the predicted value of Y when X equals zero. Together, they form the regression equation: Y = intercept + slope × X.
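The slope and intercept can be computed directly from the data. Here is a minimal sketch of the ordinary least squares calculation in Python; the practice-hours data are invented purely for illustration:

```python
# Minimal ordinary least squares for one predictor.
# The data below are invented purely for illustration.

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line Y = intercept + slope * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance of X and Y divided by variance of X
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    # the fitted line always passes through the point (mean of X, mean of Y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

hours = [1, 2, 3, 4, 5]          # hypothetical weekly practice hours
scores = [43, 48, 50, 54, 58]    # hypothetical exam scores
a, b = fit_line(hours, scores)
print(f"Y = {a:.2f} + {b:.2f} * X")   # → Y = 39.80 + 3.60 * X
```

The covariance-over-variance formula is the standard closed-form solution for one predictor; statistical software arrives at the same line.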
Imagine a researcher studying the relationship between hours spent practising a musical instrument per week (X) and performance score on a standardised music exam (Y). After collecting data from 50 students, the regression equation turns out to be Y = 40 + 3.5X. The intercept of 40 means that a student who practises zero hours per week is predicted to score 40 (a prediction to treat with caution if no student in the sample actually practised zero hours). The slope of 3.5 means that each additional hour of practice per week is associated with a 3.5-point increase in the exam score, on average.
R-Squared: Explained Variance
One of the most important outputs of a regression analysis is R-squared (R²), also called the coefficient of determination. R-squared tells you the proportion of the variability in Y that is explained by X. It ranges from 0 to 1. An R-squared of 0.60 means that 60% of the variation in exam scores can be accounted for by hours of practice, while the remaining 40% is due to other factors the model does not capture. A higher R-squared indicates a better fit, but there is no universal threshold for what counts as "good" — it depends on the research context.
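R-squared can be sketched as one minus the ratio of two sums of squares. The data below are invented, and the hard-coded intercept (38.3) and slope (3.85) are the least-squares fit for these toy values:

```python
# R-squared = 1 - (unexplained variation) / (total variation).
# Toy data; intercept and slope are the least-squares fit for these values.
hours = [2, 4, 6, 8, 10]
scores = [45, 56, 60, 69, 77]
intercept, slope = 38.3, 3.85

mean_y = sum(scores) / len(scores)
# residual sum of squares: variation the line fails to explain
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, scores))
# total sum of squares: variation of Y around its own mean
ss_tot = sum((y - mean_y) ** 2 for y in scores)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))   # → 0.986
```

Here roughly 98.6% of the variation in the toy scores is accounted for by practice hours, an unusually high figure that reflects how little noise was built into the invented data.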
Residuals
A residual is the difference between an observed value of Y and the value predicted by the regression line. If a student practises 10 hours per week, the model predicts a score of 40 + 3.5(10) = 75. If the student actually scores 80, the residual is 80 − 75 = 5. Residuals are useful for diagnosing problems with the model. If you plot the residuals and notice a pattern — for example, they fan out as X increases or form a curve — it suggests the model may not be a good fit for the data.
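The residual calculation can be written out directly. The first (hours, score) pair below matches the worked example in the text; the other pairs are invented:

```python
# Residual = observed Y minus predicted Y, using the example equation Y = 40 + 3.5X.
def predict(hours):
    return 40 + 3.5 * hours

# invented (hours, observed score) pairs; the first matches the example in the text
observations = [(10, 80), (6, 58), (8, 70)]
for x, y in observations:
    residual = y - predict(x)
    print(f"{x} hours: predicted {predict(x):.1f}, observed {y}, residual {residual:+.1f}")
```

Printing (or better, plotting) residuals against X like this is the usual first step in spotting the fan or curve patterns described above.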
Key Assumptions
Simple linear regression relies on several assumptions. When these assumptions are seriously violated, the results may be misleading:
- Linearity: The relationship between X and Y should be approximately linear. If the true relationship is curved, a straight line will not capture it accurately.
- Independence: The observations should be independent of one another. One participant's data should not influence another's.
- Homoscedasticity: This term means that the spread (variance) of the residuals should be roughly the same at every level of X. If the residuals fan out or narrow as X increases, this assumption is violated.
- Normality of residuals: The residuals should be approximately normally distributed. This matters most for small samples and for constructing confidence intervals and significance tests.
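A full treatment of diagnostics is beyond this section, but as a rough sketch, the homoscedasticity assumption can be eyeballed by comparing residual spread in the lower and upper halves of X (a residual plot is the usual tool). All data below are invented, and the intercept and slope are the least-squares fit for these toy values:

```python
# Crude homoscedasticity check: compare residual variance in the lower and
# upper halves of X. Roughly equal spread is consistent with the assumption.
def half_variances(x, y, intercept, slope):
    pairs = sorted(zip(x, y))   # order observations by X
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    mid = len(residuals) // 2

    def variance(r):
        return sum(e ** 2 for e in r) / len(r)

    return variance(residuals[:mid]), variance(residuals[-mid:])

low, high = half_variances([2, 4, 6, 8, 10], [45, 56, 60, 69, 77], 38.3, 3.85)
print(low, high)   # a large ratio between the two hints at heteroscedasticity
```

With only five toy observations no real conclusion is possible; on genuine data this kind of split-half comparison is only a quick screen before a proper residual plot or formal test.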
Applying Simple Linear Regression
Simple linear regression is used across the social sciences. Economists use it to explore how education level predicts income. Psychologists use it to study how sleep duration predicts reaction time. Health researchers use it to examine how exercise frequency relates to blood pressure. Whenever you have one continuous predictor and one continuous outcome and you suspect a roughly linear relationship, simple linear regression is a natural starting point. If your data have multiple predictors, you would extend this approach to multiple regression, but the core logic — fitting a line, measuring how well it fits, and checking assumptions — remains the same.