Cohen's Kappa: Measuring Agreement Between Raters
What Is Cohen's Kappa?
Cohen's Kappa is a statistic that measures how well two raters or judges agree when they classify items into categories. The key insight behind Kappa is that some agreement between raters will happen purely by chance, and we need a way to account for that. Kappa tells you how much agreement exists over and above what you would expect from random guessing alone.
Why Simple Percent Agreement Falls Short
Imagine two doctors are independently reviewing 100 X-rays and classifying each one as either "normal" or "abnormal." After they finish, you compare their ratings and find they agree on 85 out of 100 cases. That sounds impressive — 85% agreement! But here is the problem: if 90% of the X-rays are genuinely normal, then even two doctors who are just guessing "normal" most of the time would agree quite often simply by coincidence. In fact, if each doctor independently calls 90% of cases "normal," they would be expected to agree on 0.9 × 0.9 + 0.1 × 0.1 = 82% of cases by chance alone. Percent agreement does not separate genuine consensus from luck.
This is exactly why Cohen's Kappa was developed. It adjusts for the agreement you would expect by chance, giving you a much more honest picture of how consistently the two raters are actually making the same judgments.
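The X-ray scenario can be worked through numerically. The 90% "normal" rate for both doctors is the illustrative figure from above; plugging it into the chance-agreement calculation shows how little of the 85% raw agreement survives the adjustment:

```python
# Illustrative sketch using the X-ray scenario above (figures assumed):
# both doctors call "normal" 90% of the time and agree on 85% of cases.
p_observed = 0.85
p_normal = 0.90  # assumed rate at which each doctor says "normal"

# Chance agreement: both say "normal" plus both say "abnormal"
p_expected = p_normal * p_normal + (1 - p_normal) * (1 - p_normal)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"expected by chance: {p_expected:.2f}")  # → 0.82
print(f"kappa: {kappa:.3f}")                    # → 0.167
```

An impressive-looking 85% raw agreement collapses to a Kappa of about 0.17 — only "slight" agreement once chance is accounted for.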
How Kappa Works
The formula for Cohen's Kappa compares two quantities: the observed agreement and the expected agreement. Observed agreement is straightforward — it is the proportion of cases where both raters gave the same classification. Expected agreement is the proportion of cases where the two raters would be expected to agree if they were assigning categories independently of each other, based on how often each rater uses each category overall.
Kappa is calculated as: (observed agreement − expected agreement) / (1 − expected agreement). The numerator captures how much the actual agreement exceeds chance, and the denominator represents the maximum possible improvement over chance. This ratio gives a value that typically ranges from 0 to 1, where 0 means the raters agree only as much as chance predicts, and 1 means they agree perfectly. Negative values are possible and indicate that the raters agree even less than chance would predict, which usually signals a serious problem with the rating process.
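The calculation just described can be sketched as a small function. This is a minimal illustration, not a production implementation (for real work, `sklearn.metrics.cohen_kappa_score` computes the same statistic); it assumes two equal-length lists of paired ratings:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters' paired categorical ratings.

    A minimal sketch; assumes both lists rate the same items in order.
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)

    # Observed agreement: proportion of items given the same category
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement: for each category, the product of the two
    # raters' marginal rates of using it, summed over categories
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings: 1 = "abnormal", 0 = "normal"
a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 1]
print(f"{cohens_kappa(a, b):.3f}")  # → 0.467
```

Note that if both raters always used a single identical category, the expected agreement would be 1 and the formula's denominator would be zero; real implementations handle that edge case explicitly.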
Interpreting Kappa Values
Researchers commonly use the benchmarks proposed by Landis and Koch (1977) to interpret Kappa values. While these thresholds are not absolute rules, they provide a useful starting point:
- < 0.00: Poor agreement (less than chance)
- 0.00 – 0.20: Slight agreement
- 0.21 – 0.40: Fair agreement
- 0.41 – 0.60: Moderate agreement
- 0.61 – 0.80: Substantial agreement
- 0.81 – 1.00: Almost perfect agreement
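The benchmarks above translate directly into a small lookup helper. This is just a convenience sketch of the Landis and Koch thresholds as listed:

```python
def landis_koch_label(kappa):
    """Map a Kappa value to the Landis & Koch (1977) benchmark label."""
    if kappa < 0.00:
        return "poor"
    elif kappa <= 0.20:
        return "slight"
    elif kappa <= 0.40:
        return "fair"
    elif kappa <= 0.60:
        return "moderate"
    elif kappa <= 0.80:
        return "substantial"
    else:
        return "almost perfect"

print(landis_koch_label(0.65))  # → substantial
```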
Returning to our example, if the two doctors have a Kappa of 0.65, you would describe their agreement as "substantial." This means they are agreeing well beyond what chance alone would produce, though there is still some room for improvement.
A Concrete Example
Imagine a researcher studying how reliably two psychologists can diagnose depression from clinical interviews. Each psychologist independently interviews 80 patients and classifies each one as either "depressed" or "not depressed." After comparing their classifications, the researcher calculates the observed agreement at 0.80 and the expected agreement at 0.50. Plugging into the formula: Kappa = (0.80 − 0.50) / (1 − 0.50) = 0.30 / 0.50 = 0.60. This Kappa of 0.60 sits at the top of the "moderate" range, just short of "substantial," suggesting the psychologists agree reasonably well but that there is meaningful inconsistency in their diagnoses.
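The arithmetic from this example can be checked in a couple of lines, using the observed and expected agreement figures given above:

```python
# Worked example: observed agreement 0.80, expected agreement 0.50
p_o, p_e = 0.80, 0.50
kappa = (p_o - p_e) / (1 - p_e)
print(f"{kappa:.2f}")  # → 0.60
```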
When to Use Cohen's Kappa
Cohen's Kappa is appropriate whenever exactly two raters classify items into distinct categories. It is widely used in medical research (do two doctors agree on a diagnosis?), psychology (do two coders agree on the emotion expressed in a video?), content analysis (do two reviewers categorize articles the same way?), and many other fields. The categories must be the same for both raters, and the same set of items must be rated by both.
Key Assumptions and Limitations
There are a few important things to keep in mind. First, Cohen's Kappa is designed for exactly two raters. If you have more than two raters, you would need a different statistic, such as Fleiss' Kappa. Second, the categories should be mutually exclusive — each item belongs to one and only one category. Third, Kappa can be affected by the prevalence of categories. When one category is much more common than the others, Kappa values tend to be lower even when agreement is high, a phenomenon sometimes called the "Kappa paradox." This means you should always consider the context when interpreting your result, rather than relying on the benchmarks alone.
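The prevalence effect mentioned above can be made concrete. In this hypothetical comparison (all figures assumed for illustration), two rating scenarios share the same 90% observed agreement, but Kappa changes dramatically depending on how skewed the category marginals are:

```python
def kappa_from_marginals(p_o, p_pos):
    """Kappa when both raters use the positive category at rate p_pos."""
    # Expected agreement from the shared marginal rates
    p_e = p_pos ** 2 + (1 - p_pos) ** 2
    return (p_o - p_e) / (1 - p_e)

# Same 90% observed agreement, different prevalence (assumed figures)
print(f"{kappa_from_marginals(0.90, 0.50):.2f}")  # balanced categories → 0.80
print(f"{kappa_from_marginals(0.90, 0.95):.2f}")  # one dominant category → -0.05
```

With balanced categories, 90% agreement yields a substantial Kappa of 0.80; with one category used 95% of the time, the identical 90% agreement actually falls below chance. This is the "Kappa paradox" in action, and it is why the benchmarks should never be read without considering prevalence.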
Despite these limitations, Cohen's Kappa remains one of the most widely used and trusted measures of inter-rater reliability. It provides a principled way to evaluate whether two judges are making consistent decisions, accounting for the agreement that would arise from chance alone.