# Correlation vs Causation: Why Association Isn't Proof
Imagine reading a study that states: "There is a statistically significant correlation between chocolate consumption and the number of Nobel Prizes per country (r = .79, p < .01)." Does that mean chocolate makes people smarter? Of course not. But you would be surprised how often mistakes like this appear in student papers, and even in popular science reporting.
Understanding the difference between correlation and causation is one of the most important statistical lessons you can learn. This distinction is not just academic pedantry. Misinterpreting correlation can lead to bad policies, wrong conclusions, and serious ethical problems in research.
## What Is Correlation?
Correlation is a statistical measure that describes the degree and direction of the association between two variables. When we say two variables are correlated, it means they change together in a systematic way.
For example, if students who spend more time studying tend to have better grades, there is a positive correlation between study time and academic performance. If students who spend more time on social media tend to have lower grades, that is a negative correlation.
But here is the key point: correlation only describes that variables move together. It says nothing about why that happens.
## Pearson's Correlation Coefficient (r)
The most commonly used correlation coefficient is Pearson's r. It measures the linear association between two continuous variables.
Range of values: from -1 to +1
- r = +1 indicates a perfect positive linear relationship (as one variable increases, the other increases in perfectly proportional fashion)
- r = 0 indicates no linear relationship
- r = -1 indicates a perfect negative linear relationship (as one increases, the other decreases in perfectly proportional fashion)
Assumptions for Pearson's r:
- Both variables are continuous (interval or ratio scale)
- The relationship between variables is linear (check with a scatter plot)
- No extreme outliers distorting the results
- Variables are approximately normally distributed
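To make this concrete, here is a minimal Python sketch (with simulated study-time data standing in for a real dataset) that inspects the scatter plot first, as the linearity assumption requires, and then computes Pearson's r with SciPy:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulated data: weekly study hours and exam grades (hypothetical numbers)
study_hours = rng.uniform(0, 20, size=100)
grades = 50 + 2.0 * study_hours + rng.normal(0, 8, size=100)

# Always look at the scatter plot first: Pearson's r assumes linearity
plt.scatter(study_hours, grades)
plt.xlabel("Study hours per week")
plt.ylabel("Exam grade")
plt.show()

# Pearson's r and its p-value
r, p = stats.pearsonr(study_hours, grades)
print(f"r = {r:.2f}, p = {p:.4f}")
```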
If you are interested in how correlation fits into the broader context of statistical tests, see also the explanation of the t-test as one of the most common procedures for testing differences.
## Cohen's Conventions for Effect Size
Jacob Cohen proposed guidelines for interpreting the magnitude of a correlation:
| Effect Size | Pearson's r |
|---|---|
| Small | .10 |
| Medium | .30 |
| Large | .50 |
These values are guidelines, not absolutes. In some fields, a correlation of .30 can be extremely meaningful (e.g., in clinical psychology), while in others a correlation of .50 might be expected and not particularly informative.
It is worth remembering that the coefficient of determination (r-squared) is often more useful for interpretation. If r = .50, then r-squared = .25, meaning one variable explains 25% of the variance in the other. A correlation of .30 explains only 9% of the variance, which is actually quite small.
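A few lines of Python make the point: converting Cohen's benchmarks into explained variance shows how quickly a "medium" effect shrinks.

```python
# Cohen's benchmarks expressed as explained variance (r-squared)
for label, r in [("small", 0.10), ("medium", 0.30), ("large", 0.50)]:
    print(f"{label}: r = {r:.2f} -> r^2 = {r**2:.2f} ({r**2:.0%} of variance)")
# small:  r = 0.10 -> r^2 = 0.01 (1% of variance)
# medium: r = 0.30 -> r^2 = 0.09 (9% of variance)
# large:  r = 0.50 -> r^2 = 0.25 (25% of variance)
```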
## Spearman's Rho: An Alternative to Pearson
When variables are not normally distributed, when they are ordinal (e.g., a Likert scale with few points), or when the relationship is monotonic but not linear, a better choice is Spearman's rank correlation coefficient (rho).
Spearman's rho works by ranking the values of both variables and then computing Pearson's r on the ranks. Its advantages:
- Does not assume a normal distribution
- Robust to outliers
- Can detect any monotonic relationship (not just linear ones)
For example, if you are investigating the association between wish-list ranking and purchase satisfaction, Spearman's rho is the right choice because you are working with ordinal data.
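A minimal sketch with SciPy's `spearmanr`, applied to hypothetical ordinal data of exactly this kind:

```python
from scipy import stats

# Hypothetical ordinal data: wish-list rank (1 = top choice) and
# purchase satisfaction on a 5-point Likert scale
wishlist_rank = [1, 2, 3, 4, 5, 6, 7, 8]
satisfaction = [5, 5, 4, 4, 3, 3, 2, 2]

rho, p = stats.spearmanr(wishlist_rank, satisfaction)
print(f"rho = {rho:.2f}, p = {p:.4f}")  # strong negative monotonic association
```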
## The Correlation Matrix: How to Read It
When you have more than two variables, correlations are displayed in a correlation matrix. This is a table where the rows and columns are variables, and each cell contains the correlation coefficient for the corresponding pair.
Example: Imagine you are examining the relationships among self-esteem, anxiety, academic achievement, and social support in a sample of university students.
| | Self-esteem | Anxiety | Achievement | Social support |
|---|---|---|---|---|
| Self-esteem | 1.00 | -.42** | .35** | .51** |
| Anxiety | -.42** | 1.00 | -.28* | -.33** |
| Achievement | .35** | -.28* | 1.00 | .22* |
| Social support | .51** | -.33** | .22* | 1.00 |
How to read this matrix:
- The diagonal is always 1.00 (every variable correlates perfectly with itself)
- The matrix is symmetric (the correlation of A with B equals the correlation of B with A)
- Asterisks indicate statistical significance (* p < .05, ** p < .01)
- Negative values indicate an inverse association (e.g., higher anxiety, lower academic achievement)
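In practice you rarely assemble such a matrix by hand. A minimal pandas sketch (with simulated scores standing in for real questionnaire data) produces the full matrix in one call:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for real questionnaire scores (150 students)
df = pd.DataFrame({
    "self_esteem": rng.normal(30, 5, 150),
    "anxiety": rng.normal(20, 6, 150),
    "achievement": rng.normal(3.5, 0.5, 150),
    "social_support": rng.normal(25, 4, 150),
})

# Full Pearson correlation matrix; method="spearman" also works
# Note: .corr() gives no p-values; compute those per pair with scipy
print(df.corr(method="pearson").round(2))
```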
## Spurious Correlations: When Statistics Mislead
There is an entire website (tylervigen.com/spurious-correlations) dedicated to amusing correlations that are statistically significant but completely meaningless. Here are some classic examples.
Ice cream and drowning. Ice cream sales and the number of drowning deaths are highly correlated. Does this mean ice cream causes drowning? No. Both increase during summer because it is hot. Temperature is the third variable influencing both.
Nicolas Cage and swimming pools. The number of Nicolas Cage films released in a given year is strongly correlated with the number of drowning deaths in swimming pools, at approximately r = .87. This is a pure statistical artifact that emerges when you search through a sufficiently large number of variable pairs.
Chocolate and Nobel Prizes. A study published in the New England Journal of Medicine found a high correlation between per-capita chocolate consumption and the number of Nobel Prizes per country. This likely reflects national wealth (wealthier countries have both more chocolate and more scientists).
## The Third Variable Problem
Most spurious correlations arise from the third variable problem (also known as a confounding variable). A third variable is something that influences both variables you are observing, creating the illusion that they are causally connected.
A classic example from developmental psychology: there is a positive correlation between shoe size in children and the size of their vocabulary. Kids with bigger feet know more words! But that does not mean growing feet improves language skills. The third variable is age: older children have both bigger feet and richer vocabularies.
Another example: the correlation between the number of firefighters at a fire and the amount of property damage. More firefighters, more damage? No. Larger fires require more firefighters and cause more damage. Fire size is the third variable.
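The shoe-size example is easy to reproduce in simulation. In the sketch below (entirely made-up numbers), age drives both foot size and vocabulary: the raw correlation is large, but it shrinks dramatically once you hold age roughly constant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

age = rng.uniform(4, 12, size=500)                    # the confounder
shoe_size = 25 + 1.2 * age + rng.normal(0, 1, 500)    # driven by age
vocabulary = 500 * age + rng.normal(0, 400, 500)      # also driven by age

r_raw, _ = stats.pearsonr(shoe_size, vocabulary)
print(f"raw correlation: r = {r_raw:.2f}")  # large, but spurious

# Crude control for age: correlate within a narrow age band only
mask = (age >= 7.5) & (age <= 8.5)
r_ctrl, _ = stats.pearsonr(shoe_size[mask], vocabulary[mask])
print(f"within one age band: r = {r_ctrl:.2f}")  # much smaller
```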
## When Does Correlation Support Causation?
Correlation by itself never proves causation. However, there are conditions under which correlational data can provide a strong argument for a causal relationship.
1. Experimental design. If you randomly assigned participants to groups, manipulated the independent variable, and controlled for other factors, then a correlation found between variables does point toward a causal relationship. The experiment is the only design that allows causal conclusions. If you are planning an experiment, a good starting point is understanding how to write the methodology so that causal conclusions are well-supported.
2. Hill's criteria (for epidemiology). Bradford Hill proposed nine criteria that together can support a causal interpretation: strength of association, consistency, specificity, temporality, biological gradient, plausibility, coherence, experimental evidence, and analogy.
3. Temporal ordering. If you can demonstrate that a change in X always precedes a change in Y, that strengthens (but does not prove) the causal argument. Longitudinal studies are key here.
4. Mediation analysis. A more statistically sophisticated approach that tests pathways of influence and can distinguish direct from indirect effects.
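As a rough illustration of the mediation idea, here is a minimal Baron-Kenny-style sketch using statsmodels on simulated data (modern practice favors bootstrapped indirect effects, but the logic is the same): the total effect of X on Y splits into a direct part and a part routed through the mediator M.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300

x = rng.normal(size=n)            # predictor
m = 0.6 * x + rng.normal(size=n)  # mediator, influenced by x
y = 0.5 * m + rng.normal(size=n)  # outcome, influenced only via m

def fit(outcome, predictors):
    """OLS with an intercept; returns the fitted results object."""
    return sm.OLS(outcome, sm.add_constant(predictors)).fit()

total = fit(y, x).params[1]                          # total effect of x on y
direct = fit(y, np.column_stack([x, m])).params[1]   # direct effect, m controlled
indirect = total - direct                            # effect routed through m

print(f"total = {total:.2f}, direct = {direct:.2f}, indirect = {indirect:.2f}")
```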
## Practical Example: Self-Esteem and Academic Performance
Suppose you are conducting a study on the relationship between self-esteem and academic performance among psychology students. You have collected data from 150 students using the Rosenberg Self-Esteem Scale and their grade point average.
Result: r = .38, p < .001
What you can say:
- There is a moderate positive correlation between self-esteem and academic performance.
- Students with higher self-esteem tend to have higher GPAs.
- The coefficient of determination (r-squared = .14) indicates that self-esteem explains approximately 14% of the variance in academic performance.
What you cannot say:
- High self-esteem causes better academic performance.
- Improving self-esteem will lead to better grades.
Why not? Multiple explanations are possible:
- Academic success may boost self-esteem (reverse causation)
- Socioeconomic status may influence both (third variable)
- General competence may underlie both (third variable)
To claim that self-esteem causes better performance, you would need an experiment: randomly assign students to a self-esteem enhancement program and a control group, then compare their grades after a semester.
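When reporting a correlation like this, it also helps to attach a confidence interval rather than a bare r. A minimal sketch of the standard Fisher z-transformation approach for r = .38 and n = 150:

```python
import numpy as np
from scipy import stats

r, n = 0.38, 150

# Fisher z-transformation: z is approximately normal with SE = 1/sqrt(n - 3)
z = np.arctanh(r)
se = 1 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)  # 95% confidence level

lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")  # roughly [.23, .51]
```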
## Common Mistake
"The correlation is 0.8, therefore X causes Y."
This is by far the most common mistake in student research papers. A high correlation is tempting because it looks like clear evidence, but even a correlation of r = .99 does not prove causation without an appropriate research design.
This mistake appears in various forms:
- •"Results show that social media use decreases self-esteem (r = -.45)." Incorrect. Results show that social media use and self-esteem are negatively correlated.
- •"Stress causes insomnia (r = .52)." Incorrect. Stress and insomnia are positively correlated, but you cannot determine the direction of causation from a correlation alone.
- •"Screen time must be reduced because it causes depression (r = .33)." This is a policy conclusion drawn from correlational data.
How to write it correctly: Use the language of association, not causation. Instead of "causes" and "leads to," use "is associated with," "correlates with," "there is a tendency for."
## Correlation in the Context of Other Tests
Correlation is only one tool in your statistical toolkit. If you are interested not just in whether variables are related but also in whether groups differ, see the article on the t-test for comparing two groups. Understanding correlation is the foundation for more advanced analyses such as regression, mediation analysis, and structural equation modeling.
## Key Principles
- Correlation measures association, not causation.
- Pearson's r measures linear association; Spearman's rho is a nonparametric alternative for ordinal data or monotonic, nonlinear relationships.
- A correlation matrix provides a systematic overview of all variable pairs.
- The third variable is the most common cause of spurious correlations.
- Only experimental designs allow causal conclusions.
- Use Cohen's conventions for interpreting effect size (but with caution).
- Always use the language of association when reporting correlational findings.
## Try the Istrazimo Platform
Istrazimo automatically generates a correlation matrix with a heat map and marks statistically significant correlations. Instead of manually computing correlations among ten variables (that is 45 pairs!), the platform does it in one click, with significance markers and a visual display you can use directly in your paper. Get started.