Correlation: The Relationship Between Bivariate Data
What is Bivariate Data?
Bivariate data involves pairs of values, typically written as (x, y), where each pair represents two measurements or observations for the same item or person. For example, a teacher might record for each student: (hours spent studying, test score). Each student gives one pair of data points. The first variable, often called the independent or explanatory variable (like study hours), is plotted on the x-axis. The second variable, the dependent or response variable (like test score), is plotted on the y-axis. The goal is to see if there is a relationship between them.
Visualizing Relationships with Scatter Plots
The best way to start understanding bivariate data is by creating a scatter plot. A scatter plot is a graph where each dot represents one pair of data. By looking at the overall pattern of dots, we can often see the relationship.
Let's imagine a small survey of 5 students and their study habits.
| Student | Study Hours (x) | Test Score (y) |
|---|---|---|
| Anna | 1 | 55 |
| Ben | 2 | 65 |
| Chloe | 3 | 75 |
| David | 4 | 85 |
| Ella | 5 | 95 |
If we plot these points, they fall exactly on a straight line sloping upwards, because each extra hour of study adds 10 points to the score. This visual pattern shows a perfect positive linear relationship in this small sample: as study hours increase, test scores also increase.
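To see the pattern without any plotting software, the table can be drawn as a rough text scatter plot. This is only a sketch: `ascii_scatter` is a hypothetical helper written for this example, and it assumes whole-number x values and y values spaced exactly 10 apart, which happens to hold for this table.

```python
# Study-hours data copied from the table above: (hours, score).
points = [(1, 55), (2, 65), (3, 75), (4, 85), (5, 95)]


def ascii_scatter(points, y_step=10):
    """Return a rough text scatter plot (hypothetical helper).

    Assumes integer x values and y values spaced exactly y_step apart.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    occupied = set(points)
    rows = []
    # Walk down the y-axis from the highest score to the lowest.
    for level in range(max(ys), min(ys) - 1, -y_step):
        cells = ["*" if (x, level) in occupied else " "
                 for x in range(min(xs), max(xs) + 1)]
        rows.append(f"{level:>3} | " + " ".join(cells).rstrip())
    # x-axis labels, aligned with the plotted columns.
    rows.append("      " + " ".join(str(x) for x in range(min(xs), max(xs) + 1)))
    return "\n".join(rows)


plot = ascii_scatter(points)
print(plot)
```

Each `*` marks one student. Reading from the bottom-left to the top-right, the marks climb one step per hour, which is the upward-sloping pattern described above.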
Types of Correlation: Direction and Strength
Correlation is described by two main characteristics: its direction and its strength.
Direction:
- Positive Correlation: As one variable increases, the other also increases. Example: Height and shoe size.
- Negative Correlation: As one variable increases, the other decreases. Example: The more time spent playing video games, the lower the test score might be.
- No Correlation (Zero): There is no apparent relationship between the two variables. Example: A person's shoe size and their intelligence.
Strength refers to how closely the points on a scatter plot follow a straight line. If all points lie exactly on a straight line, the correlation is perfect (a coefficient of +1 or -1). If the points are widely scattered with no clear pattern, the correlation is weak (a coefficient close to 0).
The Correlation Coefficient: Pearson's r
While a scatter plot gives a visual clue, the correlation coefficient[1], often symbolized by the letter $r$, gives us a precise numerical measure. The most common one is Pearson's correlation coefficient. Its value always lies between -1 and +1.
- $r = +1$: Perfect positive linear correlation.
- $r > 0$: Positive correlation.
- $r = 0$: No linear correlation.
- $r < 0$: Negative correlation.
- $r = -1$: Perfect negative linear correlation.
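The list above can be turned into a small lookup function. This is a sketch, not a standard: `describe_r` is a hypothetical helper, and the cutoffs of 0.7 and 0.3 for "strong" and "moderate" are assumptions chosen for illustration (different textbooks draw these lines in different places).

```python
def describe_r(r):
    """Describe a correlation coefficient in words (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:  # assumed cutoff, not taken from the text
        strength = "strong"
    elif size >= 0.3:  # assumed cutoff, not taken from the text
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear correlation"


for value in (1, 0.5, 0, -0.85, -1):
    print(value, "->", describe_r(value))
```

The direction comes only from the sign of $r$ and the strength only from its size, which mirrors the two characteristics described earlier.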
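The formula for Pearson's $r$ is:
$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$
Where: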
$x_i$ and $y_i$ are the individual data points,
$\bar{x}$ and $\bar{y}$ are the means (averages) of the x and y values,
$\sum$ means "the sum of".
This formula essentially measures how much the two variables change together (the numerator), divided by the product of how much each variable changes on its own (the denominator). Don't worry: in practice you usually use a calculator or software to compute it!
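As a minimal sketch of what such software does, the function below computes $r$ as the sum of the products of deviations divided by the square root of the product of the summed squared deviations. The name `pearson_r` and its error checks are choices made for this example, and the data are the study-hours table from earlier.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from paired lists of numbers (illustrative sketch)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: how much x and y vary together.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator parts: how much each variable varies on its own.
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


hours = [1, 2, 3, 4, 5]
scores = [55, 65, 75, 85, 95]
r_students = pearson_r(hours, scores)
print(r_students)
```

For the five students the result is 1, because their points lie exactly on a straight line. The zero-variation check matters because the formula would otherwise divide by zero.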
Step-by-Step Calculation with a Simple Example
Let's calculate $r$ for a tiny dataset: (x, y) = (1, 1), (2, 3), (3, 2). We'll follow the formula step-by-step.
Step 1: Find the means.
$\bar{x} = (1+2+3)/3 = 2$
$\bar{y} = (1+3+2)/3 = 2$
Step 2: Calculate deviations and products.
| x | y | $(x-\bar{x})$ | $(y-\bar{y})$ | $(x-\bar{x})(y-\bar{y})$ | $(x-\bar{x})^2$ | $(y-\bar{y})^2$ |
|---|---|---|---|---|---|---|
| 1 | 1 | -1 | -1 | 1 | 1 | 1 |
| 2 | 3 | 0 | 1 | 0 | 0 | 1 |
| 3 | 2 | 1 | 0 | 0 | 1 | 0 |
| Sum | | | | 1 | 2 | 2 |
Step 3: Plug into the formula.
$$ r = \frac{1}{\sqrt{2 \times 2}} = \frac{1}{\sqrt{4}} = \frac{1}{2} = 0.5 $$
We get $r = 0.5$. This indicates a moderate positive correlation. As x increased from 1 to 3, y generally increased, but not in a perfect straight line.
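The three steps can be retraced in a few lines of Python, printing each row of the deviation table along the way. This is a sketch for checking the arithmetic; the variable names are chosen for this example.

```python
import math

points = [(1, 1), (2, 3), (3, 2)]

# Step 1: the means.
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Step 2: deviations, products, and squares, row by row.
sum_xy = sum_xx = sum_yy = 0.0
for x, y in points:
    dx, dy = x - mean_x, y - mean_y
    print(x, y, dx, dy, dx * dy, dx ** 2, dy ** 2)
    sum_xy += dx * dy
    sum_xx += dx ** 2
    sum_yy += dy ** 2

# Step 3: plug the three sums into the formula.
r = sum_xy / math.sqrt(sum_xx * sum_yy)
print("sums:", sum_xy, sum_xx, sum_yy)
print("r =", r)
```

The three sums come out as 1, 2, and 2, matching the Sum row of the table, and the final value agrees with the hand calculation.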
Correlation in the Real World: Ice Cream Sales and Temperature
Let's look at a practical example most people can relate to: ice cream sales and daily temperature. Intuitively, we expect hotter days to lead to more ice cream sales. This would be a positive correlation. A shop owner might collect data over 10 days, recording the high temperature and the number of ice cream cones sold. Plotting this data would likely show an upward trend. If they calculated $r$, it might be around $0.8$, indicating a strong positive relationship. This information is useful! The owner can predict sales based on the weather forecast and manage inventory accordingly.
However, it's crucial to remember that correlation tells us two things change together; it does not tell us that one causes the other. This leads us to a critical warning.
Correlation does not imply causation. Example: There is a strong positive correlation between ice cream sales and shark attacks. Does eating ice cream cause shark attacks? No! The confounding variable[2] is the summer season and its hot weather. Hot weather leads more people to buy ice cream and also leads more people to swim in the ocean, which increases the chance of shark encounters.
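A small simulation can make the confounding idea concrete. Everything below is synthetic: the temperatures, the sales figures, the "shark encounter index", and the multipliers are made-up numbers for illustration, not real data. Both quantities are generated from temperature alone, so neither one influences the other.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r from paired lists of numbers (illustrative sketch)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


rng = random.Random(0)  # fixed seed so the run is repeatable
days = 200

# Synthetic daily high temperatures (made-up range, in degrees Celsius).
temperature = [rng.uniform(15, 35) for _ in range(days)]
# Each quantity depends on temperature plus its own random noise.
ice_cream_sales = [20 * t + rng.gauss(0, 40) for t in temperature]
shark_index = [0.3 * t + rng.gauss(0, 1) for t in temperature]

r_ice_shark = pearson_r(ice_cream_sales, shark_index)
print("ice cream vs shark index:", round(r_ice_shark, 2))
print("temperature vs ice cream:", round(pearson_r(temperature, ice_cream_sales), 2))
print("temperature vs shark index:", round(pearson_r(temperature, shark_index), 2))
```

Ice cream sales and the shark index come out clearly correlated even though the code never uses one to compute the other. The shared dependence on temperature is enough to produce the pattern.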
Important Questions
Q: If the correlation coefficient is 0, does that mean the two variables are unrelated?
Not necessarily. A correlation coefficient of 0 means there is no linear relationship. The variables could still have a strong non-linear relationship. For example, if you plot the points for the equation $y = x^2$ over x-values that are symmetric around zero, the linear correlation $r$ will be 0, even though there is a perfect parabolic relationship. The scatter plot would show a clear U-shape, not a random cloud.
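The $y = x^2$ case is easy to check numerically. The sketch below uses the x-values -3 to 3, an arbitrary symmetric range picked for this example, and a helper named `pearson_r` written here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from paired lists of numbers (illustrative sketch)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


xs = [-3, -2, -1, 0, 1, 2, 3]  # symmetric around zero
ys = [x ** 2 for x in xs]      # perfect parabola: 9, 4, 1, 0, 1, 4, 9

r_parabola = pearson_r(xs, ys)
print(r_parabola)
```

The positive and negative halves of the parabola cancel in the numerator, so $r$ is 0 even though y is completely determined by x.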
Q: Can correlation be used to make predictions?
Yes, but with caution. If a strong correlation exists, we can use one variable to make rough predictions or estimates about the other. This is the foundation for linear regression, which finds the "line of best fit" through the data points. For instance, if we know the correlation between study hours and scores is strong, a teacher might predict that a student who studied for 4 hours will score around a certain mark. However, correlation-based predictions are estimates, not certainties, as other factors can influence the outcome.
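As a sketch of how a line of best fit produces a prediction, the code below fits a least-squares line to the five-student table from earlier. The function name `fit_line` and the choice of 3.5 hours as the value to predict are assumptions made for this example.

```python
hours = [1, 2, 3, 4, 5]
scores = [55, 65, 75, 85, 95]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 3.5
print("slope:", slope, "intercept:", intercept)
print("predicted score for 3.5 hours:", predicted)
```

With this table the fitted line gains 10 points per hour of study, so a student who studied 3.5 hours is predicted to score about 80. Because the table is tiny and perfectly linear, real data would give a much less tidy fit and a less certain prediction.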
Q: What is the difference between correlation and association?
In everyday language, they are often used interchangeably. In statistics, "association" is a broader term meaning any relationship between variables. "Correlation" is more specific, usually referring to the strength and direction of a linear relationship measured by a coefficient like Pearson's r. So, all correlations are associations, but not all associations are (linear) correlations.
Footnotes
[1] Correlation Coefficient (r): A numerical measure, between -1 and +1, of the strength and direction of the linear relationship between two variables. A value of +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 indicates no linear relationship.
[2] Confounding Variable: A third, often unmeasured, variable that influences both the independent and dependent variables, creating a false impression of a direct causal relationship between them.
