Statistics - Scatter diagrams and correlation

Grade 11 IGCSE

Review the key concepts, formulae, and examples before starting your quiz.

🔑Concepts

Bivariate Data: Data consisting of paired values of two variables, collected to determine whether there is a relationship between them.

Scatter Diagram: A graph where individual data points are plotted to visualize the relationship between two variables.

Positive Correlation: As one variable increases, the other variable also tends to increase.

Negative Correlation: As one variable increases, the other variable tends to decrease.

Zero/No Correlation: No apparent relationship between the two variables; points are scattered randomly.

Strength of Correlation: Described as 'Strong' if points are close to a straight line, or 'Weak' if they are widely spread.

Line of Best Fit: A straight line drawn through the middle of the data points, used to make predictions.

Mean Point: The point $(\bar{x}, \bar{y})$ through which the line of best fit must always pass.

Interpolation: Predicting a value within the range of the given data (usually reliable).

Extrapolation: Predicting a value outside the range of the given data (often unreliable).

Correlation vs Causation: A correlation between two variables does not necessarily mean that one causes the other.

📐Formulae

Mean of x: $\bar{x} = \frac{\sum x}{n}$

Mean of y: $\bar{y} = \frac{\sum y}{n}$

Equation of the line of best fit: $y = mx + c$

💡Examples

Problem 1:

A student collects data on the temperature ($x$, in °C) and the number of ice creams sold ($y$). The data points are: (20, 50), (22, 60), (24, 75), (26, 90), (28, 105). Describe the correlation and calculate the mean point.

Solution:

Correlation: Strong Positive Correlation. Mean point: $\bar{x} = \frac{20+22+24+26+28}{5} = 24$, $\bar{y} = \frac{50+60+75+90+105}{5} = 76$. Mean point = $(24, 76)$.

Explanation:

The correlation is positive because as temperature increases, ice cream sales increase. It is strong because the points follow a clear linear path. The mean point is found by averaging the x-values and y-values separately.
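The mean-point calculation above can be checked with a short Python sketch. The function name `mean_point` is a hypothetical helper chosen for illustration; it simply averages the x-values and the y-values separately.

```python
# Data from Problem 1: (temperature in °C, ice creams sold)
data = [(20, 50), (22, 60), (24, 75), (26, 90), (28, 105)]

def mean_point(points):
    """Return (mean of x, mean of y) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n  # sum of x divided by n
    mean_y = sum(y for _, y in points) / n  # sum of y divided by n
    return mean_x, mean_y

print(mean_point(data))  # (24.0, 76.0)
```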

Problem 2:

Using a scatter diagram, a line of best fit is drawn for the relationship between hours spent gaming and exam scores. The equation is $y = -2x + 85$. If a student games for 10 hours, what is their predicted score? Is this interpolation if the data range was 0 to 8 hours?

Solution:

Predicted score: $y = -2(10) + 85 = 65$. This is Extrapolation.

Explanation:

Substitute $x = 10$ into the linear equation. Since 10 hours is outside the original data range (0 to 8 hours), the prediction is an extrapolation and may not be accurate.
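The same substitution and range check can be sketched in Python. The names `predict` and `prediction_type` are hypothetical helpers introduced here for illustration; the gradient, intercept, and data range are taken from Problem 2.

```python
def predict(x, m=-2, c=85):
    """Predicted y from the line of best fit y = mx + c."""
    return m * x + c

def prediction_type(x, x_min, x_max):
    """Classify a prediction by whether x lies inside the data range."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

print(predict(10))                # 65
print(prediction_type(10, 0, 8))  # extrapolation
```

A value such as 5 hours would fall inside the 0 to 8 hour range, so `prediction_type(5, 0, 8)` would report an interpolation instead.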

Problem 3:

Explain why a line of best fit should pass through the mean point $(\bar{x}, \bar{y})$.

Solution:

The mean point represents the 'center' of the bivariate data set.

Explanation:

Mathematically, the least squares regression line always passes through the mean point, because its intercept satisfies $c = \bar{y} - m\bar{x}$. Drawing a line of best fit by eye through $(\bar{x}, \bar{y})$ therefore keeps the line balanced among the data points.
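As a numerical check, the sketch below fits a least squares line to the Problem 1 data and confirms that it passes through the mean point. The notes do not give the least squares formulae, so the standard sum-based formulae for the gradient and intercept are used here as an assumption, and `least_squares_line` is a hypothetical helper name.

```python
import math

# Data from Problem 1: (temperature in °C, ice creams sold)
data = [(20, 50), (22, 60), (24, 75), (26, 90), (28, 105)]

def least_squares_line(points):
    """Return (m, c) for the least squares line y = mx + c."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c

m, c = least_squares_line(data)
mean_x = sum(x for x, _ in data) / len(data)
mean_y = sum(y for _, y in data) / len(data)

print(m, c)  # 7.0 -92.0
# The fitted line evaluated at the mean of x gives the mean of y.
print(math.isclose(m * mean_x + c, mean_y))  # True
```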