Statistics and Probability - Bivariate Statistics (Correlation and Linear Regression)

Grade 12 IB

Review the key concepts, formulae, and examples before starting your quiz.

Concepts

• Scatter Diagrams: A visual representation of bivariate data in which the independent variable ($x$) is plotted on the horizontal axis and the dependent variable ($y$) on the vertical axis. The pattern of points reveals the nature of the relationship: a 'cigar-shaped' cluster suggests a linear correlation, while a random cloud of points suggests no correlation.

• Pearson's Product-Moment Correlation Coefficient ($r$): A numerical measure of the strength and direction of a linear relationship between two variables. It ranges from $-1$ to $+1$, where $+1$ indicates a perfect positive linear correlation (points lying exactly on a line with a positive gradient), $-1$ indicates a perfect negative linear correlation (points lying exactly on a line with a negative gradient), and $0$ indicates no linear correlation.

• Interpretation of $r$ Strength: As a rule of thumb, $|r| > 0.75$ is considered a strong correlation, $0.5 < |r| < 0.75$ is moderate, and $|r| < 0.5$ is weak. Visually, a strong correlation means the points in a scatter plot lie very close to a straight line, whereas a weak correlation shows points more widely dispersed.

• The Mean Point: Every regression line of $y$ on $x$ passes through the mean point $(\bar{x}, \bar{y})$. Visually, this point acts as a 'pivot' or 'centroid' for the data set, and plotting it on a scatter diagram helps verify that a calculated regression line is positioned correctly.

• Least Squares Regression Line ($y$ on $x$): The line of best fit, written $y = ax + b$ (or $y = mx + c$), that minimizes the sum of the squares of the vertical distances (residuals) between each data point and the line. It is used specifically to predict the value of $y$ for a given $x$.

• Interpolation and Extrapolation: Interpolation is making a prediction within the range of the original $x$-values, which is generally reliable if the correlation is strong. Extrapolation is predicting values outside the range of the data; this is risky and often unreliable because the linear trend may not continue beyond the observed values.

• Correlation vs. Causation: A high correlation between two variables does not necessarily mean that changes in one variable cause changes in the other. There may be a 'lurking variable' influencing both, or the relationship may be purely coincidental.
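The rule-of-thumb bands for $|r|$ listed above can be sketched as a small Python helper. The function name `correlation_strength` is hypothetical, and the handling of the exact boundary values is an assumption: the notes do not say where $|r| = 0.75$ or $|r| = 0.5$ belong, so this sketch places them in the lower band.

```python
def correlation_strength(r: float) -> str:
    """Describe r using the rule-of-thumb bands from the notes.

    Assumption: the boundary values |r| = 0.75 and |r| = 0.5 are not
    covered by the notes, so they fall into the lower band here.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"

    size = abs(r)
    if size > 0.75:
        strength = "strong"
    elif size > 0.5:
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(correlation_strength(0.92))   # strong positive
print(correlation_strength(-0.6))   # moderate negative
print(correlation_strength(0.3))    # weak positive
```

A label like this describes only how closely the points follow a straight line; it says nothing about whether one variable causes the other.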

Formulae

Mean of $x$: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Mean of $y$: $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Pearson's Correlation Coefficient: $r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy}$ is the covariance of $x$ and $y$, and $s_x$ and $s_y$ are their standard deviations

Linear Regression Equation: $y = ax + b$

Gradient of Regression Line: $a = \frac{s_{xy}}{s_x^2}$

Equation using the mean point: $y - \bar{y} = a(x - \bar{x})$
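The formulae above can be combined into one short Python function. This is a minimal sketch under stated assumptions: the name `regression_summary` and the sample data are hypothetical, and the covariance and variances divide by $n$ (dividing by $n - 1$ instead would give the same $r$ and $a$, because the factor cancels).

```python
from math import sqrt


def regression_summary(xs, ys):
    """Return the means, r, gradient a and intercept b for y on x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Covariance and variances, dividing by n (an assumption, see lead-in).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    var_y = sum((y - y_bar) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("x and y must each take at least two distinct values")

    r = s_xy / (sqrt(var_x) * sqrt(var_y))   # r = s_xy / (s_x * s_y)
    a = s_xy / var_x                         # a = s_xy / s_x^2
    b = y_bar - a * x_bar                    # line passes through the mean point
    return {"x_bar": x_bar, "y_bar": y_bar, "r": r, "a": a, "b": b}


# Hypothetical data, chosen only to exercise the function.
summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(name, round(value, 4))
```

The intercept is obtained by rearranging the mean-point equation to $b = \bar{y} - a\bar{x}$, which is why the fitted line always passes through $(\bar{x}, \bar{y})$.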

Examples

Problem 1:

The number of hours spent studying ($x$) and the test score achieved ($y$) are recorded for 5 students: $(2, 40), (4, 60), (6, 70), (8, 90), (10, 95)$. Calculate the mean point $(\bar{x}, \bar{y})$ and the equation of the regression line of $y$ on $x$, given that $a = 7$.

Solution:

  1. Calculate the mean of $x$: $\bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6$.
  2. Calculate the mean of $y$: $\bar{y} = \frac{40 + 60 + 70 + 90 + 95}{5} = \frac{355}{5} = 71$.
  3. Use the point-slope form with the mean point $(6, 71)$ and $a = 7$: $y - 71 = 7(x - 6)$, so $y - 71 = 7x - 42$, giving $y = 7x + 29$.

Explanation:

The mean point is the pair of averages $(\bar{x}, \bar{y})$ and always lies on the regression line. Substituting it into the point-slope form and rearranging gives the $y$-intercept $b$.
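As a check, the gradient can be recomputed from the five data points rather than taken as given. The sketch below is plain Python with hypothetical variable names; for these points the least-squares fit works out to a gradient of 7 and an intercept of 29, and the residuals sum to zero, as they must for a line through the mean point.

```python
# Data from Problem 1: hours studied (x) and test score (y).
hours = [2, 4, 6, 8, 10]
scores = [40, 60, 70, 90, 95]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sums of cross and squared deviations; their ratio equals s_xy / s_x^2.
sum_xy_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sum_xx_dev = sum((x - x_bar) ** 2 for x in hours)

a = sum_xy_dev / sum_xx_dev   # gradient
b = y_bar - a * x_bar         # intercept from the mean point

# Vertical distances between each observed score and the line.
residuals = [y - (a * x + b) for x, y in zip(hours, scores)]

print(f"mean point: ({x_bar}, {y_bar})")       # mean point: (6.0, 71.0)
print(f"y = {a}x + {b}")                       # y = 7.0x + 29.0
print(f"sum of residuals: {sum(residuals)}")   # sum of residuals: 0.0
```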

Problem 2:

A dataset has a correlation coefficient of $r = 0.92$ and a regression line $y = 2.5x + 10$. If the $x$-values in the data range from $5$ to $50$, predict the value of $y$ when $x = 12$ and discuss the reliability.

Solution:

  1. Substitute $x = 12$ into the equation: $y = 2.5(12) + 10 = 30 + 10 = 40$.
  2. Reliability: Since $x = 12$ lies within the range $[5, 50]$, this is interpolation. Combined with a strong correlation ($r = 0.92$), the prediction can be considered reliable.

Explanation:

Predictions are evaluated on two criteria: whether the prediction is an interpolation or an extrapolation, and the strength of the correlation coefficient $r$.
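The first of these criteria can be sketched in Python. The function name `predict` and its parameters are hypothetical; it evaluates the line $y = ax + b$ and reports whether the chosen $x$ lies inside the observed range. The second call uses $x = 60$, a value invented here only to show the extrapolation case.

```python
def predict(a, b, x, x_min, x_max):
    """Evaluate y = a*x + b and classify the prediction by the x-range."""
    y = a * x + b
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


# Problem 2: y = 2.5x + 10, with x-values observed between 5 and 50.
print(predict(2.5, 10, 12, 5, 50))   # (40.0, 'interpolation')

# Hypothetical x outside the observed range.
print(predict(2.5, 10, 60, 5, 50))   # (160.0, 'extrapolation')
```

The range check alone does not make a prediction trustworthy; it still needs to be read alongside the strength of $r$.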