Statistics and Probability - Bivariate Statistics (Correlation and Linear Regression)

Grade 11 IB

Review the key concepts, formulae, and examples before starting your quiz.

🔑 Concepts

• Bivariate Data and Scatter Diagrams: Bivariate data consists of pairs of linked variables $(x, y)$, where $x$ is the independent (explanatory) variable and $y$ is the dependent (response) variable. Visually, this is represented on a scatter diagram where $x$ is plotted on the horizontal axis and $y$ on the vertical axis. Each point $(x_i, y_i)$ represents an individual observation, allowing us to see patterns, trends, or clusters in the relationship between the two variables.

• Correlation Direction and Strength: Correlation describes the nature of the linear relationship between variables. Visually, a positive correlation appears as a 'cloud' of points sloping upwards from the bottom-left to the top-right, indicating that as $x$ increases, $y$ tends to increase. A negative correlation slopes downwards from top-left to bottom-right, indicating that as $x$ increases, $y$ tends to decrease. The 'strength' refers to how closely the points cluster around a straight line; a tight, narrow corridor of points indicates a strong correlation, while a widely dispersed cloud indicates a weak correlation.

• Pearson's Product-Moment Correlation Coefficient ($r$): This numerical value measures the strength and direction of a linear relationship, ranging from $-1$ to $+1$. A value of $r = 1$ indicates a perfect positive linear correlation (all points lie exactly on a line with a positive slope), $r = -1$ indicates a perfect negative linear correlation, and $r = 0$ indicates no linear correlation. Visually, as $r$ moves closer to $0$, the scatter plot appears more like a random 'blob' without a discernible linear path.

• The Mean Point (Centroid): In any bivariate dataset, the mean point is defined by the coordinates $(\bar{x}, \bar{y})$. This point is the geometric center of the data. A key visual and mathematical property is that the least squares regression line must always pass through $(\bar{x}, \bar{y})$. On a graph, you can identify this as the 'balance point' of the scatter plot.

• The Line of Best Fit (Least Squares Regression): This is the unique line that minimizes the sum of the squares of the vertical distances (known as residuals) between the data points and the line itself. The equation is typically written as $y = ax + b$ (or $y = mx + c$). Visually, it represents the 'trend line' that best averages out the distribution of points. It is specifically used to predict values of $y$ given values of $x$.

• Interpretation of Regression Coefficients: In the regression equation $y = ax + b$, the gradient $a$ represents the predicted change in the dependent variable $y$ for every one-unit increase in the independent variable $x$. The $y$-intercept $b$ represents the predicted value of $y$ when $x = 0$. Visually, $a$ is the steepness of the line, and $b$ is where the line crosses the vertical axis.

• Interpolation vs. Extrapolation: Interpolation is the process of predicting a $y$-value for an $x$-value that falls within the original range of the data; this is generally considered reliable when the correlation is strong. Extrapolation involves predicting $y$ for an $x$-value outside the data range. Visually, this means extending the regression line beyond the first or last known data point. Extrapolation is often unreliable because the linear trend may not continue outside the observed range.
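The least squares idea and the mean point property can be checked numerically. The sketch below is illustrative only: the five data points and the function names are made up for this example, and the sums of deviations are left undivided by $n$ because the factor cancels in the gradient. It fits the line, confirms that it passes through $(\bar{x}, \bar{y})$, and shows that nudging the gradient or intercept increases the sum of squared residuals.

```python
# Illustrative data only: these five points are made up for this sketch.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]


def mean_point(points):
    """Return the mean point (x_bar, y_bar)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return x_bar, y_bar


def least_squares_line(points):
    """Return (a, b) for the least squares line y = ax + b."""
    x_bar, y_bar = mean_point(points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    a = s_xy / s_xx        # gradient
    b = y_bar - a * x_bar  # intercept chosen so the line passes through the mean point
    return a, b


def sum_squared_residuals(points, a, b):
    """Sum of squared vertical distances from each point to y = ax + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)


a, b = least_squares_line(points)
x_bar, y_bar = mean_point(points)

print(f"line: y = {a:.2f}x + {b:.2f}")
print("passes through mean point:", abs((a * x_bar + b) - y_bar) < 1e-9)

# Any other line should give a larger sum of squared residuals.
best = sum_squared_residuals(points, a, b)
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    other = sum_squared_residuals(points, a + da, b + db)
    print(f"a={a + da:.2f}, b={b + db:.2f}: SSR {other:.3f} (least squares: {best:.3f})")
```

In an exam setting the GDC performs this fit directly; the code only makes the 'minimize the squared residuals' definition concrete.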

📐 Formulae

Mean of $x$: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Mean of $y$: $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Equation of the Regression Line: $y = ax + b$

Point-Slope form using the Mean Point: $y - \bar{y} = a(x - \bar{x})$

Pearson's Correlation Coefficient: $r = \frac{S_{xy}}{S_x S_y}$, where $S_{xy}$ is the covariance of $x$ and $y$, and $S_x$ and $S_y$ are their standard deviations (Note: Usually calculated via GDC in IB Math)

The Coefficient of Determination: $r^2$ (Represents the proportion of variance in $y$ explained by $x$)
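For reference, the quantities a GDC computes can be written out in full. This is a sketch under one assumption: $S_{xy}$ is taken to be the covariance and $S_x$, $S_y$ the standard deviations, each with a $\frac{1}{n}$ factor. Using $\frac{1}{n-1}$ throughout gives the same $a$ and $r$, because the factor cancels.

```latex
\begin{aligned}
S_{xy} &= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \\
S_x &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad
S_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}, \\
a &= \frac{S_{xy}}{S_x^{2}}, \qquad b = \bar{y} - a\bar{x}, \qquad
r = \frac{S_{xy}}{S_x S_y}.
\end{aligned}
```

The expression for $b$ is the point-slope form rearranged: $y - \bar{y} = a(x - \bar{x})$ gives $y = ax + (\bar{y} - a\bar{x})$.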

💡 Examples

Problem 1:

A student studies for various numbers of hours ($x$) and receives the following test scores ($y$): $(2, 40), (4, 60), (6, 75), (8, 90), (10, 95)$. Calculate the mean point $(\bar{x}, \bar{y})$ and determine the equation of the regression line of $y$ on $x$ using a GDC. Predict the score for a student who studies for 7 hours.

Solution:

Step 1: Calculate the mean of $x$: $\bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = \frac{30}{5} = 6$.

Step 2: Calculate the mean of $y$: $\bar{y} = \frac{40 + 60 + 75 + 90 + 95}{5} = \frac{360}{5} = 72$. The mean point is $(6, 72)$.

Step 3: Using a GDC for linear regression, we find the coefficients $a = 7$ and $b = 30$. Thus, the equation is $y = 7x + 30$. As a check, the line passes through the mean point: $7(6) + 30 = 72$.

Step 4: Predict for $x = 7$: $y = 7(7) + 30 = 49 + 30 = 79$. Since 7 hours lies within the data range of 2 to 10 hours, this is interpolation.

Explanation:

First, we find the average of both variables to locate the centroid. Then, we use the least squares method (via GDC) to find the slope and intercept. Finally, we substitute $x = 7$ into our regression model to find the predicted score.
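As a cross-check on the GDC step, the least squares coefficients for Problem 1 can be recomputed by hand from the five data points. The sketch below uses only the data given in the problem; the variable and function names are illustrative choices, not part of the original solution.

```python
from math import sqrt

# Data from Problem 1: (hours studied, test score).
data = [(2, 40), (4, 60), (6, 75), (8, 90), (10, 95)]

n = len(data)
x_bar = sum(x for x, _ in data) / n  # 6.0
y_bar = sum(y for _, y in data) / n  # 72.0

# Sums of squared deviations and of cross-products of deviations.
s_xx = sum((x - x_bar) ** 2 for x, _ in data)           # 40.0
s_yy = sum((y - y_bar) ** 2 for _, y in data)           # 2030.0
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)  # 280.0

a = s_xy / s_xx               # gradient: 7.0
b = y_bar - a * x_bar         # intercept: 30.0
r = s_xy / sqrt(s_xx * s_yy)  # Pearson's r, roughly 0.98


def predict_score(hours):
    """Predicted test score from the fitted line y = ax + b."""
    return a * hours + b


print(f"mean point: ({x_bar}, {y_bar})")
print(f"regression line: y = {a}x + {b}")
print(f"r = {r:.3f}")
print(f"predicted score for 7 hours: {predict_score(7)}")  # 79.0
```

The value of $r$ close to $1$ supports using the line for predictions inside the observed range of 2 to 10 hours.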

Problem 2:

For a dataset comparing temperature ($x$) and ice cream sales ($y$), the correlation coefficient is found to be $r = 0.92$. The regression line is $y = 15x + 120$. If the data was collected for temperatures between $10^{\circ}C$ and $30^{\circ}C$, discuss the reliability of predicting sales at $25^{\circ}C$ and $45^{\circ}C$.

Solution:

Step 1: Analyze $r = 0.92$. This indicates a very strong positive linear correlation, meaning the regression line is a good fit for the data.

Step 2: Evaluate $x = 25^{\circ}C$. Since 25 is within the range $[10, 30]$, this is interpolation. The prediction $y = 15(25) + 120 = 495$ is likely to be reliable.

Step 3: Evaluate $x = 45^{\circ}C$. Since 45 is outside the range $[10, 30]$, this is extrapolation. Even though $y = 15(45) + 120 = 795$, this prediction is unreliable because we cannot be certain the linear trend continues at such high temperatures.

Explanation:

Reliability depends on two factors: the strength of $r$ and whether we are interpolating or extrapolating. While the strong $r$ suggests a good model, the prediction at $45^{\circ}C$ is risky because it lies far beyond the observed data range.
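The interpolation check in Problem 2 is mechanical enough to sketch in code. The example below assumes the regression line and temperature range stated in the problem; the function name `predict_sales` and the constants are hypothetical, introduced only for illustration.

```python
# Observed temperature range in degrees Celsius (from Problem 2).
X_MIN, X_MAX = 10, 30


def predict_sales(temp_c):
    """Return (predicted sales, prediction type) using y = 15x + 120."""
    sales = 15 * temp_c + 120
    kind = "interpolation" if X_MIN <= temp_c <= X_MAX else "extrapolation"
    return sales, kind


for temp in (25, 45):
    sales, kind = predict_sales(temp)
    print(f"{temp} degrees C: predicted sales {sales} ({kind})")
```

Labelling each prediction this way keeps the range check attached to the number, so an extrapolated value such as 795 is not quoted without its caveat.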