Linear Regression - Lines of Regression (x on y, y on x)

Grade 12 ICSE

Review the key concepts, formulae, and examples before starting your quiz.

🔑Concepts

Definition of Regression Lines: Regression lines are the 'best-fit' straight lines that describe the linear relationship between two variables, $x$ and $y$. On a scatter plot, each line is positioned to minimize the sum of the squared deviations of the data points from it, measured in one fixed direction (vertical for $y$ on $x$, horizontal for $x$ on $y$).

Line of Regression of $y$ on $x$: This line is used to estimate or predict the value of the dependent variable $y$ for a given value of the independent variable $x$. Visually, this line minimizes the sum of the squares of the vertical deviations (distances parallel to the $y$-axis) between the observed points and the line.

Line of Regression of $x$ on $y$: This line is used to estimate or predict the value of $x$ for a given value of $y$; here $x$ is treated as the dependent variable and $y$ as the independent variable. Visually, this line minimizes the sum of the squares of the horizontal deviations (distances parallel to the $x$-axis) between the observed points and the line.

The Centroid (Point of Intersection): Both regression lines always pass through the point $(\bar{x}, \bar{y})$, where $\bar{x}$ is the mean of the $x$-values and $\bar{y}$ is the mean of the $y$-values. On a graph, this point acts as the pivot or balance point for both lines.

Regression Coefficients: The regression coefficients $b_{yx}$ ($y$ on $x$) and $b_{xy}$ ($x$ on $y$) give the change in one variable per unit change in the other: $b_{yx}$ is the change in $y$ per unit change in $x$ along the $y$ on $x$ line, and $b_{xy}$ is the change in $x$ per unit change in $y$ along the $x$ on $y$ line. A key property is that both coefficients must have the same sign (either both positive or both negative), which is also the sign of the correlation coefficient $r$.

Correlation and the Angle between Lines: The geometric angle $\theta$ between the two regression lines indicates the strength of the correlation. If $r = \pm 1$, the lines coincide (the angle is $0^{\circ}$), representing perfect correlation. If $r = 0$, the lines are perpendicular (intersecting at $90^{\circ}$), indicating no linear correlation.
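The two limiting cases above can be checked numerically. The sketch below is one possible way to compute the acute angle from the two regression coefficients; the direction-vector approach and the function name `angle_between_regression_lines` are illustrative choices, not part of the syllabus formulae.

```python
from math import atan2, degrees


def angle_between_regression_lines(b_yx, b_xy):
    """Acute angle (in degrees) between the two regression lines.

    Direction vectors in the (x, y) plane:
      y on x: a step of 1 in x changes y by b_yx  -> (1, b_yx)
      x on y: a step of 1 in y changes x by b_xy  -> (b_xy, 1)
    """
    cross = 1 - b_yx * b_xy   # equals 1 - r^2
    dot = b_xy + b_yx
    # abs() on both parts keeps the result in the range 0 to 90 degrees.
    return degrees(atan2(abs(cross), abs(dot)))


# r = 0: both coefficients are zero, so the lines are perpendicular.
print(angle_between_regression_lines(0.0, 0.0))

# r = +1 or -1: the product of the coefficients is 1, so the lines coincide.
print(angle_between_regression_lines(0.5, 2.0))

# An intermediate case (the coefficients of a moderately correlated data set).
print(angle_between_regression_lines(0.9, 0.4))
```

Using direction vectors rather than slopes avoids a division by zero when $b_{xy} = 0$, where the $x$ on $y$ line is vertical.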

The Geometric Mean Property: The correlation coefficient $r$ is the geometric mean of the two regression coefficients. This is expressed as $r^2 = b_{yx} \cdot b_{xy}$. Because $r^2 \leq 1$, it follows that the product of the two regression coefficients can never exceed 1.
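A quick numerical check of this property, as a minimal sketch on a small made-up data set (the values of `xs` and `ys` are hypothetical and chosen only to keep the arithmetic simple):

```python
from math import isclose, sqrt

# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of products and squares of deviations from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b_yx = s_xy / s_xx                 # regression coefficient of y on x
b_xy = s_xy / s_yy                 # regression coefficient of x on y
r = s_xy / sqrt(s_xx * s_yy)       # correlation coefficient, computed directly

print(b_yx, b_xy, r)
print(isclose(r * r, b_yx * b_xy))  # r^2 equals the product of the coefficients
print(b_yx * b_xy <= 1)             # the product never exceeds 1
```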

Estimation Validity: When predicting $y$, always use the $y$ on $x$ line; when predicting $x$, always use the $x$ on $y$ line. Using the wrong line for prediction results in higher estimation error.
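The effect of using the wrong line can be seen on a small example. The sketch below uses hypothetical data and compares the sum of squared errors in $y$ when $y$ is predicted from each of the two lines; it assumes $b_{xy} \neq 0$ so that the $x$ on $y$ line can be solved for $y$.

```python
# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b_yx = s_xy / sum((x - x_bar) ** 2 for x in xs)
b_xy = s_xy / sum((y - y_bar) ** 2 for y in ys)


def y_from_y_on_x(x):
    """Predict y with the correct line: y - y_bar = b_yx (x - x_bar)."""
    return y_bar + b_yx * (x - x_bar)


def y_from_x_on_y(x):
    """Predict y by rearranging the x on y line (the wrong line for this job)."""
    return y_bar + (x - x_bar) / b_xy


sse_right = sum((y - y_from_y_on_x(x)) ** 2 for x, y in zip(xs, ys))
sse_wrong = sum((y - y_from_x_on_y(x)) ** 2 for x, y in zip(xs, ys))

print(sse_right, sse_wrong)
print(sse_right < sse_wrong)
```

For this data set the $y$ on $x$ line gives the smaller total squared error in $y$, as expected from the fact that it is the line that minimizes the vertical deviations.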

📐Formulae

Line of regression of $y$ on $x$: $y - \bar{y} = b_{yx}(x - \bar{x})$

Line of regression of $x$ on $y$: $x - \bar{x} = b_{xy}(y - \bar{y})$

Regression coefficient $b_{yx} = r \frac{\sigma_y}{\sigma_x}$

Regression coefficient $b_{xy} = r \frac{\sigma_x}{\sigma_y}$

Correlation coefficient: $r = \pm \sqrt{b_{yx} \cdot b_{xy}}$, where $r$ takes the common sign of $b_{yx}$ and $b_{xy}$

Standard calculation for $b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{N \sum xy - (\sum x)(\sum y)}{N \sum x^2 - (\sum x)^2}$

Standard calculation for $b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{N \sum xy - (\sum x)(\sum y)}{N \sum y^2 - (\sum y)^2}$

Covariance: $\text{Cov}(x,y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{N}$
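The raw-sum forms of $b_{yx}$ and $b_{xy}$ translate directly into code. The following is a minimal sketch; the function name `regression_coefficients` and the sample data are hypothetical, and the function assumes neither variable is constant (otherwise a denominator is zero).

```python
def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) using the raw-sum formulae."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y        # shared by both coefficients
    b_yx = numerator / (n * sum_x2 - sum_x ** 2)
    b_xy = numerator / (n * sum_y2 - sum_y ** 2)
    return b_yx, b_xy


# Hypothetical sample data, for illustration only.
xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]

b_yx, b_xy = regression_coefficients(xs, ys)

# Cross-check b_yx against the deviation form of the same formula.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

print(b_yx, b_xy)
print(abs(b_yx - s_xy / s_xx) < 1e-12)
```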

💡Examples

Problem 1:

Given the following data: Mean of $x = 40$, Mean of $y = 50$, Standard deviation of $x = 2$, Standard deviation of $y = 3$, and Correlation coefficient $r = 0.6$. Find the two regression lines and estimate $y$ when $x = 42$.

Solution:

  1. Find $b_{yx}$: $b_{yx} = r \frac{\sigma_y}{\sigma_x} = 0.6 \cdot \frac{3}{2} = 0.9$.
  2. Find $b_{xy}$: $b_{xy} = r \frac{\sigma_x}{\sigma_y} = 0.6 \cdot \frac{2}{3} = 0.4$.
  3. Equation of $y$ on $x$: $y - 50 = 0.9(x - 40) \Rightarrow y = 0.9x - 36 + 50 \Rightarrow y = 0.9x + 14$.
  4. Equation of $x$ on $y$: $x - 40 = 0.4(y - 50) \Rightarrow x = 0.4y - 20 + 40 \Rightarrow x = 0.4y + 20$.
  5. Estimate $y$ for $x = 42$: Using the $y$ on $x$ line, $y = 0.9(42) + 14 = 37.8 + 14 = 51.8$.

Explanation:

We first calculate the regression coefficients using the standard deviations and correlation. Then we use the point-slope form with the means $(\bar{x}, \bar{y})$ to derive the linear equations. Finally, we use the $y$ on $x$ line for prediction since $x$ is given.
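The same steps can be reproduced in a few lines of Python. This is only a sketch of the calculation in Problem 1; the helper names `predict_y` and `predict_x` are illustrative.

```python
# Data given in Problem 1.
x_bar, y_bar = 40, 50
sigma_x, sigma_y = 2, 3
r = 0.6

# Regression coefficients from r and the standard deviations.
b_yx = r * sigma_y / sigma_x
b_xy = r * sigma_x / sigma_y


def predict_y(x):
    """y on x line in point-slope form: y - y_bar = b_yx (x - x_bar)."""
    return y_bar + b_yx * (x - x_bar)


def predict_x(y):
    """x on y line in point-slope form: x - x_bar = b_xy (y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)


print(round(b_yx, 2), round(b_xy, 2))   # 0.9 0.4
print(round(predict_y(42), 2))          # 51.8
```

Rounding is applied only for display, since floating-point arithmetic may leave tiny errors in the last digits.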

Problem 2:

The two lines of regression are $x + 2y - 5 = 0$ and $2x + 3y - 8 = 0$. Find the mean values of $x$ and $y$, and the correlation coefficient $r$.

Solution:

  1. Find Means: Solve the equations simultaneously: $x + 2y = 5$ (i) and $2x + 3y = 8$ (ii). Multiply (i) by 2: $2x + 4y = 10$. Subtract (ii) from this: $(2x - 2x) + (4y - 3y) = 10 - 8 \Rightarrow y = 2$. Substitute $y = 2$ in (i): $x + 2(2) = 5 \Rightarrow x = 1$. So, $\bar{x} = 1$, $\bar{y} = 2$.
  2. Find $r$: Assume $x + 2y - 5 = 0$ is the line $y$ on $x$. $2y = -x + 5 \Rightarrow y = -\frac{1}{2}x + 2.5 \Rightarrow b_{yx} = -0.5$. Then $2x + 3y - 8 = 0$ must be $x$ on $y$. $2x = -3y + 8 \Rightarrow x = -\frac{3}{2}y + 4 \Rightarrow b_{xy} = -1.5$.
  3. Check validity: $b_{yx} \cdot b_{xy} = (-0.5) \cdot (-1.5) = 0.75$. Since $0.75 \leq 1$, our assumption is correct. (The opposite assignment would give $b_{xy} = -2$ and $b_{yx} = -\frac{2}{3}$, whose product $\frac{4}{3}$ exceeds 1, which is impossible.)
  4. Calculate $r$: $r = -\sqrt{0.75} \approx -0.866$ (negative because both $b$ values are negative).

Explanation:

The means are found at the intersection of the two lines. To find $r$, we assume which line is which, calculate the slopes, and verify that their product is $\leq 1$. The sign of $r$ matches the sign of the slopes.
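The procedure in Problem 2 can be sketched in code as follows. Each line is stored as a tuple `(a, b, c)` meaning $ax + by + c = 0$; this representation and the function names are illustrative assumptions, and the sketch assumes the two lines are distinct and that neither is parallel to an axis.

```python
from math import sqrt

# The two regression lines from Problem 2, as (a, b, c) for a*x + b*y + c = 0.
line1 = (1, 2, -5)
line2 = (2, 3, -8)


def intersection(l1, l2):
    """Solve the two equations simultaneously (Cramer's rule)."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    x = (b1 * c2 - b2 * c1) / det
    y = (a2 * c1 - a1 * c2) / det
    return x, y


def identify_and_correlate(l1, l2):
    """Try both assignments and keep the one with b_yx * b_xy <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx = -y_on_x[0] / y_on_x[1]   # solve for y: coefficient of x
        b_xy = -x_on_y[1] / x_on_y[0]   # solve for x: coefficient of y
        product = b_yx * b_xy
        if 0 <= product <= 1:
            sign = -1 if b_yx < 0 else 1   # r shares the sign of the coefficients
            return b_yx, b_xy, sign * sqrt(product)
    raise ValueError("these cannot be a valid pair of regression lines")


x_bar, y_bar = intersection(line1, line2)
b_yx, b_xy, r = identify_and_correlate(line1, line2)

print(x_bar, y_bar)      # 1.0 2.0
print(b_yx, b_xy)        # -0.5 -1.5
print(round(r, 3))       # -0.866
```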