Linear Regression - Method of Least Squares

Grade 12 ICSE

Review the key concepts, formulae, and examples before starting your quiz.

🔑Concepts

Definition of Bivariate Data and Scatter Plots: Linear regression analyzes the relationship between two variables, an independent variable $x$ and a dependent variable $y$. Visually, this is represented by a scatter plot where the data points $(x, y)$ are plotted on a Cartesian plane. If the points cluster around a straight line, a linear relationship exists.

The Method of Least Squares: This is a mathematical technique used to find the 'line of best fit' by minimizing the sum of the squares of the vertical deviations (residuals) between each observed data point and the line. On a graph, these residuals are the vertical segments connecting the points to the regression line.
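The minimizing property can be checked numerically. The sketch below is illustrative only: the helper name `sum_squared_residuals` is hypothetical, the data and the line $y = 0.9x + 1.3$ are taken from Problem 1 of this sheet, and the two comparison lines are arbitrary choices.

```python
def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical deviations of the points from y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Data from Problem 1 of this sheet.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

# Least-squares line from Problem 1 versus two arbitrary nearby lines.
best = sum_squared_residuals(xs, ys, 0.9, 1.3)
steeper = sum_squared_residuals(xs, ys, 1.0, 1.0)
flatter = sum_squared_residuals(xs, ys, 0.8, 1.6)

print(best, steeper, flatter)
```

Of the three lines compared, the least-squares line gives the smallest total; any other choice of slope and intercept would also give a larger sum.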

Regression Line of $y$ on $x$: This line is used to estimate the value of $y$ for a given value of $x$. Visually, it is the straight line that minimizes the sum of the squares of the vertical distances from the points to the line. Its slope, $b_{yx}$, indicates how many units $y$ changes for every unit change in $x$.

Regression Line of $x$ on $y$: This line is used to estimate the value of $x$ for a given value of $y$. It minimizes the sum of the squares of the horizontal distances from the points to the line. The two regression lines coincide only when the correlation is perfect ($r = \pm 1$); otherwise they are two distinct lines.

Properties of Regression Coefficients: The coefficients $b_{yx}$ and $b_{xy}$ always have the same sign as the correlation coefficient $r$. If the regression lines slope upwards on the graph, both coefficients and $r$ are positive; if they slope downwards, all three are negative.

The Intersection Point: Both regression lines, $y$ on $x$ and $x$ on $y$, always pass through the point of arithmetic means $(\bar{x}, \bar{y})$. On a coordinate system, this point acts as the 'center of gravity' for the data distribution.

Relationship with Correlation Coefficient: The correlation coefficient $r$ is the geometric mean of the two regression coefficients, $r = \pm \sqrt{b_{yx} \cdot b_{xy}}$, where the sign is chosen to match the common sign of $b_{yx}$ and $b_{xy}$. Geometrically, the closer the two regression lines are to each other, the stronger the correlation (approaching $|r| = 1$).
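The geometric-mean relation follows from the standard-deviation forms of the two coefficients listed under Formulae. A short derivation, assuming $\sigma_x > 0$ and $\sigma_y > 0$:

$$b_{yx} \cdot b_{xy} = \left(r \frac{\sigma_y}{\sigma_x}\right)\left(r \frac{\sigma_x}{\sigma_y}\right) = r^2 \implies r = \pm \sqrt{b_{yx} \cdot b_{xy}}$$

Because $|r| \le 1$, the product $b_{yx} \cdot b_{xy}$ can never exceed 1.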

📐Formulae

Mean of $x$ and $y$: $\bar{x} = \frac{\sum x}{n}, \quad \bar{y} = \frac{\sum y}{n}$

Regression Coefficient of $y$ on $x$: $b_{yx} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$

Regression Coefficient of $x$ on $y$: $b_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2}$

Regression Equation of $y$ on $x$: $y - \bar{y} = b_{yx}(x - \bar{x})$

Regression Equation of $x$ on $y$: $x - \bar{x} = b_{xy}(y - \bar{y})$

Relationship with Standard Deviation: $b_{yx} = r \frac{\sigma_y}{\sigma_x}$ and $b_{xy} = r \frac{\sigma_x}{\sigma_y}$

Correlation Coefficient: $r = \pm \sqrt{b_{yx} \times b_{xy}}$
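The formulae above translate almost line for line into code. The following is a minimal sketch, not a reference implementation: the function name `regression_coefficients` is hypothetical, and it assumes the inputs are equal-length lists in which neither variable is constant (so neither denominator is zero).

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) computed from the raw-sum formulae."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Both coefficients share the same numerator.
    numerator = n * sum_xy - sum_x * sum_y
    b_yx = numerator / (n * sum_x2 - sum_x ** 2)
    b_xy = numerator / (n * sum_y2 - sum_y ** 2)

    # r is the geometric mean of the coefficients, carrying their common sign.
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return b_yx, b_xy, r


# Data from Problem 1 of this sheet.
b_yx, b_xy, r = regression_coefficients([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
print(b_yx, b_xy, r)
```

Using `math.copysign` is one way to resolve the $\pm$ in the formula for $r$: the square root is always non-negative, so the sign has to be taken from the coefficients.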

💡Examples

Problem 1:

Given the following data: $x: [1, 2, 3, 4, 5]$ and $y: [2, 3, 5, 4, 6]$, find the regression equation of $y$ on $x$.

Solution:

  1. Calculate sums: $\sum x = 15$, $\sum y = 20$, $\sum x^2 = 1+4+9+16+25 = 55$, $\sum xy = (1\times2)+(2\times3)+(3\times5)+(4\times4)+(5\times6) = 2+6+15+16+30 = 69$.
  2. Number of observations: $n = 5$.
  3. Calculate means: $\bar{x} = \frac{15}{5} = 3$, $\bar{y} = \frac{20}{5} = 4$.
  4. Calculate $b_{yx}$: $b_{yx} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{5(69) - (15)(20)}{5(55) - (15)^2} = \frac{345 - 300}{275 - 225} = \frac{45}{50} = 0.9$.
  5. Form the equation: $y - 4 = 0.9(x - 3) \implies y - 4 = 0.9x - 2.7 \implies y = 0.9x + 1.3$.

Explanation:

To find the line of $y$ on $x$, we first compute the necessary summations from the table. We then find the means of $x$ and $y$. Using the least squares formula for $b_{yx}$, we determine the slope of the line. Finally, we use the point-slope form with the mean point $(3, 4)$ to derive the linear equation.

Problem 2:

If the two regression lines are $3x + 2y - 26 = 0$ and $6x + y - 31 = 0$, find the mean values of $x$ and $y$.

Solution:

  1. Since both regression lines pass through the mean point $(\bar{x}, \bar{y})$, we solve the equations simultaneously.
  2. Equation 1: $3\bar{x} + 2\bar{y} = 26$.
  3. Equation 2: $6\bar{x} + \bar{y} = 31$.
  4. Multiply Eq 2 by 2: $12\bar{x} + 2\bar{y} = 62$.
  5. Subtract Eq 1 from this result: $(12\bar{x} - 3\bar{x}) + (2\bar{y} - 2\bar{y}) = 62 - 26 \implies 9\bar{x} = 36 \implies \bar{x} = 4$.
  6. Substitute $\bar{x} = 4$ into Eq 2: $6(4) + \bar{y} = 31 \implies 24 + \bar{y} = 31 \implies \bar{y} = 7$.
  7. The mean values are $\bar{x} = 4$ and $\bar{y} = 7$.

Explanation:

The intersection point of the two regression lines is always the point of the means. By treating the regression equations as a system of linear equations and solving for the variables, we directly obtain the average values of the data set.
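The same intersection can be found programmatically. The sketch below swaps the elimination steps for Cramer's rule, which is a different but equivalent technique; the function name `solve_2x2` is hypothetical, and exact `Fraction` arithmetic is used so that no rounding occurs.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 using Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("The lines are parallel or identical; no unique intersection.")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# Problem 2: 3x + 2y = 26 and 6x + y = 31.
x_bar, y_bar = solve_2x2(3, 2, 26, 6, 1, 31)
print(x_bar, y_bar)
```

The result should match the means found by elimination, $\bar{x} = 4$ and $\bar{y} = 7$.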