Linear Regression - Scatter Diagrams and Lines of Best Fit

Grade 12 ICSE

Review the key concepts, formulae, and examples before starting your quiz.

🔑Concepts

A Scatter Diagram is a visual representation of bivariate data where each pair of observations $(x_i, y_i)$ is plotted as a point on a Cartesian plane. If the points cluster around a straight line rising from left to right, it indicates a positive linear correlation; if they fall from left to right, it indicates a negative linear correlation.

The Line of Best Fit (Regression Line) is a mathematical line that best represents the trend of the data points. Visually, it is drawn such that the vertical distances (residuals) between the actual data points and the line are minimized. In the 'Least Squares Method', we minimize the sum of the squares of these residuals.
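The least squares idea above can be sketched in a few lines of Python. The data set and the helper names `fit_y_on_x` and `sum_squared_residuals` are made up for illustration and are not part of these notes. The sketch fits the line of $y$ on $x$ and then checks that nudging the slope or the intercept only increases the sum of squared residuals.

```python
def fit_y_on_x(xs, ys):
    """Least squares slope and intercept for the line of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_y_on_x(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Any other line should give a larger total than the fitted one.
nudged_slope = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
nudged_intercept = sum_squared_residuals(xs, ys, slope, intercept - 0.1)

print(slope, intercept, best, nudged_slope, nudged_intercept)
```

For this data the fitted slope is 0.6 and the intercept is 2.2, up to floating-point rounding.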

There are two regression lines for every bivariate distribution: the line of $y$ on $x$ (used to estimate $y$ for a given $x$) and the line of $x$ on $y$ (used to estimate $x$ for a given $y$). Geometrically, both lines always intersect at the point $(\bar{x}, \bar{y})$, which represents the arithmetic means of the two variables.
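As a rough illustration of how the two lines are used, the sketch below starts from assumed summary values (means of 3 and 4, regression coefficients of 0.6 and 1.0; these numbers are invented for the example). It estimates $y$ from one line and $x$ from the other, then solves the two equations together to confirm that they meet at the point of means.

```python
# Assumed summary values (hypothetical, not taken from these notes).
x_bar, y_bar = 3.0, 4.0
b_yx, b_xy = 0.6, 1.0


def estimate_y(x):
    """Line of y on x: y - y_bar = b_yx * (x - x_bar)."""
    return y_bar + b_yx * (x - x_bar)


def estimate_x(y):
    """Line of x on y: x - x_bar = b_xy * (y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)


def intersection():
    """Solve y = b_yx*x + a and x = b_xy*y + c together (needs b_yx*b_xy != 1)."""
    a = y_bar - b_yx * x_bar
    c = x_bar - b_xy * y_bar
    x = (b_xy * a + c) / (1 - b_yx * b_xy)
    return x, b_yx * x + a


y_when_x_is_5 = estimate_y(5.0)  # use the y-on-x line to estimate y
x_when_y_is_5 = estimate_x(5.0)  # use the x-on-y line to estimate x
meet_x, meet_y = intersection()

print(y_when_x_is_5, x_when_y_is_5, meet_x, meet_y)
```

The two estimates come from different lines, so swapping them (for example, inverting the $y$ on $x$ line to estimate $x$) would generally give a different answer.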

The Regression Coefficients, denoted as $b_{yx}$ and $b_{xy}$, represent the slopes of the regression lines. $b_{yx}$ measures the change in $y$ per unit change in $x$. Visually, if $b_{yx}$ is positive, the line of $y$ on $x$ slopes upward; if negative, it slopes downward.

The Correlation Coefficient ($r$) is the geometric mean of the two regression coefficients: $r = \pm \sqrt{b_{yx} \cdot b_{xy}}$. The sign of $r$ is always the same as the sign of $b_{yx}$ and $b_{xy}$. On a graph, if $r = 1$ or $r = -1$, all points lie exactly on a single straight line.

The angle between the two regression lines indicates the strength of the correlation. If the lines are perpendicular, the correlation is zero ($r = 0$), appearing as a circular cloud of points. If the lines coincide (the angle is $0^\circ$), the correlation is perfect ($r = \pm 1$).
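One way to see this numerically is to compute the acute angle between the two lines from their direction vectors. In the sketch below, the function name `angle_between_regression_lines` is an assumption made for this example. The line of $y$ on $x$ has direction $(1, b_{yx})$ and the line of $x$ on $y$ has direction $(b_{xy}, 1)$, so the code also handles the case where both coefficients are zero.

```python
import math


def angle_between_regression_lines(b_yx, b_xy):
    """Acute angle, in degrees, between the two regression lines."""
    d1 = (1.0, b_yx)  # direction of the line of y on x
    d2 = (b_xy, 1.0)  # direction of the line of x on y
    dot = d1[0] * d2[0] + d1[1] * d2[1]
    cos_theta = abs(dot) / (math.hypot(*d1) * math.hypot(*d2))
    cos_theta = min(1.0, cos_theta)  # guard against rounding just above 1
    return math.degrees(math.acos(cos_theta))


no_correlation = angle_between_regression_lines(0.0, 0.0)       # r = 0
perfect_correlation = angle_between_regression_lines(2.0, 0.5)  # product is 1
in_between = angle_between_regression_lines(-1.5, -1 / 6)       # r = -0.5

print(no_correlation, perfect_correlation, in_between)
```

With both coefficients zero the lines are horizontal and vertical, giving $90^\circ$; when the product of the coefficients is 1 the lines coincide and the angle is $0^\circ$ up to rounding.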

📐Formulae

Mean: $\bar{x} = \frac{\sum x}{n}$ and $\bar{y} = \frac{\sum y}{n}$

Regression Equation of $y$ on $x$: $y - \bar{y} = b_{yx}(x - \bar{x})$

Regression Equation of $x$ on $y$: $x - \bar{x} = b_{xy}(y - \bar{y})$

Regression Coefficient $b_{yx} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ or $b_{yx} = r \frac{\sigma_y}{\sigma_x}$

Regression Coefficient $b_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2}$ or $b_{xy} = r \frac{\sigma_x}{\sigma_y}$

Coefficient of Correlation: $r = \pm \sqrt{b_{yx} \cdot b_{xy}}$ (Note: $r$ takes the sign of the regression coefficients, which always share the same sign)

Standard Deviation: $\sigma_x = \sqrt{\frac{\sum x^2}{n} - (\bar{x})^2}$
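The formulae above can be checked against each other with a short script. This is only a sketch on made-up data (the lists `xs` and `ys` are hypothetical). It computes both regression coefficients from the sums, derives $r$ from their product, and confirms that the alternative forms $r \frac{\sigma_y}{\sigma_x}$ and $r \frac{\sigma_x}{\sigma_y}$ give the same coefficients.

```python
import math

# Hypothetical data, not taken from these notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Both coefficients share the same numerator.
numerator = n * sum_xy - sum_x * sum_y
b_yx = numerator / (n * sum_x2 - sum_x ** 2)
b_xy = numerator / (n * sum_y2 - sum_y ** 2)

# r is the geometric mean of the coefficients and takes their common sign.
sign = -1.0 if b_yx < 0 else 1.0
r = sign * math.sqrt(b_yx * b_xy)

# Standard deviations from the sums, using the formula in these notes.
sigma_x = math.sqrt(sum_x2 / n - (sum_x / n) ** 2)
sigma_y = math.sqrt(sum_y2 / n - (sum_y / n) ** 2)

# The alternative forms should reproduce the same coefficients.
b_yx_alt = r * sigma_y / sigma_x
b_xy_alt = r * sigma_x / sigma_y

print(b_yx, b_xy, r, b_yx_alt, b_xy_alt)
```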

💡Examples

Problem 1:

Given the following data: $\sum x = 30$, $\sum y = 40$, $\sum x^2 = 220$, $\sum xy = 300$, and $n = 5$. Find the regression equation of $y$ on $x$.

Solution:

  1. Calculate the means: $\bar{x} = \frac{30}{5} = 6$ and $\bar{y} = \frac{40}{5} = 8$.
  2. Calculate the regression coefficient $b_{yx}$: $b_{yx} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{5(300) - (30)(40)}{5(220) - (30)^2} = \frac{1500 - 1200}{1100 - 900} = \frac{300}{200} = 1.5$.
  3. Form the equation: $y - 8 = 1.5(x - 6)$, so $y - 8 = 1.5x - 9$, which gives $y = 1.5x - 1$.

Explanation:

We first identify the necessary sums and calculate the means. Then, we use the formula for $b_{yx}$ which uses the sums directly. Finally, we substitute the mean values and the coefficient into the point-slope form of the regression line equation.
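The arithmetic in Problem 1 can be re-checked with a few lines of Python. The variable names are chosen for this sketch; the numbers are the ones given in the problem.

```python
# Sums given in Problem 1.
n = 5
sum_x, sum_y, sum_x2, sum_xy = 30, 40, 220, 300

# Step 1: means.
x_bar = sum_x / n
y_bar = sum_y / n

# Step 2: regression coefficient of y on x.
b_yx = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Step 3: rewrite y - y_bar = b_yx * (x - x_bar) as y = b_yx * x + intercept.
intercept = y_bar - b_yx * x_bar

print(f"y = {b_yx}x + ({intercept})")  # prints: y = 1.5x + (-1.0)
```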

Problem 2:

The two regression lines are $3x + 2y - 26 = 0$ and $6x + y - 31 = 0$. Find the mean values of $x$ and $y$ and the correlation coefficient $r$.

Solution:

  1. To find the means $(\bar{x}, \bar{y})$, solve the equations simultaneously:
     $3x + 2y = 26$ (Eq 1)
     $6x + y = 31 \implies y = 31 - 6x$ (Eq 2)
     Substitute Eq 2 into Eq 1: $3x + 2(31 - 6x) = 26 \implies 3x + 62 - 12x = 26 \implies -9x = -36 \implies \bar{x} = 4$.
     Substitute $\bar{x} = 4$ into Eq 2: $y = 31 - 6(4) = 7 \implies \bar{y} = 7$.
  2. To find $r$, assume $3x + 2y - 26 = 0$ is the line of $y$ on $x$: $2y = -3x + 26 \implies y = -1.5x + 13$, so $b_{yx} = -1.5$.
     Assume $6x + y - 31 = 0$ is the line of $x$ on $y$: $6x = -y + 31 \implies x = -\frac{1}{6}y + \frac{31}{6}$, so $b_{xy} = -\frac{1}{6} \approx -0.1667$.
  3. Check consistency: $b_{yx} \cdot b_{xy} = (-1.5) \cdot \left(-\frac{1}{6}\right) = 0.25$. Since $0.25 \le 1$, the assumption is correct.
  4. $r = -\sqrt{0.25} = -0.5$ (the sign is negative because both coefficients are negative).

Explanation:

Since the regression lines intersect at the means, we solve the system of equations to find $\bar{x}$ and $\bar{y}$. To find $r$, we identify $b_{yx}$ and $b_{xy}$ by rearranging the equations, choosing the assignment for which the product of the two regression coefficients is less than or equal to 1 (because $b_{yx} \cdot b_{xy} = r^2 \le 1$). The sign of $r$ matches the sign of the coefficients.
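The procedure in Problem 2 can be sketched as a small script. The helper names (`point_of_means`, `correlation`, and so on) are assumptions made for this example. Each line is written as a tuple $(a, b, c)$ for $ax + by + c = 0$, the means come from solving the pair by Cramer's rule, and both ways of labelling the lines are tried until the product of the coefficients is at most 1.

```python
from fractions import Fraction
import math


def point_of_means(l1, l2):
    """Intersection of a1*x + b1*y + c1 = 0 and a2*x + b2*y + c2 = 0."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    return Fraction(b1 * c2 - b2 * c1, det), Fraction(a2 * c1 - a1 * c2, det)


def coeff_y_on_x(line):
    """b_yx if the line is read as y on x: y = -(a/b)*x - c/b."""
    a, b, _ = line
    return Fraction(-a, b)


def coeff_x_on_y(line):
    """b_xy if the line is read as x on y: x = -(b/a)*y - c/a."""
    a, b, _ = line
    return Fraction(-b, a)


def correlation(l1, l2):
    """Try both labellings; keep the one with 0 <= b_yx * b_xy <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx, b_xy = coeff_y_on_x(y_on_x), coeff_x_on_y(x_on_y)
        product = b_yx * b_xy
        if 0 <= product <= 1:
            sign = -1.0 if b_yx < 0 else 1.0
            return sign * math.sqrt(product)
    raise ValueError("no valid labelling of the two lines")


line_1 = (3, 2, -26)  # 3x + 2y - 26 = 0
line_2 = (6, 1, -31)  # 6x + y - 31 = 0

x_bar, y_bar = point_of_means(line_1, line_2)
r = correlation(line_1, line_2)
print(x_bar, y_bar, r)  # prints: 4 7 -0.5
```

Swapping the roles of the two lines gives $b_{yx} = -6$ and $b_{xy} = -\frac{2}{3}$, whose product is 4, so that labelling is rejected.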