Statistics and Probability - Correlation and Regression Analysis

Grade 11 ICSE

Review the key concepts, formulae, and examples before starting your quiz.

🔑 Concepts

• Scatter Diagrams: A scatter diagram is a visual representation in which pairs of bivariate data $(x, y)$ are plotted as individual points on a Cartesian plane. If the cluster of points tends to rise from the bottom-left to the top-right, it indicates a positive correlation. If the points slope downward from the top-left to the bottom-right, it indicates a negative correlation. A random spread of points with no visible trend suggests zero correlation.

• Pearson's Coefficient of Correlation ($r$): This numerical value measures the strength and direction of a linear relationship between two variables. It ranges from $-1$ to $+1$. A value of $r = +1$ represents a perfect positive linear relationship (all points on a rising straight line), $r = -1$ represents a perfect negative linear relationship (all points on a falling straight line), and $r = 0$ indicates no linear correlation.

• Regression Lines: These are 'lines of best fit' obtained by the method of least squares, that is, by minimizing the sum of the squared deviations of the data points from the line. There are two lines: the regression line of $y$ on $x$ (used to predict $y$ when $x$ is known) and the regression line of $x$ on $y$ (used to predict $x$ when $y$ is known). The two lines intersect at the point of the means $(\bar{x}, \bar{y})$.

• Regression Coefficients ($b_{yx}$ and $b_{xy}$): These represent the slopes of the regression lines. $b_{yx}$ is the slope of the line of $y$ on $x$, indicating the change in $y$ for a unit change in $x$. Similarly, $b_{xy}$ is the slope of the line of $x$ on $y$, indicating the change in $x$ for a unit change in $y$. Both coefficients always have the same algebraic sign, which is also the sign of the correlation coefficient $r$.

• Spearman's Rank Correlation ($r_s$): This method is used when the variables are qualitative (like beauty or intelligence) or when the data is ranked. It measures the degree of similarity between two sets of rankings. If $r_s = 1$, the ranks are identical; if $r_s = -1$, the ranks are in exactly opposite order.

• Geometric Property of $r$: The correlation coefficient $r$ is the geometric mean of the two regression coefficients, expressed as $r = \pm \sqrt{b_{yx} \cdot b_{xy}}$. The sign of $r$ is taken from the sign of the coefficients. If both $b_{yx}$ and $b_{xy}$ are positive, $r$ is positive; if both are negative, $r$ is negative.

• Angle Between Regression Lines: If $r = \pm 1$, the two regression lines coincide, forming an angle of $0^{\circ}$, which indicates a perfect linear relationship. If $r = 0$, the lines are perpendicular to each other, intersecting at right angles at the point $(\bar{x}, \bar{y})$.
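The two limiting cases in the last point can be read off the standard textbook expression for the acute angle $\theta$ between the regression lines. That expression is not stated in these notes; it is included here only as a supporting sketch. It comes from applying the angle-between-lines formula to the slopes $b_{yx}$ and $1/b_{xy}$ in the $xy$-plane.

```latex
\begin{align*}
\tan\theta &= \left| \frac{\dfrac{1}{b_{xy}} - b_{yx}}{1 + \dfrac{b_{yx}}{b_{xy}}} \right|
            = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2} \\[4pt]
r = \pm 1 &\;\Rightarrow\; \tan\theta = 0 \;\Rightarrow\; \theta = 0^{\circ}
            \quad \text{(the lines coincide)} \\
r \to 0 &\;\Rightarrow\; \tan\theta \to \infty \;\Rightarrow\; \theta = 90^{\circ}
            \quad \text{(the lines are perpendicular)}
\end{align*}
```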

πŸ“Formulae

Arithmetic Mean: $\bar{x} = \frac{\sum x}{n}, \quad \bar{y} = \frac{\sum y}{n}$

Covariance: $\operatorname{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{\sum xy}{n} - \bar{x}\bar{y}$

Pearson's Correlation Coefficient: $r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y}$

Computational formula for $r$: $r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$
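As a rough illustration, the computational formula can be evaluated directly from raw data. The sketch below is one minimal way to do it; the function name `pearson_r` and the sample values are illustrative assumptions, not part of these notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from raw data, using the computational (raw-sums) formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical data chosen so that y = 2x exactly (a perfect rising line).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_rising = pearson_r(xs, ys)         # perfect positive relationship
r_falling = pearson_r(xs, ys[::-1])  # same values in reverse order
```

With points lying exactly on a rising line the function returns $+1$, and reversing the $y$ values gives $-1$, matching the two extreme cases described under Concepts.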

Spearman's Rank Correlation: $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where $d$ is the difference in ranks.
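The rank formula is simple enough to check by hand, but a short sketch may help when comparing longer rankings. It assumes the inputs are already ranks with no ties (the formula above does not correct for tied ranks); the function name `spearman_rs` and the example rankings are hypothetical.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks (no ties assumed)."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two equal-length rankings with at least 2 items")

    # d is the difference between the two ranks given to the same item.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Hypothetical rankings of five items by two judges.
judge_a = [1, 2, 3, 4, 5]
rs_identical = spearman_rs(judge_a, [1, 2, 3, 4, 5])  # same order
rs_opposite = spearman_rs(judge_a, [5, 4, 3, 2, 1])   # exactly reversed order
```

Identical rankings give $r_s = 1$ and exactly reversed rankings give $r_s = -1$, as stated under Concepts.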

Regression Coefficient $b_{yx}$: $b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$

Regression Coefficient $b_{xy}$: $b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2}$

Regression Line of $y$ on $x$: $y - \bar{y} = b_{yx}(x - \bar{x})$

Regression Line of $x$ on $y$: $x - \bar{x} = b_{xy}(y - \bar{y})$
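To see how the regression coefficients and the two lines fit together, here is a minimal sketch that computes both slopes from raw data and then predicts with each line. The helper names (`regression_summary`, `predict_y`, `predict_x`) and the five data points are illustrative assumptions.

```python
def regression_summary(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) using the raw-sums formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    shared_numerator = n * sum_xy - sum_x * sum_y
    b_yx = shared_numerator / (n * sum_x2 - sum_x ** 2)  # slope of y on x
    b_xy = shared_numerator / (n * sum_y2 - sum_y ** 2)  # slope of x on y
    return sum_x / n, sum_y / n, b_yx, b_xy


def predict_y(x, mean_x, mean_y, b_yx):
    """Line of y on x: y - mean_y = b_yx * (x - mean_x)."""
    return mean_y + b_yx * (x - mean_x)


def predict_x(y, mean_x, mean_y, b_xy):
    """Line of x on y: x - mean_x = b_xy * (y - mean_y)."""
    return mean_x + b_xy * (y - mean_y)


# Hypothetical data, used only to exercise the formulas.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mean_x, mean_y, b_yx, b_xy = regression_summary(xs, ys)
y_at_6 = predict_y(6, mean_x, mean_y, b_yx)
```

Both slopes share the same numerator, which is why they always carry the same sign, and both lines pass through $(\bar{x}, \bar{y})$ because each prediction function returns the other mean when given its own.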

💡 Examples

Problem 1:

Given the following data: $\sum x = 30$, $\sum y = 40$, $\sum xy = 214$, $\sum x^2 = 220$, $\sum y^2 = 340$, and $n = 5$. Calculate the Pearson correlation coefficient $r$.

Solution:

Step 1: Use the computational formula for $r$: $r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$

Step 2: Substitute the given values: $r = \frac{5(214) - (30)(40)}{\sqrt{[5(220) - (30)^2][5(340) - (40)^2]}}$

Step 3: Simplify the numerator: $1070 - 1200 = -130$.

Step 4: Simplify the denominator: $\sqrt{[1100 - 900][1700 - 1600]} = \sqrt{200 \times 100} = \sqrt{20000} \approx 141.42$.

Step 5: Final calculation: $r = \frac{-130}{141.42} \approx -0.919$.

Explanation:

The value of $r \approx -0.92$ indicates a very strong negative linear correlation between variables $x$ and $y$.
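The arithmetic in Problem 1 can be double-checked with a few lines of code. This is only a sketch that plugs the given sums into the computational formula; the variable names are illustrative.

```python
from math import sqrt

# Sums given in Problem 1.
n = 5
sum_x, sum_y = 30, 40
sum_xy = 214
sum_x2, sum_y2 = 220, 340

numerator = n * sum_xy - sum_x * sum_y  # 1070 - 1200
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))  # sqrt(200 * 100)
r = numerator / denominator

print(numerator)              # -130
print(round(denominator, 2))  # 141.42
print(round(r, 3))            # -0.919
```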

Problem 2:

The two regression lines are given by $x + 2y - 5 = 0$ and $2x + 3y - 8 = 0$. Find the mean values of $x$ and $y$ ($\bar{x}$ and $\bar{y}$) and the correlation coefficient $r$.

Solution:

Step 1: The regression lines intersect at $(\bar{x}, \bar{y})$, so solve the two equations simultaneously to find the means:

  1. $x + 2y = 5$
  2. $2x + 3y = 8$

Multiply (1) by 2: $2x + 4y = 10$. Subtract (2) from this: $(2x + 4y) - (2x + 3y) = 10 - 8 \Rightarrow y = 2$. Substitute $y = 2$ into (1): $x + 2(2) = 5 \Rightarrow x = 1$. So $\bar{x} = 1$, $\bar{y} = 2$.

Step 2: Find $b_{yx}$ and $b_{xy}$. Assume (1) is the line of $y$ on $x$: $2y = -x + 5 \Rightarrow y = -0.5x + 2.5 \Rightarrow b_{yx} = -0.5$. Assume (2) is the line of $x$ on $y$: $2x = -3y + 8 \Rightarrow x = -1.5y + 4 \Rightarrow b_{xy} = -1.5$.

Step 3: Check validity. Since $r^2 = b_{yx} \cdot b_{xy}$, the product cannot exceed 1. Here $b_{yx} \cdot b_{xy} = (-0.5)(-1.5) = 0.75 \le 1$, so the assumption is consistent. The reverse assignment would give $b_{yx} = -\frac{2}{3}$ and $b_{xy} = -2$, whose product $\frac{4}{3} > 1$ is impossible.

Step 4: Calculate $r = -\sqrt{b_{yx} \cdot b_{xy}} = -\sqrt{0.75} \approx -0.866$ (negative because both coefficients are negative).

Explanation:

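The means are found as the intersection of the two regression lines. The correlation coefficient then follows from the property that $r$ is the geometric mean of the two regression coefficients, $r = \pm \sqrt{b_{yx} \cdot b_{xy}}$, taking the sign that the coefficients share.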
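The same reasoning can be sketched in code: find the intersection, try both ways of assigning the lines, and keep the assignment whose slope product does not exceed 1. Each line is written as a tuple `(a, b, c)` meaning $ax + by = c$; that representation and the helper names are assumptions made for this illustration.

```python
from fractions import Fraction
from math import sqrt

# Problem 2 lines, each as (a, b, c) for a*x + b*y = c.
line1 = (1, 2, 5)
line2 = (2, 3, 8)


def intersection(l1, l2):
    """Solve the two equations by Cramer's rule; the solution is (mean_x, mean_y)."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("lines are parallel or coincide; no unique intersection")
    return Fraction(c1 * b2 - c2 * b1, det), Fraction(a1 * c2 - a2 * c1, det)


def correlation_from_lines(l1, l2):
    """Try both assignments and keep the one with b_yx * b_xy <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx = Fraction(-y_on_x[0], y_on_x[1])  # y = (-a/b) x + c/b
        b_xy = Fraction(-x_on_y[1], x_on_y[0])  # x = (-b/a) y + c/a
        product = b_yx * b_xy
        if 0 <= product <= 1:
            sign = -1 if b_yx < 0 else 1
            return b_yx, b_xy, sign * sqrt(float(product))
    raise ValueError("no valid assignment of the two lines")


mean_x, mean_y = intersection(line1, line2)
b_yx, b_xy, r = correlation_from_lines(line1, line2)

print(mean_x, mean_y)  # 1 2
print(b_yx, b_xy)      # -1/2 -3/2
print(round(r, 3))     # -0.866
```

Exact fractions are used for the slopes so that the validity check on the product is not affected by floating-point rounding.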