# Inference for Regression

Transcript

00:02 Welcome back for lecture 10, in which we discuss inference for regression. So let's look at regression from a different perspective.

00:09 What is the regression line? It comes from what we're calling a regression model, which is represented by Y i equals beta 0 plus beta 1 X i plus epsilon i, where the epsilon i are all independent and normally distributed with mean zero and some common standard deviation sigma. Y i is the ith value of the response variable, X i is the ith value of the explanatory variable, beta 0 is the intercept term for the regression line, and beta 1 is the slope.

00:38 So this means that the response is modelled as a normal random variable with mean beta 0 plus beta 1 times X i and standard deviation sigma. By performing linear regression, what we're actually doing is estimating the mean value of the response for each value of the explanatory variable.

00:57 We use the fitted value Y hat i equals b hat 0 plus b hat 1 X i as an estimate of that mean.
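The model and its fitted line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "true" values `BETA0`, `BETA1` and `SIGMA`, the simulated data, and the helper name `fit_line` are all invented here and do not come from the lecture. The helper uses the usual least-squares formulas for the slope and intercept.

```python
import random


def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical "true" parameters, chosen only for this illustration.
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(200)]  # explanatory values 0.0, 0.1, ..., 19.9
# Each response is normal with mean BETA0 + BETA1 * x and standard deviation SIGMA.
ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]  # estimated mean response at each x
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

With this much data the estimates should land close to the values used to simulate it, which is the sense in which the fitted line estimates the mean response.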

01:06 The analysis of the residuals is in effect a check to make sure that the error terms satisfy the assumptions of this model.

01:13 So the residuals can be viewed as estimates of the error terms for each response or the deviations from the estimated mean.

01:22 So what's the goal? The goal is to use this information to make inferences about the slope of the regression line. To perform regression inference, we need some conditions to be satisfied, and they are pretty similar to the conditions that we need for linear regression. One, Linearity: a scatterplot of our response against the explanatory variable should show a roughly linear pattern.

01:46 Two, Independence.

01:48 All of our observations need to be independent of each other.

01:52 Three, Equal Variance,

01:53 which means we don't want any thickening in the plot: the vertical spread should stay roughly the same across the range of the explanatory variable.

01:57 And four, Normal Populations.

02:00 Our residuals need to be nearly normal,

02:02 and we can assess this with a histogram or with a normal probability plot.

02:07 If these conditions are satisfied, we can carry out inference procedures for the slope of the regression line.
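For the nearly normal condition, the coordinates of a normal probability plot can be computed by hand. The sketch below is a rough illustration, not a formal test: the residual values are made up, the helper names are hypothetical, and the plotting position `(i - 0.5) / n` is one common convention assumed here. A correlation near 1 between the sorted residuals and the normal quantiles corresponds to a roughly straight plot.

```python
import math
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with the standard normal quantile for its rank."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i - 0.5) / n is an assumed plotting position; other conventions exist.
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(residuals)))


def correlation(pairs):
    """Pearson correlation of the (u, v) pairs."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    suv = sum((u - mean_u) * (v - mean_v) for u, v in pairs)
    suu = sum((u - mean_u) ** 2 for u, _ in pairs)
    svv = sum((v - mean_v) ** 2 for _, v in pairs)
    return suv / math.sqrt(suu * svv)


# Made-up residuals, for illustration only.
residuals = [-1.2, 0.4, 0.9, -0.3, 0.1, -0.7, 1.5, -0.5, 0.2, -0.4]

points = normal_plot_points(residuals)
r = correlation(points)
print(f"normal-plot correlation: {r:.3f}")
```

Plotting `points` with any charting tool gives the normal probability plot itself; the correlation is just a one-number summary of how straight it is.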

02:14 So how do we quantify the spread around the regression line? What we're going to do is use the sample standard deviation of the residuals. So let's let S e denote the standard deviation of the residuals.

02:27 Then S e is given by the square root of the following: the sum of the Y i minus Y hat i squared, divided by n minus 2. A smaller value of S e indicates less spread around the regression line and a stronger relationship between the response and the explanatory variable. We also need to think about the spread across the explanatory variable.

02:47 A larger spread along the range of the explanatory variable provides a more stable regression line.

02:53 So let's let S x denote the standard deviation of the observed values of the explanatory variable, and let's let X bar be their mean value. Then S x is just the square root of the following: the sum of the X i minus X bar squared, divided by n minus 1, where again n is the number of observations.
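Written as formulas, the two spread measures just described look as follows. The square root is what turns each averaged sum of squares into a standard deviation. The lowercase symbols $s_e$, $s_x$, $y_i$, $\hat{y}_i$, $x_i$ and $\bar{x}$ are notation chosen here for the spoken "S e", "S x", "Y i", "Y hat i", "X i" and "X bar".

$$
s_e = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n - 2}},
\qquad
s_x = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}
$$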

03:13 Taking all these things into account, we can estimate the variation of the slope. We take the standard error of b hat 1 to be S e divided by the product of S x and the square root of n minus 1. So for a sample size that's large enough, we have a central limit theorem available. The statistic t equals b hat 1 minus the actual slope, divided by the standard error of the estimated slope, follows a t distribution with n minus 2 degrees of freedom.
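The pieces above can be put together in a short Python sketch. The six data points are made up for illustration, and the variable names are choices made here, not the lecture's. The last lines check a useful identity: S e divided by (S x times the square root of n minus 1) is the same as S e divided by the square root of the sum of squared deviations of the X values.

```python
import math
from statistics import mean, stdev

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 3.8, 6.1, 8.2, 9.7, 12.4]
n = len(xs)

# Least-squares slope and intercept.
x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

# Residuals: observed response minus fitted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Standard deviation of the residuals (divides by n - 2 before the square root).
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Sample standard deviation of the explanatory variable (divides by n - 1).
s_x = stdev(xs)

# Standard error of the slope, and the t statistic for a true slope of 0.
se_b1 = s_e / (s_x * math.sqrt(n - 1))
t_stat = (b1 - 0.0) / se_b1

print(f"b1 = {b1:.4f}, s_e = {s_e:.4f}, s_x = {s_x:.4f}, SE(b1) = {se_b1:.4f}")
print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

A smaller `s_e` or a larger `s_x` both shrink `se_b1`, which matches the point that less scatter around the line and more spread in the explanatory variable give a more stable slope.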

03:43 We'll use this to construct confidence intervals and to form hypothesis tests about the slope. We can form a 100 times 1 minus alpha percent confidence interval for the slope of the regression line by taking our estimated slope plus or minus the critical t for n minus 2 degrees of freedom times the standard error of the slope.
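In symbols, the test statistic and the interval just described can be written as below. The notation is chosen here: $\hat{b}_1$ is the estimated slope, $\beta_1$ the actual slope, and $t^{*}_{n-2,\,\alpha/2}$ the critical value that leaves area $\alpha/2$ in the upper tail of the t distribution with $n-2$ degrees of freedom, which is how the critical value is used for a two-sided interval.

$$
t = \frac{\hat{b}_1 - \beta_1}{SE(\hat{b}_1)} \sim t_{n-2},
\qquad
\hat{b}_1 \;\pm\; t^{*}_{n-2,\,\alpha/2} \cdot SE(\hat{b}_1)
$$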

04:02 So let's do an example. Let's say that we look at body fat, and 3% of a man's body is essential fat.

04:09 Suppose that we perform a regression to look at the relationship between body fat and waist circumference, and we find that the estimated body fat is equal to minus 42.7 plus 1.7 times the waist circumference. The question is: is there evidence at the 5% level that a linear relationship between body fat and waist circumference exists? We're going to assume that all the conditions of the regression model are satisfied. Using software, we find the following summary statistics: the slope is 1.7, and the standard error of the estimated slope is 0.2350. So in order to test for a linear relationship, we must assume that there is none and then try to prove otherwise. We observed 25 men.

04:55 Our hypotheses: the null is that beta 1 equals 0, so there is no linear relationship, there is no slope,

05:01 versus the alternative that there is a linear relationship, where the slope is not equal to zero. Conditions:

05:08 We assumed that these are already satisfied.

05:11 Now let's do the mechanics of the test.

05:14 Our test statistic is given by b hat 1 minus 0, divided by the standard error of b hat 1, which is 1.7 divided by 0.2350, or 7.234. Under the null hypothesis, our test statistic has a t-distribution with 23 degrees of freedom,

05:33 because we observed 25 people and our degrees of freedom are the number observed minus 2. So we have a t-distribution with 23 degrees of freedom, and we're going to reject our null hypothesis if t is less than minus t(23, 0.025) or if t is greater than t(23, 0.025), where t(23, 0.025) is the critical value with 23 degrees of freedom and upper-tail area 0.025. The table gives t(23, 0.025) as 2.069.

06:07 Since our test statistic value is 7.234, which is bigger than 2.069, we reject the null hypothesis and conclude that there is evidence of a linear relationship between waist circumference and body fat percentage.
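The arithmetic of this test is easy to reproduce. The sketch below uses only the numbers quoted in the example (slope 1.7, standard error 0.2350, 25 men) and takes the critical value 2.069 from the table as stated, rather than computing it; the variable names are chosen here.

```python
# Values quoted in the body fat example.
b1_hat = 1.7      # estimated slope
se_b1 = 0.2350    # standard error of the estimated slope
n = 25            # number of men observed
t_crit = 2.069    # table value for 23 degrees of freedom, upper-tail area 0.025

df = n - 2                      # degrees of freedom for the slope test
t_stat = (b1_hat - 0.0) / se_b1  # null hypothesis: true slope is 0

# Two-sided test at the 5% level: reject when |t| exceeds the critical value.
reject_null = abs(t_stat) > t_crit

print(f"df = {df}, t = {t_stat:.3f}, reject H0: {reject_null}")
```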

06:22 Let's construct the 99% confidence interval for the slope of this regression line.

06:28 The interval again is given by b hat 1 plus or minus t(23, 0.005), this time because we're looking for a 99% confidence interval, times the standard error of b hat 1. So we have 1.7 plus or minus 2.807 times 0.235, which gives us an interval of 1.040 to 2.360. So what does this tell us? Well, it tells us that we're 99% confident that we can expect an increase of between 1.040 and 2.360 in the percentage body fat for a 1 inch increase in waist circumference.
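The interval can be checked the same way. This sketch reuses the example's slope and standard error, and takes the critical value 2.807 as quoted from the table; the helper name `slope_confidence_interval` is hypothetical.

```python
def slope_confidence_interval(b1_hat, se_b1, t_crit):
    """Return (lower, upper) for the slope: estimate plus or minus t_crit * SE."""
    margin = t_crit * se_b1
    return b1_hat - margin, b1_hat + margin


# Slope and standard error from the example; 2.807 is the quoted table value
# for 23 degrees of freedom and upper-tail area 0.005 (99% confidence).
lower, upper = slope_confidence_interval(1.7, 0.2350, 2.807)

print(f"99% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The whole interval lies above zero, which agrees with the earlier decision to reject a slope of zero.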

07:08 In regression inference, there are a lot of things that can go wrong, so there's some things that we want to avoid.

07:13 First of all, do not fit a linear regression model to data that are not straight. Watch out for thickening plots.

07:20 This is an indication of a violation of one of the assumptions for regression inference. Make sure the error terms are normal.

07:28 If they don't look normal, then this is an indication of a violation of the nearly normal condition. We want to watch out for influential points and outliers that can mess up our regression equation. And finally, when performing regression inference, we want to be sure of whether we want a one-tail test or a two-tail test for the slope.
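To see how much a single influential point can change the regression equation, here is a small made-up illustration: five points that lie exactly on a line with slope 2, and then the same points plus one observation far to the right and well below the line. The data and the helper name `slope` are invented for this sketch.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

# The same data plus one hypothetical influential point at (15, 5).
xs_out = xs + [15.0]
ys_out = ys + [5.0]

slope_clean = slope(xs, ys)
slope_with_point = slope(xs_out, ys_out)

print(f"slope without the point: {slope_clean:.3f}")
print(f"slope with the point:    {slope_with_point:.3f}")
```

One point with an extreme explanatory value pulls the fitted slope most of the way toward zero here, which is why such points deserve a look before any inference about the slope.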

07:47 So what did we do in this lecture? Well, basically what we did was we restated the regression model, looked at it from a different perspective, and described how we carry out hypothesis tests and form confidence intervals in order to determine whether or not there's a linear relationship between the response and the explanatory variable.

08:05 We finished up by looking at the things that can go wrong in regression inference, and so there are at least five pitfalls that we want to avoid.

08:12 So that's the end of lecture 10. Congratulations! You've made it through Statistics 2, and I hope you've enjoyed the course. I thank you for taking it, and I hope that now you have a better idea of how to perform data analysis and how to carry out statistical inference, and that you find it useful in your future endeavors.

The lecture Inference for Regression by David Spade, PhD is from the course Statistics Part 2. It contains the following chapters:

• Inference for Regression
• Example: Body Fat
• Pitfalls to Avoid

### Included Quiz Questions

1. The residuals can be viewed as estimates of the mean value of the response variable for each value of the explanatory variable.
2. The response variable is assumed to have a normal distribution for each value of the explanatory variable.
3. By performing linear regression, we are estimating the mean value of the response variable for each value of the explanatory variable.
4. The fitted values of the response variable are used as estimates of the mean value of the response for each value of the explanatory variable.
1. The test statistic follows a t-distribution with n−2 degrees of freedom.
2. The test statistic follows a normal distribution with mean 0 and variance 1.
3. The test statistic follows a t-distribution with n degrees of freedom.
4. The test statistic follows a t-distribution with n−1 degrees of freedom.
1. The scatterplot of the response variable against the explanatory variable shows random scatter about 0.
2. The residuals are nearly normal.
3. The plot does not thicken.
4. The observations are independent of each other.
1. We quantify the spread around the regression line by using the standard error of the slope.
2. We quantify spread around the regression line by using the standard error of the intercept term.
3. We quantify spread around the regression line by using the mean value of the explanatory variable.
4. We quantify spread around the regression line by using only the sample standard deviation of the residuals.
1. Fitting a linear regression to data that is not linear will not have a negative effect on the inference procedures for a slope.
2. Being careful of thickening plots will not have a negative effect on the inference procedures for a slope.
3. Making sure that the residuals are nearly normal will not have a negative effect on the inference procedures for a slope.
4. Being careful of outliers and influential points will not have a negative effect on the inference procedures for a slope.

### Author of lecture Inference for Regression 