Welcome back for lecture 10, in which
we discuss inference for regression.
So let's look at regression
from a different perspective.
What is the regression line?
It's what we're calling a regression
model, which is represented by
Y i equals beta 0 plus
beta 1 X i plus epsilon i,
where the epsilon i are all independent and
normally distributed with mean zero
and some common standard deviation sigma.
Y i is the ith value of the response variable,
X i is the ith value of
the explanatory variable,
beta 0 is an intercept term for the
regression line, and beta 1 is the slope.
So this means that the response is
modelled as a normal random variable
with mean beta 0 plus beta 1 times
X i and standard deviation sigma.
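Written in symbols, the model just described might be set down as follows. This is only a transcription of the spoken formula. The symbol for the error term and the convention of writing a normal distribution as N(mean, standard deviation) are notational assumptions, not something the lecture fixes.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma)
\quad\Longrightarrow\quad
Y_i \sim N(\beta_0 + \beta_1 X_i,\ \sigma)
```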
By performing linear regression, what we're doing is
we're actually estimating the mean value of the response
for each value of the explanatory variable.
We use the estimate Y hat i equals b hat 0
plus b hat 1 X i as an estimate of the mean.
The analysis of the residuals is in effect
a check to make sure that the error terms
satisfy the assumptions of this model.
So the residuals can be viewed as estimates
of the error terms for each response
or the deviations from the estimated mean.
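As a rough illustration of the fitted values and residuals just described, here is a minimal Python sketch. The data are invented purely for illustration, and the helper name `fit_line` is hypothetical. It uses the usual least-squares formulas, which the lecture assumes but does not restate here.

```python
# Hypothetical toy data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Return the least-squares intercept and slope for y against x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(x, y)

# Fitted values estimate the mean response at each x.
fitted = [b0 + b1 * xi for xi in x]

# Residuals estimate the error terms: observed minus fitted.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

With an intercept in the model, the residuals from a least-squares fit sum to zero, which is a quick sanity check on the arithmetic.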
So what's the goal?
The goal is to use this information to make
inferences about the slope of the regression line.
So in order to perform regression inference,
we need to have some conditions satisfied,
and they are pretty similar to the conditions
that we need for linear regression.
One, Linearity. A scatterplot of our
response against the explanatory variable
should show a roughly linear pattern.
Two, Independence. All of our observations need to
be independent of each other.
Three, Equal Variance,
which means we don't want any
thickening in the plot.
And Four, Normal Populations.
Our residuals need to be nearly normal,
and we can assess this with a histogram
or with a normal probability plot.
If these conditions are satisfied, we can carry out
inference procedures for the slope of the regression line.
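For the nearly normal check, the points of a normal probability plot can be computed by hand. The sketch below uses made-up residuals and the plotting positions (i - 0.5) / n, which is one common convention and an assumption here; the helper names are hypothetical. A correlation close to 1 between the two columns corresponds to a roughly straight plot.

```python
import math
from statistics import NormalDist

# Hypothetical residuals, invented purely for illustration.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]


def normal_plot_points(values):
    """Pair each sorted value with a standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(values)))


def correlation(pairs):
    """Pearson correlation between the two coordinates of the points."""
    n = len(pairs)
    mean_q = sum(q for q, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    cov = sum((q - mean_q) * (v - mean_v) for q, v in pairs)
    var_q = sum((q - mean_q) ** 2 for q, _ in pairs)
    var_v = sum((v - mean_v) ** 2 for _, v in pairs)
    return cov / math.sqrt(var_q * var_v)


points = normal_plot_points(residuals)
r = correlation(points)
print(f"normal plot correlation = {r:.3f}")
```

This is only a numerical stand-in for looking at the plot itself; with so few points, no firm conclusion about normality should be drawn from it.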
So how do we quantify the spread
around the regression line?
what we're gonna do is we're gonna use the
sample standard deviation of the residuals.
So let's let S e denote the standard
deviation of the residuals.
Then S e is given by the square root of the sum of the Y i minus
the Y hat i squared, divided by n minus 2.
A smaller value of S e indicates less
spread around the regression line
and a stronger relationship between the
response and the explanatory variable.
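The standard deviation of the residuals can be sketched in a few lines of Python. The residuals below are hypothetical values used only to show the arithmetic, and the function name `residual_sd` is an assumption.

```python
import math

# Hypothetical residuals (observed minus fitted), for illustration only.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]


def residual_sd(residuals):
    """Square root of the sum of squared residuals over n - 2."""
    n = len(residuals)
    sum_sq = sum(e ** 2 for e in residuals)
    # n - 2 because two parameters (intercept and slope) were estimated.
    return math.sqrt(sum_sq / (n - 2))


s_e = residual_sd(residuals)
print(f"s_e = {s_e:.4f}")
```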
We also need to think about the spread
across the explanatory variable.
More spread along the range of the explanatory variable
provides a more stable regression line.
So let's let S x denote the standard deviation of
the observed values of the explanatory variable,
and let's let X bar be their mean value.
Then S x is just the square root of
the sum of the X i minus X bar squared
divided by n minus 1, where again
n is the number of observations.
Taking all these things into account,
we can estimate the variation of the slope.
So we look at the standard error of b hat 1 as S e
divided by the product of S x and the square root of n minus 1.
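To make the standard error formula concrete, here is a small Python sketch. The x values and the value of S e are hypothetical and chosen only for illustration. Note that S x times the square root of n minus 1 is the same as the square root of the sum of squared deviations of x, which the last line checks.

```python
import math
from statistics import mean, stdev

# Hypothetical explanatory values and residual standard deviation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
s_e = 0.189

n = len(x)
s_x = stdev(x)  # sample standard deviation, divides by n - 1

# Standard error of the estimated slope.
se_b1 = s_e / (s_x * math.sqrt(n - 1))

# Equivalent form using the sum of squared deviations of x directly.
sxx = sum((xi - mean(x)) ** 2 for xi in x)
se_b1_alt = s_e / math.sqrt(sxx)

print(f"SE(b1) = {se_b1:.4f}")
```

Either form shows why a smaller S e or a wider spread in x gives a more precisely estimated slope.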
So for a sample size that's large enough,
we have a central limit theorem available.
The statistic t equals b hat 1 minus the actual slope
divided by the standard error of the estimated slope
follows a t distribution with
n minus 2 degrees of freedom.
We'll use this to construct confidence intervals
and to form hypothesis tests about the slope.
We can form a 100 times 1 minus alpha percent confidence
interval for the slope of the regression line
by taking our estimated slope plus or minus the
critical t for n minus 2 degrees of freedom
times the standard error of the slope.
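In symbols, the test statistic and the confidence interval just described could be written as below. The notation is an assumption made for readability: the star marks a critical value, and the subscript alpha over 2 is the upper-tail area, matching the way the lecture uses .025 for a test at the 5% level.

```latex
t = \frac{\hat{b}_1 - \beta_1}{SE(\hat{b}_1)} \sim t_{n-2},
\qquad
\hat{b}_1 \pm t^{*}_{n-2,\,\alpha/2} \cdot SE(\hat{b}_1)
```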
So let's do an example.
Let's say that we look at body fat, and that
3% of a man's body is essential fat.
Suppose that we perform a regression to look at the
relationship between body fat and waist circumference,
and we find that the estimated body fat is equal to
minus 42.7 plus 1.7 times the waist circumference.
The question is, is there
evidence at the 5% level
that a linear relationship between
body fat and waist circumference exists?
We're going to assume that all the conditions
of the regression model are satisfied.
Using software, we find the
following summary statistics:
the slope is 1.7, and the standard
error of the estimated slope is .2350.
So in order to test for a linear relationship, we must
assume that there is none and then try to prove otherwise.
We observed 25 men.
Our hypotheses are the null, beta 1 equals 0, so there is
no linear relationship, there is no slope,
versus the alternative that there is a linear
relationship, where the slope is not equal to zero.
We assumed that the conditions are already satisfied.
Now let's do the mechanics of the test.
Our test statistic is given by b hat 1 minus
0 divided by the standard error of b hat 1,
which is 1.7 divided by .2350, or 7.234.
Under the null hypothesis, our test statistic
has a t-distribution with 23 degrees of freedom,
because we observed 25 people and our degrees
of freedom are the number observed minus 2.
So we have a t-distribution with 23 degrees of freedom,
and we're going to reject our null hypothesis
if t is less than minus t 23, .025
or if t is greater than t 23, .025.
The table gives t 23, .025 as 2.069.
Since our test statistic value is
7.234, which is bigger than 2.069,
we reject the null hypothesis and conclude
that there is evidence of a linear relationship
between waist circumference
and body fat percentage.
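The mechanics of this test can be reproduced with a short Python sketch. The slope, standard error, sample size, and critical value are the figures quoted in the lecture; the variable names are assumptions, and the critical value is typed in from the table rather than computed.

```python
# Figures quoted in the lecture.
b1_hat = 1.7     # estimated slope
se_b1 = 0.2350   # standard error of the estimated slope
n = 25           # number of men observed
t_crit = 2.069   # table value for 23 degrees of freedom, upper-tail area .025

df = n - 2

# Under the null hypothesis the true slope is 0.
t_stat = (b1_hat - 0) / se_b1

# Two-sided test: reject if t falls in either tail.
reject_null = abs(t_stat) > t_crit

print(f"df = {df}, t = {t_stat:.3f}, reject H0: {reject_null}")
# prints: df = 23, t = 7.234, reject H0: True
```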
Let's construct the 99% confidence interval
for the slope of this regression line.
The interval again is given by b hat
1 plus or minus t 23, .005 this time,
because we're looking for a 99% confidence
interval, times the standard error of b hat 1.
So we have 1.7 plus or minus 2.807 times .235,
which gives us an interval of 1.040 to 2.360.
So what does this tell us?
Well, it tells us that we're 99%
confident that we can expect an increase
of between 1.040 and 2.360 in the percentage of body
fat for a 1 inch increase in waist circumference.
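The interval arithmetic can be checked with a few lines of Python. The inputs are the numbers quoted in the lecture; the variable names are assumptions, and the critical value 2.807 is taken from the table as stated rather than computed.

```python
# Figures quoted in the lecture.
b1_hat = 1.7     # estimated slope
se_b1 = 0.235    # standard error of the estimated slope
t_crit = 2.807   # table value for 23 degrees of freedom, upper-tail area .005

# Margin of error is the critical value times the standard error.
margin = t_crit * se_b1
lower = b1_hat - margin
upper = b1_hat + margin

print(f"99% CI for the slope: ({lower:.3f}, {upper:.3f})")
# prints: 99% CI for the slope: (1.040, 2.360)
```

Since the interval does not contain 0, it agrees with the result of the two-sided test.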
In regression inference, there are
a lot of things that can go wrong,
so there's some things
that we want to avoid.
First of all, do not fit a linear regression
model to data that are not straight.
Watch out for thickening plots.
This is an indication of a violation of the equal variance
assumption for regression inference.
Make sure the error terms are normal.
If they don't look normal, then this is an indication
of a violation of the nearly normal condition.
We want to watch out for influential points and
outliers that can mess up our regression equation.
And finally, when performing regression
inference, we want to be sure
of whether we want a one-tail test
or a two-tail test for the slope.
So what did we do in this lecture?
Well, basically what we did was we restated the regression
model, looked at it from a different perspective,
and described how we carry out hypothesis
tests and form confidence intervals
in order to determine whether or not
there's a linear relationship between
the response and the explanatory variable.
We finished up by looking at the things
that can go wrong in regression inference,
and so there are at least five
pitfalls that we want to avoid.
So that's the end of lecture 10.
You've made it through Statistics 2,
and I hope you've enjoyed the course.
I thank you for taking it, and I hope that now you
have a better idea of how to perform data analysis and
how to carry out statistical inference, and that
you find it useful in your future endeavors.