Welcome to Lecture 7, where we're going to discuss Linear Regression.
All right, so when we talk about the linear model, what we mean is that we want to describe relationships between quantitative variables with a line.
How do we do it and what does it mean?
Well, sometimes our goal is to model the relationship between two quantitative variables with a line.
While it's unlikely that any straight line will pass through all of the data points,
under the right conditions, a line that goes through our data
can give a close approximation to this relationship.
The linear model is the equation of a straight line that goes through our data.
Our data are imperfect, our line doesn't go through all the data points,
so there's going to be some error which we call residuals.
When we have deviations from the data to the line, that's what we mean by a residual.
We call our observed response values y and our predicted values, based on the line, y hat.
We'll talk more later about how we get these predicted values.
The distance between y and y hat is known as the residual for that observation.
It tells us how far off the model's prediction is for that point.
For example, if we have y = 22, if that's what we observe,
and our model predicts y = 26.5, then the residual for that observation is 22 minus 26.5, or -4.5.
Residuals can be viewed as the vertical distance from the observed value to the line.
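As a tiny sketch of that arithmetic (the function name here is my own):

```python
def residual(y_observed, y_predicted):
    # Residual = observed value minus predicted value.
    return y_observed - y_predicted

# The example from the lecture: we observe 22, the model predicts 26.5.
print(residual(22, 26.5))  # -> -4.5
```

A negative residual means the model's prediction was above the observed value.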
Here's a picture of what it looks like. We have in red, the observed values.
We have in green, the predicted values based on the line.
The lengths of the lines that you see connecting these points represent what we call the residuals.
Now, we wanna talk about the regression line.
We wanna know about the line that best fits our data.
So what is the regression line?
Well, first of all, let's think about a line that's a poor fit to the data; it's going to show large residuals.
A line that's a better fit to the data will have smaller residuals.
We could try to find the line that minimizes the sum of all the residuals,
but the negative residuals cancel out the positive ones, so the sum of the residuals is always 0,
so we can't do that. What do we do?
Instead, we find the line that minimizes the sum of the squares of all the residuals.
This result is known as the line of best fit or the least-squares line.
We've talked about the correlation to measure the strength of the linear relationships,
and now we're talking about modelling our data using a line.
It seems like those two might be related, but how?
What does correlation tell us about the regression line?
Well, it tells us a lot.
For instance, it can tell us whether the line has a positive or negative slope.
If the correlation is negative, then the line has a negative slope.
If the correlation is positive, then the regression line has a positive slope.
In many cases the correlation can give some insight into how well our linear model fits our data,
but we need to be careful in doing this, and we'll talk more about that a little bit later.
How do we find it? Well, here's the equation of the regression line.
It takes the form: y hat, which is our predicted value, equals b0 plus b1 times the value of our explanatory variable.
So b0 is the intercept of the line, and b1 is the slope.
Okay, so to find those b0 and b1 values, let's start with the slope, which is b1.
If r is the correlation, and sx and sy are the standard deviations of the x's and y's, respectively.
Then, we can find the slope by just taking the correlation times the standard deviation of the y's
over the standard deviation of the x's.
The slope tells us the number of units by which we expect the value of our response
to increase with a one-unit increase in x.
Now, we have to calculate the y-intercept, and we use the slope in this calculation.
If x bar is the mean of all the x values, and y bar is the mean of all the y values,
then what we have is b0 equals the mean of the y's minus the slope times the mean of the x's.
The y-intercept, in some cases, tells us what value of y we should expect when x equals 0.
But again, we have to be careful here. Why?
Because it doesn't necessarily make sense for x to be 0.
And if it doesn't make sense for x to be 0, then the intercept has no meaningful interpretation.
It's just where the regression line crosses the y-axis.
Let's look at an example. Suppose we don't have a scale available to weigh a sugar maple leaf,
but we have a ruler available to measure its width.
We aim to estimate leaf mass in grams by using the width in centimeters as a predictor.
We have the following data available from past repetitions of this experiment.
Here are the widths of the leaves. Here are the masses of the leaves.
Before we go directly for a linear regression, we need to see if it's an appropriate thing to do.
Let's look at the scatterplot of the width versus the leaf mass.
What do we see? Well, we see a clear linear pattern in a positive direction,
and it seems that a line would fit pretty well.
Let's obtain the equation of the regression line.
We have the correlation at 0.9113, the standard deviation of the widths at 1.3771,
and the standard deviation of mass at 0.1493.
The mean width is 9.295, and the mean mass is 0.4968.
We can use all these and find b0 and b1 and put together our regression line.
So let's do that.
The slope of the regression line is the correlation times the standard deviation of the mass,
over the standard deviation of the width, and that gives us 0.0988.
So what does that mean? That means that for a 1 cm increase in the width of the leaf, we expect an increase of 0.0988 grams in the mass.
The y-intercept is given by the mean mass minus the slope times the mean width, and that gives us -0.4215.
This is the equation of our regression line. Predicted mass is -0.4215 plus 0.0988 times the width.
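We can check that arithmetic directly by plugging in the summary statistics quoted above (a quick sketch; the variable names are my own):

```python
# Summary statistics from the lecture's leaf data.
r = 0.9113         # correlation between width and mass
s_width = 1.3771   # standard deviation of the widths (cm)
s_mass = 0.1493    # standard deviation of the masses (g)
mean_width = 9.295
mean_mass = 0.4968

b1 = r * s_mass / s_width          # slope
b0 = mean_mass - b1 * mean_width   # intercept

print(round(b1, 4), round(b0, 4))  # -> 0.0988 -0.4215
```

So the fitted line is: predicted mass = -0.4215 + 0.0988 * width.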
Let's give context to what we just did.
Here is an example where the intercept term, b0, doesn't have a meaningful interpretation, because a leaf can't have a width of 0, for one.
And even if they could, it doesn't make sense for a leaf to have a negative mass.
The slope tells us that for a 1 cm increase in the width of the leaf,
we expect a 0.0988 gram increase in the mass.
The linear model enables us to make predictions about the mass of the leaf,
given the value of the width, as long as the width is between 6.9 cm and 12.1 cm.
Let's be very careful here.
We cannot use the line to make predictions for leaves with widths outside this range.
And we're gonna talk about, why?
Making predictions outside of the range of the explanatory variable values that we observed is called extrapolation.
It's the use of the regression equation to make predictions about the value of the response variable corresponding to a value of the explanatory variable that's outside the range of what we observed.
And let's think about why this is bad. There are a bunch of reasons it's not a good idea.
First of all, we only observe a linear pattern between the explanatory variable and the response
in the range of the values of the explanatory variable that we observed.
We have no idea what happens outside of that range.
We might have little squigglies on each side of the line once we get out there.
We might have a curved pattern, or it might start decreasing on one side and increasing on the other.
We just have no idea.
We can't be sure that our predictions are useful if we haven't observed data in those regions.
All right, so for those reasons, it's best not to use the regression equations
to make predictions outside of the range of the data that you observed.
For example, let's use the regression equation that we found
to make a prediction about the mass of a leaf that's 1 cm wide.
We do that, we get -0.4215 plus 0.0988 times 1, which gives us -0.3227:
a negative mass for a leaf that has some positive width. That makes no sense here.
Our regression line gives us a nonsense value.
This is an example of why you shouldn't use the regression line
to make predictions outside of the range of the data that you observed.
Now, let's look at making predictions using the regression line.
What we just did was we made a prediction of sorts for value outside the range of our data.
But what about for values inside the range of our data?
We learned that making predictions for values outside of the range of our data is not a good thing to do,
but it's perfectly fine for values inside that range.
For example, let's predict the mass of a leaf that has a width of 8 cm.
Well, we get an estimated mass of -0.4215 plus 0.0988 times 8, which gives us 0.3689 grams.
This seems like a sensible value.
Given the pattern, 0.3689 grams seems pretty reasonable for a leaf that has a width of 8 cm.
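Here's a small sketch of both kinds of prediction with this line, one inside the observed range of widths and the extrapolated 1 cm one (the function name is my own):

```python
def predicted_mass(width_cm):
    """Predicted leaf mass (g) from the regression line in this lecture."""
    return -0.4215 + 0.0988 * width_cm

# In range (observed widths ran from 6.9 cm to 12.1 cm): a sensible value.
print(round(predicted_mass(8), 4))  # -> 0.3689

# Extrapolating to a 1 cm width gives a nonsense negative mass.
print(round(predicted_mass(1), 4))  # -> -0.3227
```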
Now, let's look at the scatterplot with the regression line overlaid.
We have all our data points, and then we have the line that goes through it.
It appears that the line fits the data fairly well.
There's some variation around the line, which is to be expected,
but it's not a bad fit to the data that we observed.
But how can we assess how well the model performs?
Well, the best way to do it is to look at how much variation there is around the line.
And to quantify that, we look at the residuals.
Given the distance of most of the points from the line,
and what we observed in the scatterplot, our model looks to do pretty well.
But when we examine the residuals, that's where we get a formal measure of how well our model is actually performing.
What we're gonna do now is just an example.
We're gonna find the residual for a leaf with width 9.6 cm.
We observed that its mass was 0.587 grams, so this is one of the leaves in the original data set that we had.
The predicted mass is -0.4215 plus 0.0988 times 9.6, which gives us an estimated mass of 0.52698 grams.
Our residual, then, is the observed mass minus the estimated mass, which is 0.0600.
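Checking that arithmetic (a quick sketch; the variable names are my own):

```python
observed = 0.587                       # observed mass (g) of the 9.6 cm leaf
predicted = -0.4215 + 0.0988 * 9.6     # mass predicted by the regression line
residual = observed - predicted        # observed minus predicted

print(round(predicted, 5))  # -> 0.52698
print(round(residual, 4))   # -> 0.06
```

A positive residual here means the leaf was a bit heavier than the line predicted.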
What we want to do now is look at the distribution of the residuals.
In order to see how well our model performs, it's best to revisit the residuals and look at how they're distributed.
In order to do this, we're gonna look at a histogram.
What we hope to see is something that looks pretty close to a normal distribution.
In other words, unimodal and symmetric.
We can also look at the amount by which the residuals vary by examining their standard deviation.
We calculate the standard deviation of the residuals
by summing up the squared differences between the observed values and the predicted values,
dividing by n minus 2, where n is the number of observations, and taking the square root of that.
That gives us the standard deviation of the residuals.
If we were to do that calculation for the data that we observed, and the predictions that we made,
what we would find is that the standard deviation of the residuals is 0.0631.
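That formula, as a small sketch (we don't have the raw leaf data here, so the example values are made up; the function name is my own):

```python
import math

def residual_sd(observed, predicted):
    """sqrt( sum of squared residuals / (n - 2) )."""
    n = len(observed)
    ssr = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    return math.sqrt(ssr / (n - 2))

# Made-up toy values: the residuals are 1, -1, 2, -2,
# so ssr = 10 and residual_sd = sqrt(10 / 2) = sqrt(5).
print(round(residual_sd([1, 2, 3, 4], [0, 3, 1, 6]), 4))  # -> 2.2361
```

Dividing by n minus 2 (rather than n minus 1) accounts for the two quantities we estimated from the data, the slope and the intercept.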
The histogram that's gonna pop up here to the right does seem to deviate from the normal distribution.
As we can see, it looks to be skewed to the left a little bit,
with an outlier down there on the left-hand side.
That's an indication that our residuals may not be normal.
Let's look at a normal probability plot.
Well, we see that we have some pretty clear deviations from the linear pattern
that we want in the normal probability plots.
Given what we've seen in the histogram, and the normal probability plot,
it doesn't look promising that our residuals would be normally distributed.
Let's look at the residuals just a little bit more closely.
Again, neither the histogram nor the normal probability plot
gives us any confidence that our residuals are normal, which is just what we want them to be.
Despite this, the model does seem to perform pretty well in predicting mass based on width.
The deviation from the normal distribution might be a result of the fact
that we only have 20 observations.
Perhaps if we had more, our data would look more normal.
In terms of assessing how our model performs, it's important to know how much of the variation
in our observations the model actually accounts for.
We have this R-squared quantity, which gives us exactly that measurement.
Formally, R-squared is the percentage of the variation in the response variable
that is explained by the linear regression on the explanatory variable.
It's easily calculated by squaring the correlation.
Okay, so for example, in our problem, R-squared is equal to the correlation squared,
which is 0.9113 squared, or 0.8305.
What this means is that 83.05% of the variation in leaf mass is explained by the linear relationship with width.
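That calculation is just:

```python
r = 0.9113        # correlation from the leaf data
r_squared = r ** 2

print(round(r_squared, 4))  # -> 0.8305
print(f"{r_squared:.2%}")   # -> 83.05%
```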
Now, the question is, what kind of R-squared values indicate a good fit?
There's no good rule of thumb for that; it kind of depends on the data you have
and what field you're working in.
We know that R-squared is always between 0 and 100%.
A value of 100% implies a perfect fit of the line to the data,
which is too good to be true for real data.
But typically, in scientific experiments, they aim for an R-squared between 80% and 90%.
Observational studies usually have lower R-squared values,
typically between 30% and 50%, and that range is often used as evidence
of a useful linear regression in observational studies.
All right, so when is linear regression appropriate?
Well, there are four conditions that need to be satisfied.
The first is the quantitative variable condition, which should look familiar from when we talked about correlation.
All that quantitative variable condition says is that both the explanatory and the response variable must be quantitative.
If one of them is categorical, then stop. Don't use regression.
We need to satisfy the straight enough condition, which means that we need to use the scatterplot
to see if the relationship between the explanatory variable and the response variable is roughly linear,
or straight enough to model it with a line.
We need to check the outlier condition,
which simply means look at the scatterplot and see if there are any apparent outliers.
We need to know if the plot thickens.
In other words, is there a lot more spread in some parts of the scatterplot than there are in others?
Is there a lot more spread around the line?
One example might be, going up the regression line, we have a little bit of spread here,
then a whole bunch in the middle, and then it thins out again.
That's an example of where the plot would thicken somewhere, and that's bad news for us.
If this happens, you don't wanna use linear regression.
We need to check all these conditions using the scatterplot before we carry out linear regression.
Here's the example from before.
From the scatterplot to the right, we're gonna assess these four conditions.
Using the leaf data, we know that both of those variables, width and mass, are quantitative,
so that condition is okay.
The scatterplot appears to be roughly linear,
so it seems straight enough to model the relationship with a line.
There did not appear to be any clear outliers in our scatterplot.
There's no real dramatic thickening of the plot at any point.
In this case, regression seems to be appropriate.
What do we do after regression?
Well, that's where we examine the residuals
to make sure the assumptions of the regression model are satisfied.
We wanna make a plot of residuals against the fitted values.
We're gonna check to see if there are any of the following things,
any bends in the plot that might indicate a violation of the straight enough condition,
any outliers that were not apparent before,
any change in the vertical spread from one part of the plot to the other.
A good-looking residuals versus fitted values plot would just show random scatter about 0.
Here's the residuals versus fitted values plot from our leaf mass data.
Now, we see some things that we didn't quite pick up in the original scatterplot of the data.
We see a pretty big bend there in the middle,
and that might be an indication that the straight enough condition
isn't satisfied like we thought it was from looking at the scatterplot itself.
We see the large bend. There also appears to be a thickening of the plot
around where the fitted values are between 0.5 and 0.6.
There may be some violations of the straight enough condition
and the does-the-plot-thicken condition that we need to worry about.
In regression, just like with any other statistical practice, there are common mistakes that happen.
Let's look at things that we need to be cautious of, and things that can go wrong.
First of all, do not use linear regression to model a non-linear relationship.
We wanna be aware of outliers, because remember, the slope of the regression line depends on the correlation,
and correlation is highly influenced by outliers.
These can cause a lot of problems and give us bad predictions.
Do not say, just like with correlation, that changes in x cause changes in y
just because there's a strong linear relationship between them.
Don't choose a model based on R-squared alone.
Remember that in our leaf mass data, R-squared was fairly high.
But the residuals versus fitted values plot
seemed to indicate that there are problems with the regression assumptions.
We don't wanna choose a model based on R-squared alone.
The reason is that sometimes we have a weak linear relationship,
but one outlier makes the correlation appear pretty high, which makes R-squared high.
In other words, a high R-squared value might indicate a good fit
even when our data aren't linear, but there's one outlier out there kind of pulling the strings.
Make sure you know which variable is your x variable and which is your y,
so which is your explanatory variable and which is your response.
Don't try to use your regression line to make predictions about x based on y.
In other words, don't try to predict your explanatory variable based on your response.
These are the common issues that we run into with regression.
You wanna avoid these at all times,
and if you can do that, you'll probably be able to find a pretty good fit to your data.
That's the end of Lecture 7 on Linear Regression, and we'll see you next time for Lecture 8.