Welcome to Lecture 8, in which we'll address issues with regression assumptions.
Once we do a linear regression, we like to look at the residual plots.
And when we look at the residual plots,
what we want to see, if our regression is done well, is random scatter about zero.
If there's random scatter about zero in the residuals versus fitted values plot,
this is an indication that the regression assumptions are reasonable.
Things in the residual plots that we might see that indicate violations of regression assumptions include:
a curved pattern in the residual versus fitted values plot,
outliers, and regions of the plot where the spread is larger than in others.
For example, let's look at the residuals versus fitted values plot
for the Leaf Mass Data from the regression lecture.
We see a possible outlier, we see a curved pattern, and we see places of varying spread.
So the residual plot shows some problems, but our original scatter plot seemed to show that the relationship
between the width and the leaf mass was pretty strongly linear.
So what happened?
Often we can't see these kinds of problems in the original scatter plot.
It takes examination of the residual plots to see possible violations of the underlying assumptions.
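As a minimal sketch of this check, here is a residual computation on made-up leaf data (hypothetical numbers, not the lecture's actual Leaf Mass Data): the true relationship is curved, but we fit a straight line anyway.

```python
import numpy as np

# Hypothetical leaf widths and masses -- illustrative only, not the
# lecture's Leaf Mass Data. The true relationship is curved.
rng = np.random.default_rng(0)
width = np.linspace(1, 10, 30)
mass = 0.5 * width**2 + rng.normal(0, 1, width.size)

# Fit a simple linear regression and compute the residuals.
slope, intercept = np.polyfit(width, mass, 1)
fitted = intercept + slope * width
residuals = mass - fitted

# Least-squares residuals always average to zero; what reveals a problem
# is a *pattern* when residuals are plotted against the fitted values.
print(abs(residuals.mean()) < 1e-8)
```

Plotting `residuals` against `fitted` here would show the curved pattern the lecture describes, even though the raw scatter plot of mass against width looks fairly linear.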
So now the question becomes, "How do we fix these problems?"
One problem that we might run in to is groups or subsets.
Sometimes we have small clusters of residuals in different regions of the residuals versus fitted values plot.
This indicates the presence of more than one group in our data set.
How can we fix that?
Well, the easiest way is to do a separate regression for the data from each group.
For example, suppose we are looking at the relationship
between sugar content of cereal and calories in the cereal.
Since kids' cereals in the supermarket are likely to be placed
on the lower shelves, at children's eye level, we'll probably have several groups.
These groups will be according to the shelf the cereal is on.
So we handle these things separately.
We would want to do separate regressions of calories against sugar for each of the shelves.
And we'll likely get several, very different models, one for each group or shelf.
Each of these models would look very different from the one that comes from analyzing all the data together.
So one of the important rules of regression is that all of the data must come from the same population.
In the cereal example they don't, because the different shelves are geared toward different demographics.
If you have separate groups, like we do in the cereal example, we want to analyze each group separately.
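A sketch of the per-group fix, using made-up cereal numbers (the shelf assignments, sugar grams, and calories below are illustrative, not a real supermarket data set):

```python
import numpy as np

# Hypothetical (sugar grams, calories, shelf) triples -- illustrative only.
data = [
    (1, 90, 1), (2, 95, 1), (3, 100, 1), (4, 105, 1),
    (8, 100, 2), (10, 105, 2), (12, 110, 2), (14, 115, 2),
    (3, 140, 3), (5, 150, 3), (7, 160, 3), (9, 170, 3),
]

def fit(pairs):
    """Return (intercept, slope) of a least-squares line."""
    x = np.array([p[0] for p in pairs], float)
    y = np.array([p[1] for p in pairs], float)
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# One regression per shelf -- the lecture's fix for grouped data --
# alongside the (misleading) pooled regression on all the data at once.
by_shelf = {s: fit([(x, y) for x, y, shelf in data if shelf == s])
            for s in (1, 2, 3)}
pooled = fit([(x, y) for x, y, _ in data])

for shelf, (b0, b1) in sorted(by_shelf.items()):
    print(f"shelf {shelf}: calories = {b0:.1f} + {b1:.2f} * sugar")
print(f"pooled:  calories = {pooled[0]:.1f} + {pooled[1]:.2f} * sugar")
```

The three per-shelf slopes differ from one another and from the pooled slope, which is the point: one regression over all shelves mixes populations and describes none of them well.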
We also have outliers, high leverage points, and influential points, all of which can have pronounced effects on the regression model.
Outliers, again, are observations that are far away from the rest of the data.
Outliers are going to have very large residual values. We have several types of outliers.
One of them is a high leverage point, and these have x-values that are far away from the average x-values.
So you might have one observation with an x-value that's way out to the right,
or way out to the left compared to the rest of the data.
And this can make a linear relationship appear much stronger than what it actually is.
So it's best to fit the model with all the points first,
and then try it again without the high leverage point to see how the model changes.
We might have what we call an Influential Point.
And this is a point whose removal from the data set greatly changes the regression equation.
So we need to handle this in the same way that we handle the high leverage points.
We need to conduct the analysis with all the points in the data set,
and then do it again without the possible influential points to see how the model changes.
If the model changes a lot, then the point is an influential point.
For example, let's look at this scatter plot for exit polling, where we're looking at each county in a particular state.
Notice in the scatter plot that we have one high leverage point way out to the right,
and one possible influential point with a y-value that's really high.
We have two candidates, and we're doing a linear regression of one candidate's vote count versus the other's.
So we're going to do a few different regressions. Let's include all the points first.
The regression equation we get out is the estimated candidate 2 vote count
is 414.2601 plus 0.99474 times the vote count for candidate 1.
The correlation between the vote count for candidate 2 and the vote count for candidate 1 is 0.6392.
Now, let's take out the high leverage point at 21,000 for candidate 1.
The new regression equation is the estimated candidate 2 vote count
is 224.7384 plus 0.14772 times the vote count for candidate 1. The correlation's 0.5652.
Now, let's try it one more time where we remove the influential point
where candidate 1 has a vote count of 6,800 votes.
The new regression equation without that influential point
is the estimated vote count for candidate 2 is 388.6 plus 0.0079 times the vote count for candidate 1.
The new correlation is 0.9226. So what happened?
The high leverage point increased the correlation, and the influential point greatly decreased the correlation.
Both kinds of points have a significant impact on the slope of the regression line, as well as the intercept.
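The fit-with-and-without comparison can be sketched in a few lines. The county vote counts below are hypothetical, chosen only to show how one far-right point inflates the correlation; they are not the exit-polling data from the lecture.

```python
import numpy as np

# Hypothetical county-level vote counts -- illustrative numbers only.
# The last county is a high leverage point, far to the right of the rest.
cand1 = np.array([200, 300, 400, 500, 600, 700, 21000], float)
cand2 = np.array([300, 250, 500, 400, 700, 550, 20500], float)

def fit_summary(x, y):
    """Return (intercept, slope, correlation) for a simple linear regression."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return intercept, slope, r

all_pts = fit_summary(cand1, cand2)
no_leverage = fit_summary(cand1[:-1], cand2[:-1])  # drop the far-right county

print(f"all points:       slope = {all_pts[1]:.4f}, r = {all_pts[2]:.4f}")
print(f"leverage removed: slope = {no_leverage[1]:.4f}, r = {no_leverage[2]:.4f}")
```

With the leverage point included the correlation is nearly 1, suggesting a much stronger linear relationship than the remaining six counties actually support, which is exactly why the lecture says to fit the model both ways.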
Another problem we might run into is Lurking Variables.
We need to be careful of those, because no matter how high the correlation is,
we cannot infer cause from observational data.
If we have a strong linear relationship between two variables,
that does not imply that changes in one cause changes in the other,
because we can't be sure that a variable isn't hanging out in the background
that's actually the cause of the association.
So we have these problems in our data and with our regression assumptions,
and now we need to know how to fix them.
A common way to fix these problems with the regression assumptions, is to transform one or both variables.
And this will help fix problems in the scatter plot or in the residuals versus fitted values plot.
So what are the goals of transformations?
Well, one goal is to make the distribution of a variable more symmetric.
And we can assess whether or not this has worked by using a histogram.
We might want to make the spread of multiple groups more alike, even if their centers differ.
And we can assess whether or not this has worked by looking at side-by-side boxplots.
We may also want to make the form of a scatterplot more nearly linear.
And we might want to make the spread in the scatterplot more even throughout the plot,
instead of having thicker parts of the plot and thinner parts of the plot.
So we need to know what transformations are appropriate to fix which problems.
So we have this concept called the Ladder of Powers.
And we have 2, 1, 1/2, 0, and -1.
Two, is where we square the response variable values.
And where this is useful, is if we have unimodal distributions that have a skew to the left.
One, means no change.
This is the raw response, we're not doing anything to the response variable, this is our "home base".
This is our original data.
One-half, represents the square root transformations,
so we're taking the square root of all the response values.
And this is really good for count data.
Zero is the log transformation, so we're taking the natural logarithm of our response values.
Where this is useful, is in measurements that cannot be negative,
and for values that grow by percentage increases.
For instance, salaries, and populations.
Negative one, that's the negative reciprocal.
This is -1 over the response, and this is good for ratio-type responses:
taking the reciprocal reverses the original ratio in which the response values are measured
(say, gallons per mile instead of miles per gallon), and the negative sign preserves the direction of the relationship.
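The rungs of the ladder can be written as a small helper. A minimal sketch (the function name `ladder` is my own, not the lecture's), with the two special rungs handled explicitly:

```python
import math

# Ladder-of-powers transformation of a response value y > 0.
# The "0" rung means the natural log (not y**0), and the "-1" rung is the
# *negative* reciprocal, which preserves the direction of the relationship.
def ladder(y, power):
    if power == 0:
        return math.log(y)      # log transformation
    if power == -1:
        return -1.0 / y         # negative reciprocal
    return y ** power           # 2, 1, 1/2, ...

# Apply every rung to the same response value.
print([round(ladder(4.0, p), 3) for p in (2, 1, 0.5, 0, -1)])
```

Running this on a response value of 4 gives the square, the raw value, the square root, the natural log, and the negative reciprocal in turn.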
So for example, let's suppose we have the following data set with 15 observations.
So there's all the x-y pairs.
And now we look at the scatterplot, and what do we see?
We see a clear curved pattern in there,
and it looks like y might be related to the explanatory variable through the relationship y = x squared.
So if we want to use linear regression, how might we fix this problem?
Well, the natural idea would be to take the square root of the responses,
and then plot those against the x-variables.
So here's the scatterplot once we've made that transformation.
And now we can see that it looks much more linear than it did before.
So we might want to try a regression of the square root of y against x.
This transformation appears to have fixed the relationship between x and y
so that it's more nearly linear.
So this gives us an indication that we might be able to model the square root of y
using x as an explanatory variable with linear regression.
We still have to examine the residual plots after our analysis is complete
in order to see if any violations of the regression assumptions are apparent.
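The before-and-after comparison can be made numerical by checking how linear the scatter is on each scale. The 15 observations below are hypothetical (generated from a y = x-squared pattern plus noise), not the lecture's actual data set:

```python
import numpy as np

# Hypothetical data with a y = x**2 pattern plus noise -- not the
# lecture's 15-observation data set.
rng = np.random.default_rng(1)
x = np.linspace(2, 16, 15)
y = x**2 + rng.normal(0, 1, x.size)

# Correlation measures how close each scatterplot is to a straight line.
r_raw = np.corrcoef(x, y)[0, 1]            # x versus y (curved)
r_sqrt = np.corrcoef(x, np.sqrt(y))[0, 1]  # x versus sqrt(y) (straightened)

print(f"r(x, y)       = {r_raw:.4f}")
print(f"r(x, sqrt(y)) = {r_sqrt:.4f}")
```

The correlation on the square-root scale is noticeably closer to 1, matching what the transformed scatterplot shows, though as the lecture says we'd still examine the residual plots after fitting the transformed model.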
Sometimes the ladder of powers transformations don't work well,
so we focus on the logarithmic transformations, and there are three types of them.
One of them is the one that's mentioned in the ladder of powers:
the zero transformation, taking the natural log of the response.
We call this the exponential transformation.
And this is useful for values that grow by percentage increases.
We might also transform the explanatory variable by taking the natural log of the x-values.
This is known as the logarithmic model, and this is useful when the scatterplot changes rapidly at the left,
but then levels off at the right.
And finally, there's the in between transformation, known as the power transformation.
And for this transformation, what we do is we take the natural log of both the explanatory variable,
and the response variable.
And this is good when neither of the transformations of the response,
nor the explanatory variable work well by themselves, but we need something in between the two.
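The three logarithmic models amount to transforming one or both variables and then running an ordinary linear regression. A sketch with noise-free hypothetical data, one toy relationship per model, so the recovered slopes are exact:

```python
import numpy as np

# Hypothetical noise-free data, one example per logarithmic model.
x = np.linspace(1, 20, 20)
y_exp = 3.0 * 1.2**x            # grows by percentage increases
y_log = 5.0 + 2.0 * np.log(x)   # changes fast at the left, levels off
y_pow = 2.0 * x**1.5            # power relationship

# Exponential model: regress log(y) on x.
slope_e, _ = np.polyfit(x, np.log(y_exp), 1)
# Logarithmic model: regress y on log(x).
slope_l, _ = np.polyfit(np.log(x), y_log, 1)
# Power model ("in between"): regress log(y) on log(x).
slope_p, _ = np.polyfit(np.log(x), np.log(y_pow), 1)

print(f"exponential slope = {slope_e:.4f}  (log of the 1.2 growth factor)")
print(f"logarithmic slope = {slope_l:.4f}")
print(f"power-model slope = {slope_p:.4f}  (the exponent 1.5)")
```

Each transformed scatterplot is exactly linear here, so the regression recovers the generating parameters: the log of the growth factor for the exponential model, the coefficient of log(x) for the logarithmic model, and the exponent for the power model.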
So here are the common issues with the regression assumptions that we might run into and need to be aware of.
First of all, we need to make sure that the relationship between our response variable, or transformed response variable,
and our explanatory variable is straight; it's linear.
We need to look out for different groups in the regression analysis.
We do not want to extrapolate.
Again, that means don't try to make predictions for values
of the response that correspond to values of the explanatory variable
that are outside the range of what you observed.
We need to be careful to look for unusual points, high leverage points, and influential points.
We need to compare two regression models, one with and one without the unusual points, to examine whether their impact on the model is strong.
We need to be careful if our data have multiple modes, because this can indicate groups.
We need to be aware of lurking variables,
because even though the relationship between two variables might be strongly linear,
there might be something hanging out in the background that might be causing that association,
and it doesn't necessarily mean that changing the value of the explanatory variable
is causing the change in the response.
We need to be careful not to use regression to imply cause.
And don't ever expect your model to be perfect; none of them are.
We don't want to stray too far from the ladder of powers when we're trying to transform data.
Again, we don't want to choose a model based only on R-squared,
because correlation, and R-squared in turn, can be affected by outliers
or influential points, which may increase the value of R-squared dramatically.
And finally, one more time, be careful if your data have multiple modes,
because this might indicate groups,
and so you might want to do a separate regression analysis for each of these groups.
All right, these are the common issues that come up in checking the regression assumptions.
We've learned how to deal with transformations, and how to try to make our data more linear,
and how to fix problems with the regression assumptions.
This is the end of Lecture 8, and we'll see you back here for Lecture 9.