Welcome to Lecture 6, where we're gonna discuss scatterplots and correlation.
Let's introduce ourselves to these two concepts.
When we look at scatterplots and correlation,
what we're interested in doing is examining the relationships between two quantitative variables.
We can graphically represent this relationship using what we know as a scatterplot.
In order to make a scatterplot, what we have to do is we simply put the values of one variable
going across the X-axis and the values of the other variable going along the Y-axis.
What can we tell from a scatterplot?
Well, one thing that we can tell is what happens to the value of one variable
as the value of the other variable increases.
We can also tell whether or not the relationship between the two variables is linear.
These are examples of what we call associations between two quantitative variables.
Here's an example of what a scatterplot looks like.
We're looking at an example here of examining the relationship between the age of a car in years
and the mileage of the car in thousands.
We observe 10 cars with ages and miles given by the points that you see here
and we make a scatterplot with the age of the car on the X-axis and the mileage of the car on the Y-axis.
This is what it looks like.
Now, what can we say about the relationship between the age of the car and the mileage of the car?
The scatterplot tells us that the mileage of the car tends to increase as the age increases
which is what we might expect: an older car should have more miles on it.
The data seem to fit a roughly linear pattern.
In other words, there appears to be a linear association between the age of the car and the miles on the car.
What else do we wanna look for when we examine a scatterplot?
Well, we wanna look at the direction of the relationship;
we wanna look for what we call the form; we wanna look for what we call the strength of the relationship;
and we wanna look for outliers or subgroups.
Direction, which way does the association go? Direction can be positive, negative, or neither.
If the relationship is in a positive direction, then that means that as the value of one variable increases,
the value of the other variable increases.
If the relationship goes in a negative direction,
then as the value of one variable increases, the value of the other variable decreases.
They're going in opposite directions.
If the relationship doesn't go in either direction,
then there are some places where the relationship is positive
and some places where the relationship is negative
or there's just no clear positive or negative association between the two variables.
In our example, age and mileage appear to increase together,
so we would say that they have a positive association.
What if we plot one over the mileage against the age?
Then the association turns negative as we can see in the scatterplot.
As the age increases, one over the mileage decreases.
Now, we move on to the form of an association. Is it linear or is it not linear?
The form of an association between two variables refers
to whether the association between them is linear, curved, or if there is no pattern.
For example, the association between age and miles appears to be roughly linear.
The association between one over the miles and the age of the car seems to be curved; it comes down like that.
If there is no association between two variables,
the scatterplot should show something like a cloud pattern, just a random scatter of points.
Here's an example of the scatterplot that shows no clear association.
It just looks like a cloud of points put in a plot.
We see no clear association between them; no direction, no form, nothing.
The next thing we assess when we look at a scatterplot is the strength of an association.
In other words, how closely do the points fit a particular pattern?
So when we talk about the strength of an association between two quantitative variables,
we're basically asking ourselves how tightly clustered the points are around the form of the relationship.
In the scatterplot, we see a strong association and it's a linear association
and these points are pretty tightly-packed around the line.
If we wanna look at a weak linear association, we can look at this example here.
We see that there's a roughly linear pattern
and it does appear to go in a positive direction, but this linear association is very, very weak.
What about outliers?
An outlier is one observation that stands far away from the overall pattern of the scatterplot.
In the plot that we see here, we have a strong linear pattern between X and Y
but then we have one point that has an extremely high Y value around Y = 4 or 5.
We would also be interested in looking for subgroups.
When we talk about subgroups, what we mean is a cluster of observations
that stands away from the rest of the plot or that trends in a different direction from the rest of the plot.
Here, we have an example of subgroups and this particular example satisfies both of those.
We have one negative linear relationship and then one small positive linear relationship,
so we have two subgroups going in different directions that are separated from each other.
There's two clusters there.
We have the X and the Y variables but we have to make sure we determine accurately which one is which.
Which do we call X and which do we call Y?
Some possible questions where this might come up:
For example, do baseball teams that score more runs sell more tickets to their games?
Do older houses sell for less than newer ones of comparable size and quality?
Do students who score higher on their SAT have higher grade point averages in college?
The important thing here is that if we understand the question that we wanna answer,
the question that we wanna answer tells us which variable we wanna call X and which variable we wanna call Y.
In other words, which variable goes on which axis.
In order to determine which variable goes on which axis, we have to understand which role each variable plays.
We have two variables that we're interested in; one is the response variable.
The response variable is the one that we measure; it's the variable that goes on the Y-axis.
This is the one whose value we expect to change based on changes in the other variable.
The other is the explanatory variable, which goes on the X-axis.
In the baseball question, X is the explanatory variable and it corresponds to the number of runs scored.
We're looking to see if the number of runs scored elicits a change in the number of tickets sold.
That makes Y, the ticket sales, the response variable.
Now we know which variable to plot on the Y-axis and which variable to plot on the X-axis.
What about the next question?
Do older homes sell for less than newer ones of comparable size and quality?
Well, here we're interested in how the sale price of a home changes with its age,
with all other things being about equal.
Here, the age of the house is the explanatory variable.
It would go on the X-axis. The sale price is the response variable.
That would be the one that goes on the Y-axis. What about the third question?
Do students who score higher on their SAT have higher grade point averages in college?
Well, here we're interested in seeing how GPA changes with variations in the SAT score.
Thus, in this case, the SAT score's the explanatory variable and the response variable is the college GPA.
Now, we have a formal measurement of the strength of a linear association between two quantitative variables.
We call this measure correlation.
Formally, what correlation does is it measures the strength of the linear association between two quantitative variables.
Notice that the word linear is involved.
Correlation applies to linear relationships, not curved relationships.
For curved relationships, correlation is not an appropriate measure of the strength.
Correlation also tells you the direction of the linear relationship.
Correlation has the property that it takes a value between -1 and 1.
Now, the question is, once we have that number, how do we interpret it?
Well, the strength of the linear association
between two variables is given by the absolute value of the correlation.
Then if 2 variables have a correlation of -0.85,
their linear relationship is just as strong as the linear relationship between two variables
whose correlation is +0.85.
The sign of the correlation is what tells us the direction.
A negative correlation means a negative relationship
and a positive correlation means a positive direction. How do we calculate correlation?
Well, first of all, we need to know that correlation has no units.
To ensure that units don't matter, we have to standardize each X and Y value.
What we do is we calculate the Z scores for each X and the Z scores for each corresponding Y.
We also calculate the sample standard deviations of the Xs and the sample standard deviations of the Ys.
Once we have the Z scores for each X and for each Y,
we take the product of the Z scores for each X and the corresponding Y and we add up these products.
We then divide this by N minus one where N is the number of observations.
Okay, so mathematically, here's the formula.
R is equal to this sum. We basically take the sum of the products of the Z scores divided by N minus one.
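Written out, the formula just described looks roughly like this. The symbols here are notation chosen for this sketch, not necessarily what the slide uses: n is the number of observations, x-bar and y-bar are the sample means, and s_x and s_y are the sample standard deviations.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i}\, z_{y_i},
\qquad
z_{x_i} = \frac{x_i - \bar{x}}{s_x},
\qquad
z_{y_i} = \frac{y_i - \bar{y}}{s_y}
```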
R is the correlation. Let's do an example where we calculate correlation.
Okay, so using the data from before with age as the explanatory variable
and mileage as the response variable, we're gonna get the following ZX-ZY pairs.
We'll just do it for Z-one or for the first observation and then we'll just list the rest of them.
We know that the sample mean for the Xs is 2.69, the sample mean for the Ys is 29.33,
the sample standard deviation for the Xs is 1.46, and the sample standard deviation for the Ys is 20.471.
We calculate the Z score for X-one, which is one minus 2.69
over the standard deviation of the Xs, and the Z score for Y-one,
which is 10.5 minus 29.33, the mean, divided by the standard deviation of the Ys.
We get the Z score pair of minus 1.1598 and minus 0.9198.
We do the same thing for the other 9 pairs.
Down below you see the list of all the pairs of Z scores that we have from our ten observations.
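As a quick check on that first pair, here is a minimal Python sketch that plugs in the rounded summary statistics quoted above. The function name z_score is just a label chosen for this sketch. With the rounded mean and standard deviation, the age Z score comes out near -1.1575 rather than -1.1598; presumably the lecture's figure was computed from unrounded values, but that is an assumption.

```python
# Summary statistics as quoted in the lecture (rounded values).
mean_x, sd_x = 2.69, 1.46      # age of the car, in years
mean_y, sd_y = 29.33, 20.471   # mileage, in thousands

def z_score(value, mean, sd):
    """Standardize a value: how many standard deviations it sits from the mean."""
    return (value - mean) / sd

# First observation: a 1-year-old car with 10.5 thousand miles.
z_x1 = z_score(1, mean_x, sd_x)
z_y1 = z_score(10.5, mean_y, sd_y)

print(round(z_x1, 4), round(z_y1, 4))  # -1.1575 -0.9198
```

Both Z scores are negative, so this car is below average in age and below average in mileage, and the product of the two is positive.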
Let's finish it. We have the Z scores, so now, we can find the product of the Z scores.
We get 1.0668 by multiplying the first two Z scores and then the remaining nine are given in green at the end.
We add up all those products and what we get is 8.634. Note that that's positive.
That's important. Now, we have to divide by the number of observations minus one.
We divide by nine and that gives us a correlation of 0.9593. What that tells us, then,
is that the correlation between mileage and age is 0.9593.
What this indicates is a strong linear relationship in a positive direction between the age of a car and its mileage.
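The whole procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name correlation is chosen for this sketch, and the ages and miles lists are made-up values, not the ten cars from the lecture, since those values aren't listed here. The last step repeats the lecture's own arithmetic, 8.634 divided by 9.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of Z score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data for illustration only (NOT the lecture's ten cars).
ages = [1, 2, 3, 4, 5]                   # years
miles = [12.0, 21.5, 35.0, 41.0, 58.5]   # thousands of miles

r = correlation(ages, miles)

# The lecture's final step: sum of products 8.634 divided by 10 - 1.
r_lecture = 8.634 / 9
print(round(r_lecture, 4))  # 0.9593
```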
Before we use correlation, we have to make sure that it's an appropriate measure
of the strength of the relationship between two variables.
Our data have to satisfy three conditions in order for correlation to be helpful.
First of all, we have to make sure that our variables are quantitative.
Secondly, we have to satisfy the straight enough condition.
In other words, if we look at a scatterplot, the relationship between the two variables should be roughly linear.
This is a judgment call but it's usually a fairly easy one to make.
We also don't want any outliers.
If we look at a scatterplot and we see a pair of outliers, this can really mess up our correlation.
This can make a weak linear association appear strong
or it can make a strong linear association appear to be fairly weak.
In general, outliers can really distort what the correlation is
and give false impressions about the strength of the linear relationship between two quantitative variables.
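To see how much a single outlier can distort things, here is a small Python sketch. The data are made up for illustration: eight points that sit very close to a straight line, plus one hypothetical point with an extremely high Y value. The function name correlation is a label chosen for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of Z score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data: eight points close to the line y = x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2]
r_clean = correlation(xs, ys)

# Add one hypothetical outlier: small X, extremely high Y.
r_outlier = correlation(xs + [2], ys + [30.0])

print(round(r_clean, 3), round(r_outlier, 3))
```

In this made-up example the eight clean points have a correlation above 0.99, and the one extra point is enough to drag it down to roughly zero, even though the other eight points haven't moved.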
There are certain properties that come with correlation.
We can learn a lot from it. There are seven key characteristics of a correlation.
The first, which we've already addressed,
is that the sign of the correlation gives the direction of the association between the two variables.
We also mentioned that correlation is always between -1 and 1.
It can be exactly -1 or +1,
which would indicate that all of the data fall on a single, perfect, straight line.
Correlation is a symmetric quantity.
In other words, the correlation between X and Y is exactly the same as the correlation between Y and X.
Correlation has no units.
There's no units attached to correlation, it's just a number.
Since correlation has no units, it's not affected by a change in the scale or center of either variable.
Correlation measures the strength of a linear association between two variables.
Thus, two variables might be strongly associated but may have small correlation if the association is not linear.
Correlation is highly sensitive to outliers.
A single outlier can make a small correlation large or a large correlation quite small.
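A few of these properties are easy to check numerically. The following Python sketch uses made-up data and a helper named correlation chosen for this illustration. It checks symmetry, checks that changing the units (a positive rescaling plus a shift) leaves the correlation alone, and checks that a perfectly curved relationship can have a correlation of essentially zero.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of Z score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.5, 5.0, 8.5, 9.0, 12.5]

# Symmetry: swapping the roles of X and Y changes nothing.
r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)

# No units: rescale X (say, years to months) and shift and rescale Y.
xs_months = [12 * x for x in xs]
ys_shifted = [1.6 * y + 100 for y in ys]
r_rescaled = correlation(xs_months, ys_shifted)

# Linear only: Y is completely determined by X here (y = x squared),
# but the pattern is curved, so the correlation is essentially zero.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x * x for x in curve_x]
r_curve = correlation(curve_x, curve_y)

print(round(r_xy, 4) == round(r_yx, 4), round(r_xy, 4) == round(r_rescaled, 4))
```

The curved case is the reason the scatterplot matters: the association there is as strong as it can be, but correlation doesn't see it.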
One thing that we wanna really be careful of is associating correlation and cause.
These are not the same thing. We ask ourselves:
Does a high correlation mean that a change in one variable causes a change in the other?
We have a tendency to say that.
Just because two variables are highly correlated or very closely related,
it seems to indicate to us that the change in one causes the change in the other.
This isn't necessarily true.
For instance, here's an example.
If we look at house fires, there's a positive correlation
between the number of firefighters at the scene and the amount of damage that was done.
That means that the more firefighters we have, the more damage tends to be done.
Does that mean that more firefighters cause more damage?
Does this mean that because there's more damage with more firefighters we shouldn't call the fire department?
Now, I mean, they do cause some; certainly with more firefighters,
there's more water and they're chopping more holes,
but the size of the fire certainly impacts the number of firefighters that are called and the amount of damage.
The size of the fire is what we would call a lurking variable.
Let's look at what lurking variables are.
Those are the kinds of variables that hide out in the background and we don't see them.
A lurking variable is a hidden variable that stands behind the relationship
between the two quantitative variables that we're interested in
and determines that relationship by simultaneously affecting both of the variables.
In our example, the size of the fire is a lurking variable.
This can have huge effects on the relationships between two variables.
In many cases, the cause of the relationship between two quantitative variables
is the product of a lurking variable and not necessarily
the explanatory variable affecting the value of the response.
This is why we can't necessarily say that a high correlation is an indication of a causal relationship
between the explanatory variable and the response.
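Here is a small simulation sketch of the firefighter example, in Python. Everything in it is a made-up assumption for illustration: the numbers, the noise levels, and the idea that fire size alone drives both variables. In this toy model the number of firefighters never enters the damage calculation, yet the two still come out highly correlated.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation as the sum of Z score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical model: the size of the fire is the lurking variable.
fire_size = [rng.uniform(1, 10) for _ in range(200)]
crew_noise = [rng.gauss(0, 1) for _ in fire_size]
damage_noise = [rng.gauss(0, 3) for _ in fire_size]

# Both variables depend on fire size; neither depends on the other.
firefighters = [2 * s + e for s, e in zip(fire_size, crew_noise)]
damage = [5 * s + e for s, e in zip(fire_size, damage_noise)]

r_observed = correlation(firefighters, damage)

# Once fire size is taken out, only unrelated noise is left.
r_leftover = correlation(crew_noise, damage_noise)

print(round(r_observed, 2), round(r_leftover, 2))
```

The observed correlation is high, but once the effect of fire size is removed, what's left should be close to zero. That is the pattern a lurking variable produces.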
In using correlation, there are a bunch of mistakes that we can often make.
We wanna be careful of that. First, do not say correlation when you mean association.
Those aren't the same thing.
Association means a relationship.
Correlation is a numerical measure that quantifies the strength of a linear relationship between two variables.
Correlation quantifies the strength of the linear association where association is a relationship.
Do not try to find the correlation between two categorical variables.
It doesn't make any sense to do that. Don't confuse correlation with causation.
This is one of the biggest issues that we run into with correlation and it's the one that we just addressed.
Make sure that the association between your variables is a linear relationship or a linear association
before you try to use correlation to describe the strength of the relationship.
Do not assume that the relationship is linear simply because the correlation is high.
We might have a non-linear relationship and an outlier that makes the correlation high, for instance.
Also, don't assume that it's non-linear because the correlation is low. Outliers can also cause that problem.
Outliers can have a significant impact on a correlation and, in fact, they often do.
That's why it's best to not just rely on the correlation
to assess the relationship between two quantitative variables
but you also wanna look at the scatterplot to see if there are outliers affecting things.
These are the common issues when using correlation.
We wanna be careful to avoid these.
This is the end of Lecture 6 about scatterplots and correlation and we'll see you next time for Lecture 7.