Standardizing Data and the Normal Distribution Part 2

by David Spade, PhD


Transcript

00:01 Welcome to Lecture 6, where we're gonna discuss scatterplots and correlation.

00:05 Let's introduce ourselves to these two concepts.

00:09 When we look at scatterplots and correlation, what we're interested in doing is examining the relationships between two quantitative variables.

00:16 We can graphically represent this relationship using what we know as a scatterplot.

00:22 In order to make a scatterplot, what we have to do is we simply put the values of one variable going across the X-axis and the values of the other variable going along the Y-axis.

00:33 What can we tell from a scatterplot? Well, one thing that we can tell is what happens to the value of one variable as the value of the other variable increases? We can also tell whether or not the relationship between the two variables is linear.

00:46 These are examples of what we call associations between two quantitative variables.

00:53 Here's an example of what a scatterplot looks like.

00:57 We're looking at an example here of examining the relationship between the age of a car in years and the mileage of the car in thousands.

01:07 We observe 10 cars with ages and miles given by the points that you see here and we make a scatterplot with the age of the car on the X-axis and the mileage of the car on the Y-axis.

01:20 This is what it looks like.

01:21 Now, what can we say about the relationship between the age of the car and the mileage of the car? The scatterplot tells us that the mileage of the car tends to increase as the age increases, which is what we might expect: an older car should have more miles on it.

01:37 The data seem to fit a roughly linear pattern.

01:41 In other words, there appears to be a linear association between the age of the car and the miles on the car.

01:46 What else do we wanna look for when we examine a scatterplot? Well, we wanna look at the direction of the relationship; we wanna look for what we call the form; we wanna look for what we call the strength of the relationship; and we wanna look for outliers or subgroups.

02:03 Direction, which way does the association go? Direction can be positive, negative, or neither.

02:11 If the relationship is in a positive direction, then that means that as the value of one variable increases, the value of the other variable increases.

02:20 If the relationship goes in a negative direction, then as the value of one variable increases, the value of the other variable decreases.

02:29 They're going in opposite directions.

02:31 If the relationship doesn't go in either direction, then there are some places where the relationship is positive and some places where the relationship is negative or there's just no clear positive or negative association between the two variables.

02:43 In our example, age and mileage appear to increase together, so we would say that they have a positive association.

02:50 What if we plot one over the mileage against the age? Then the association turns negative as we can see in the scatterplot.

02:58 As the age increases, one over the mileage decreases.
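As a rough illustration of how a transformation can flip the direction, the sketch below computes the correlation of age with mileage and of age with one over the mileage. The six data points are hypothetical (the lecture's ten observations are not listed in this transcript), and `correlation` is a helper written for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of z-score products divided by n - 1 (sample standard deviations)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical cars: age in years, mileage in thousands (not the lecture's data).
age = [1, 2, 3, 4, 5, 6]
miles = [10, 22, 35, 41, 58, 70]
inverse_miles = [1 / m for m in miles]

r_miles = correlation(age, miles)            # positive direction
r_inverse = correlation(age, inverse_miles)  # negative direction

print(f"age vs. miles:   r = {r_miles:.3f}")
print(f"age vs. 1/miles: r = {r_inverse:.3f}")
```

The sign changes because one over the mileage shrinks as the mileage grows, so it moves opposite to age.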

03:02 Now, we move on to the form of an association. Is it linear or is it not linear? The form of an association between two variables refers to whether the association between them is linear, curved, or if there is no pattern.

03:17 For example, the association between age and miles appears to be roughly linear.

03:23 The association between one over the miles and the age of the car seems to be curved; it comes down like that.

03:30 If there is no association between two variables, the scatterplot should show something like a cloud pattern, just a random scatter of points.

03:40 Here's an example of the scatterplot that shows no clear association.

03:44 It just looks like a cloud of points put in a plot.

03:47 We see no clear association between them; no direction, no form, nothing.

03:53 The next thing we assess when we look at a scatterplot is the strength of an association.

04:00 In other words, how closely do the points fit a particular pattern? When we talk about the strength of an association between two quantitative variables,

04:07 we're basically asking ourselves how tightly clustered the points are around the form of the relationship.

04:14 In the scatterplot, we see a strong association and it's a linear association and these points are pretty tightly-packed around the line.

04:23 If we wanna look at a weak linear association, we can look at this example here.

04:29 We see that there's a roughly linear pattern and it does appear to go in a positive direction, but this linear association is very, very weak.

04:37 What about outliers? An outlier is one observation that stands far away from the overall pattern of the scatterplot.

04:45 In the plot that we see here, we have a strong linear pattern between X and Y but then we have one point that has an extremely high Y value around Y = 4 or 5.

04:57 We would also be interested in looking for subgroups.

05:02 When we talk about subgroups, what we mean is a cluster of observations that stands away from the rest of the plot or that trends in a different direction from the rest of the plot.

05:10 Here, we have an example of subgroups and this particular example satisfies both of those.

05:17 We have one negative linear relationship and then one small positive linear relationship, so we have two subgroups that go in different directions and are separated from each other.

05:28 There are two clusters there.

05:30 We have the X and the Y variables but we have to make sure we determine accurately which one is which.

05:37 Which do we call X and which do we call Y? Some possible questions where this might come up: Do baseball teams that score more runs sell more tickets to their games? Do older houses sell for less than newer ones of comparable size and quality? Do students who score higher on their SAT have higher grade point averages in college? The important thing here is that if we understand the question that we wanna answer, that question tells us which variable we wanna call X and which variable we wanna call Y.

06:11 In other words, which variable goes on which axis.

06:14 In order to determine which variable goes on which axis, we have to understand which role each variable plays.

06:23 We have two variables that we're interested in; one is the response variable.

06:27 The response variable is the one that we measure; it's the variable that goes on the Y-axis.

06:33 This is the one whose value we expect to change based on changes in the other variable.

06:39 In the baseball example, X is the explanatory variable and it corresponds to the number of runs scored.

06:45 We're looking to see if the number of runs scored elicits a change in the number of ticket sales.

06:52 That makes Y, the ticket sales, the response variable.

06:56 Now we know which variable to plot on the Y-axis and which variable to plot on the X-axis.

07:01 What about the next question? Do older homes sell for less than newer ones of comparable size and quality? Well, here we're interested in how the sale price of a home changes with its age, with all other things being about equal.

07:16 Here, the age of the house is the explanatory variable.

07:20 It would go on the X-axis. The sale price is the response variable.

07:25 That would be the one that goes on the Y-axis. What about the third question? Do students who score higher on their SAT have higher grade point averages in college? Well, here we're interested in seeing how GPA changes with variations in the SAT score.

07:40 Thus, in this case, the SAT score's the explanatory variable and the response variable is the college GPA.

07:48 Now, we have a formal measurement of the strength of a linear association between two quantitative variables.

07:57 We call this measure correlation.

07:59 Formally, what correlation does is it measures the strength of the linear association between two quantitative variables.

08:05 Notice that linearity is built in: correlation applies to linear relationships, not curved relationships.

08:12 For curved relationships, correlation is not an appropriate measure of the strength.

08:17 Correlation also tells you the direction of the linear relationship.

08:21 Correlation has the property that it takes a value between -1 and 1.

08:28 Now, the question is, once we have that number, how do we interpret it? Well, the strength of the linear association between two variables is given by the absolute value of the correlation.

08:39 So if two variables have a correlation of -0.85, their linear relationship is just as strong as the linear relationship between two variables whose correlation is +0.85.

08:52 The sign of the correlation is what tells us the direction.

08:56 A negative correlation means a negative relationship and a positive correlation means a positive direction. How do we calculate correlation? Well, first of all, we need to know that correlation has no units.

09:09 To ensure that units don't matter, we have to standardize each X and Y value.

09:14 What we do is we calculate the Z scores for each X and the Z scores for each corresponding Y.

09:21 To get those Z scores, we need the sample mean and the sample standard deviation of the Xs, and the sample mean and the sample standard deviation of the Ys.

09:27 Once we have the Z scores for each X and for each Y, we take the product of the Z scores for each X and the corresponding Y and we add up these products.

09:37 We then divide this by N minus one where N is the number of observations.

09:42 Okay, so mathematically, here's the formula.

09:45 R is equal to this sum. We basically take the sum of the products of the Z scores divided by N minus one.
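The formula is shown on the slide rather than in the transcript. Written out from the spoken description, it would look roughly like the following, where the symbols $\bar{x}$, $\bar{y}$ (sample means) and $s_x$, $s_y$ (sample standard deviations) are notation assumed here:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i}\, z_{y_i},
\qquad
z_{x_i} = \frac{x_i - \bar{x}}{s_x},
\qquad
z_{y_i} = \frac{y_i - \bar{y}}{s_y}
```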

09:52 R is the correlation. Let's do an example where we calculate correlation.

09:57 Okay, so using the data from before with age as the explanatory variable and mileage as the response variable, we're gonna get the following ZX-ZY pairs.

10:06 We'll just do it for Z-one or for the first observation and then we'll just list the rest of them.

10:14 We know that the sample mean for the Xs is 2.69, the sample mean for the Ys is 29.33, the sample standard deviation for the Xs is 1.46, and the sample standard deviation for the Ys is 20.471.

10:31 We calculate the Z score for X-one, which is one minus 2.69 over the standard deviation of the Xs, and the Z score for Y-one, which is 10.5 minus the mean, 29.33, divided by the standard deviation of the Ys.

10:46 We get the pair of Z scores minus 1.1598 and minus 0.9198.

10:53 We do the same thing for the other 9 pairs.

10:56 Down below you see the list of all the pairs of Z scores that we have from our ten observations.
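The standardization of the first observation can be sketched in a few lines, using only the summary statistics quoted in the lecture. Because those are rounded (the standard deviation of the Xs is given as 1.46), the Z score for X-one comes out slightly different in the third decimal place from the value on the slide; the variable names are chosen for this sketch.

```python
# Summary statistics as quoted in the lecture (rounded values).
x_bar, s_x = 2.69, 1.46      # age: sample mean and sample standard deviation
y_bar, s_y = 29.33, 20.471   # mileage: sample mean and sample standard deviation

# First observation: a 1-year-old car with 10.5 thousand miles.
x1, y1 = 1, 10.5

# Z score = (value - mean) / standard deviation
z_x1 = (x1 - x_bar) / s_x
z_y1 = (y1 - y_bar) / s_y

print(f"z_x1 = {z_x1:.4f}, z_y1 = {z_y1:.4f}")
```

Both Z scores are negative, which says this car is younger than average and has fewer miles than average.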

11:03 Let's finish it. We have the Z scores, so now, we can find the product of the Z scores.

11:09 We get 1.0668 by multiplying the first two Z scores and then the remaining nine are given in green at the end.

11:18 We add up all those products and what we get is 8.634. Note that that's positive.

11:26 That's important. Now, we have to divide by the number of observations minus one. We divide by nine and that gives us a correlation of 0.9593. What that tells us, then, is that the correlation between mileage and age is 0.9593.

11:44 What this indicates is a strong linear relationship in a positive direction between the age of a car and its mileage.
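The last step of the calculation, and the whole recipe as a reusable function, might look like the sketch below. The sum 8.634 and n = 10 come from the lecture; the ten individual observations are not in the transcript, so the function is tried on a small hypothetical data set, and `correlation` is a name chosen for this sketch.

```python
from statistics import mean, stdev

# Final step from the lecture: sum of the z-score products over n - 1.
n = 10
sum_of_products = 8.634
r_lecture = sum_of_products / (n - 1)
print(round(r_lecture, 4))  # 0.9593

def correlation(xs, ys):
    """Standardize each x and y, multiply the pairs, add up, divide by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data (NOT the lecture's ten cars): age in years, miles in thousands.
age = [1, 2, 3, 4, 5, 6]
miles = [10, 22, 35, 41, 58, 70]
r_example = correlation(age, miles)
print(f"r for the hypothetical cars = {r_example:.3f}")
```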

11:50 Before we use correlation, we have to make sure that it's an appropriate measure of the strength of the relationship between two variables.

11:57 Our data have to satisfy three conditions in order for correlation to be helpful.

12:02 First of all, we have to make sure that our variables are quantitative.

12:06 Secondly, we have to satisfy the straight enough condition.

12:10 In other words, if we look at a scatterplot, the relationship between the two variables should be roughly linear.

12:16 This is a judgment call but it's usually a fairly easy one to make.

12:20 We also don't want any outliers.

12:23 If we look at a scatterplot and we see a pair of outliers, this can really mess up our correlation.

12:29 This can make a weak linear association appear strong or it can make a strong linear association appear to be fairly weak.

12:37 In general, outliers can really distort what the correlation is and give false impressions about the strength of the linear relationship between two quantitative variables.

12:46 There are certain properties that come with correlation.

12:50 We can learn a lot from it. There are seven key characteristics of a correlation.

12:55 The first, which we've already addressed, is that the sign of the correlation gives the direction of the association between the two variables.

13:03 We also mentioned that correlation is always between -1 and 1. It can be exactly -1 or +1, which would indicate that all of the data fall on a single, perfect, straight line.

13:16 Correlation is a symmetric quantity.

13:20 In other words, the correlation between X and Y is exactly the same as the correlation between Y and X.

13:26 Correlation has no units.

13:29 There are no units attached to correlation; it's just a number.

13:33 Since correlation has no units, it's not affected by a change in the scale or center of either variable.
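The symmetry and the no-units properties can be checked numerically. In this sketch the data are hypothetical, `correlation` is a helper written here, and the unit changes (years to months, thousands of miles to thousands of kilometers plus an arbitrary shift) are illustrative choices.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of z-score products divided by n - 1 (sample standard deviations)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical cars: age in years, mileage in thousands of miles.
age_years = [1, 2, 3, 4, 5, 6]
miles_thousands = [10, 22, 35, 41, 58, 70]

# Symmetry: swapping the roles of X and Y leaves r unchanged.
r_xy = correlation(age_years, miles_thousands)
r_yx = correlation(miles_thousands, age_years)

# Change of scale and center: months instead of years, kilometers instead of
# miles, plus an arbitrary shift of 5. Both scale factors are positive.
age_months = [12 * a for a in age_years]
km_shifted = [1.609 * m + 5 for m in miles_thousands]
r_rescaled = correlation(age_months, km_shifted)

print(f"r(x, y) = {r_xy:.6f}")
print(f"r(y, x) = {r_yx:.6f}")
print(f"r after rescaling = {r_rescaled:.6f}")
```

Standardizing removes the mean and divides out the spread, which is why the rescaled variables produce the same Z scores and therefore the same correlation.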

13:41 Correlation measures the strength of a linear association between two variables.

13:47 Thus, two variables might be strongly associated but may have small correlation if the association is not linear.
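A small constructed case makes this concrete: below, Y is completely determined by X through a curve, yet the correlation is zero. The data and the `correlation` helper are made up for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of z-score products divided by n - 1 (sample standard deviations)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (len(xs) - 1)

# A perfect but curved association: y is exactly x squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = correlation(xs, ys)
print(f"r = {r_curved:.6f}")
```

The downward-sloping left half and the upward-sloping right half cancel, so correlation reports no linear association even though the two variables are strongly related.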

13:55 Correlation is highly sensitive to outliers.

13:59 A single outlier can make a small correlation large or a large correlation quite small.
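Both effects can be reproduced with made-up numbers. In the sketch below, one added point turns a nearly zero correlation into a large one, and another turns a perfect correlation into a weak one. All data are hypothetical and `correlation` is a helper written for this sketch.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of z-score products divided by n - 1 (sample standard deviations)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5, 6]

# Case 1: a patternless cloud, then the same cloud plus one far-away point.
cloud = [3, 1, 4, 2, 3, 2]
r_cloud = correlation(xs, cloud)
r_cloud_outlier = correlation(xs + [20], cloud + [20])

# Case 2: a perfect line, then the same line plus one point far below it.
line = [2, 4, 6, 8, 10, 12]
r_line = correlation(xs, line)
r_line_outlier = correlation(xs + [7], line + [0])

print(f"cloud: {r_cloud:.3f} -> with outlier: {r_cloud_outlier:.3f}")
print(f"line:  {r_line:.3f} -> with outlier: {r_line_outlier:.3f}")
```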

14:04 One thing that we wanna really be careful of is associating correlation and cause.

14:12 These are not the same thing. We ask ourselves: Does a high correlation mean that a change in one variable causes a change in the other? We have a tendency to say that.

14:23 Because two variables are highly correlated or very closely related, it seems to indicate to us that the change in one causes the change in the other.

14:34 This isn't necessarily true.

14:36 For instance, here's an example.

14:39 If we look at house fires, there's a positive correlation between the number of firefighters at the scene and the amount of damage that was done.

14:46 That means that the more firefighters we have, the more damage tends to be done.

14:53 Does that mean that more firefighters cause more damage? Does this mean that because there's more damage with more firefighters we shouldn't call the fire department? Now, I mean, they do cause some; certainly with more firefighters, there's more water and they're chopping more holes. But the size of the fire certainly impacts both the number of firefighters that are called and the amount of damage.

15:15 The size of the fire is what we would call a lurking variable.

15:20 Let's look at what lurking variables are.

15:22 Those are the kinds of variables that hide out in the background and we don't see them.

15:27 A lurking variable is a hidden variable that stands behind the relationship between the two quantitative variables that we're interested in and determines that relationship by simultaneously affecting both of the variables.

15:40 In our example, the size of the fire is a lurking variable.

15:44 This can have huge effects on the relationships between two variables.

15:49 In many cases, the cause of the relationship between two quantitative variables is the product of a lurking variable and not necessarily the explanatory variable affecting the value of the response.

16:02 This is why we can't necessarily say that a high correlation is an indication of a causal relationship between the explanatory variable and the response.
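A small simulation can illustrate the firefighter example. In the sketch below, fire size drives both the number of firefighters and the damage, and the number of firefighters never enters the damage calculation, yet the two end up highly correlated. The model, its coefficients, and the noise levels are all assumptions made up for illustration.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Sum of z-score products divided by n - 1 (sample standard deviations)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical model: the lurking variable (fire size) affects BOTH variables.
fire_size = [rng.uniform(1, 10) for _ in range(200)]
firefighters = [2 * s + rng.gauss(0, 1) for s in fire_size]   # depends on size only
damage = [10 * s + rng.gauss(0, 5) for s in fire_size]        # depends on size only

r_fire = correlation(firefighters, damage)
print(f"r(firefighters, damage) = {r_fire:.2f}")
```

The high correlation here is produced entirely by the shared dependence on fire size, which is the situation the lecture describes.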

16:12 In using correlation, there are a bunch of mistakes that we can often make.

16:17 We wanna be careful of that. First, do not say correlation when you mean association.

16:23 Those aren't the same thing.

16:24 Association means a relationship.

16:27 Correlation is a numerical measure that quantifies the strength of a linear relationship between two variables.

16:33 Correlation quantifies the strength of the linear association where association is a relationship.

16:40 Do not try to find the correlation between two categorical variables.

16:45 It doesn't make any sense to do that. Don't confuse correlation with causation.

16:51 This is one of the biggest issues that we run into with correlation and it's the one that we just addressed.

16:56 Make sure that the association between your variables is a linear relationship or a linear association before you try to use correlation to describe the strength of the relationship.

17:07 Do not assume that the relationship is linear simply because the correlation is high.

17:12 We might have a non-linear relationship and an outlier that makes the correlation high, for instance.

17:17 Also, don't assume that it's non-linear because the correlation is low. Outliers can also cause that problem.

17:24 Outliers can have a significant impact on a correlation and, in fact, they often do.

17:30 That's why it's best to not just rely on the correlation to assess the relationship between two quantitative variables but you also wanna look at the scatterplot to see if there are outliers affecting things.

17:43 These are the common issues when using correlation.

17:46 We wanna be careful to avoid these.

17:48 This is the end of Lecture 6 about scatterplots and correlation and we'll see you next time for Lecture 7.

About the Lecture

The lecture Standardizing Data and the Normal Distribution Part 2 by David Spade, PhD is from the course Statistics Part 1. It contains the following chapters:

Scatterplots and Correlation
Making a Scatterplot
Choosing X and Y
What is Correlation?
Finishing the Calculation
Correlation and Causation

Included Quiz Questions

What information can we gain from a scatterplot?

We can determine the form of a relationship between two quantitative variables.
We can determine the form of a relationship between two categorical variables.
We can determine the correlation between two quantitative variables.
We can determine the strength of a relationship between two categorical variables.
We can determine the slippage between two quantitative variables.

If two quantitative variables are deemed to have a positive relationship, what does that mean?

This means that as the values of the X variable increase, the values of the Y variable increase.
This means that as the value of the X variable increases, the value of the Y variable decreases.
This means that as the value of the X variable decreases, the value of the Y variable increases.
This means that the values of X and Y are all positive.
The values of X and Y are all positive or zero.

Which of the following is NOT true about correlation?

Correlation measures the strength of a linear relationship between two categorical variables.
Correlation measures the strength of a linear relationship between two quantitative variables.
A correlation coefficient takes values between -1 and 1.
Correlation indicates the direction of a linear relationship between two quantitative variables.
A correlation coefficient may take on the value of 0.

Which of the following is a TRUE statement?

Correlation does not have a unit of measurement.
Two quantitative variables with correlation 0.6 have a stronger linear relationship than two quantitative variables with correlation -0.6.
If two quantitative variables are highly correlated, it can be concluded that changing the value of the explanatory variable causes the change in the response variable.
Outliers have little effect on the correlation.
Two quantitative variables with correlation 0.6 have a stronger linear relationship than two quantitative variables with correlation -0.8.

Under which of the following situations is correlation an appropriate measure of the strength of the relationship between two variables?

Correlation is appropriate when measuring the strength of a relationship between two quantitative variables that appear to be linearly related and have no outliers present.
Correlation is appropriate when measuring the strength of the relationship between two categorical variables.
Correlation is appropriate when measuring the strength of a relationship between two quantitative variables that appear to be linearly related and have several outliers present.
Correlation is appropriate for measuring the strength of the relationship between two quantitative variables when the relationship appears nonlinear.
Correlation is appropriate when measuring the strength of a relationship between two quantitative variables that appear to have a logarithmic relationship.

Which of the following numbers is likely to represent the correlation coefficient of two variables having a strong positive association?

0.89
-1.3
-0.89
0
1.5

Which of the following numbers is likely to represent the correlation coefficient of two variables having no association?

0
-1.3
-0.89
0.89
1.5

Which of the following numbers is likely to represent the correlation coefficient of two variables whose scatterplot is downward sloping?

-0.89
-1.3
0
0.89
1.5

Which of the following numbers is likely to represent the correlation coefficient of two variables whose scatterplot is upward sloping?

0.45
-1.1
-0.45
0
1.7

What is one characteristic to evaluate in a scatterplot?

Direction
Liquidity
Height
Width
Z-axis
