Welcome to Lecture 4, where we're gonna discuss comparing the distributions of two different quantitative variables.
The big picture here is: how do we describe these distributions?
When we talk about what the distribution looks like,
we need to use two key components described in the previous chapter.
Graphical displays, to get an idea of the shape.
Summary statistics, to get an idea of where the distribution is centered and how spread out it is.
For example, if we were measuring the average wind speed for each day throughout the year,
we would describe the distribution of wind speeds by saying,
"Here's a histogram, a 5-number summary, and a boxplot."
If the wind speeds are unimodal and skewed right, we would report the median and the interquartile range.
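As a quick sketch of those numerical summaries, here's how the five-number summary and IQR can be computed. The wind speeds below are made up to mimic a right-skewed sample; they're not from any real data set.

```python
import numpy as np

# Hypothetical daily average wind speeds (mph); these values are made up
# to mimic a right-skewed sample, not taken from any real data set.
speeds = np.array([4, 5, 5, 6, 7, 7, 8, 9, 10, 12, 15, 22, 30], dtype=float)

# Five-number summary: minimum, Q1, median, Q3, maximum.
q_min, q1, med, q3, q_max = np.percentile(speeds, [0, 25, 50, 75, 100])
iqr = q3 - q1  # interquartile range: the spread of the middle 50% of the data

print([q_min, q1, med, q3, q_max])  # five-number summary
print(iqr)
```

Because this sample is skewed right, the median sits below the mean, which gets pulled up by the long tail; that's why the median and IQR are the summaries to report here.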
We can compare the distributions of multiple groups by looking at histograms of each group.
And what can this tell us?
Well, from this, we can learn how the shapes of the distributions differ.
We can learn whether one appears to be more spread out than the other.
And we can also learn whether one distribution appears to be centered in a different place than the other.
For instance, we might ask, "Is it windier in the summer than it is in the winter?"
And this question can be approached by looking at histograms of the wind speeds in the summer,
and histograms of the wind speed in the winter.
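A minimal way to set up that comparison, using hypothetical summer and winter wind speeds, is to bin both groups with the same bin edges so the two histograms can be compared bin by bin:

```python
import numpy as np

# Hypothetical summer and winter daily wind speeds (mph), for illustration only.
summer = np.array([3, 4, 4, 5, 5, 6, 6, 7, 8, 9], dtype=float)
winter = np.array([6, 7, 8, 8, 9, 10, 11, 12, 14, 18], dtype=float)

# Use one common set of bin edges, so the two histograms share a scale
# and can be compared bin by bin.
edges = np.histogram_bin_edges(np.concatenate([summer, winter]), bins=5)
summer_counts, _ = np.histogram(summer, bins=edges)
winter_counts, _ = np.histogram(winter, bins=edges)

print(edges)          # shared bin edges
print(summer_counts)  # counts per bin for summer
print(winter_counts)  # counts per bin for winter
```

In this made-up sample the winter counts fall in higher bins, which is exactly the kind of difference in center that side-by-side histograms make visible.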
All right, so let's look at an example dealing with building a nest egg.
And we're gonna look at two different regions of the country.
The nest egg index is a measure of savings for retirement for each of the 50 states.
And the states were divided into two regions: the South and West, and the Northeast and Midwest.
And the question is, "How do the Nest Egg indices compare between the two regions?"
So we're gonna look at histograms for the nest egg indices between the two groups,
and we'll compare the shape, center, and spread for each of the groups.
So here's a histogram for the nest egg index for the South and West region,
and we see that it's fairly symmetric and unimodal, and centered right around 97.
Here's a histogram of the nest egg index for the Northeast and Midwest region,
and we see that we have a bimodal distribution with a strong left skew,
and it appears to be centered around a hundred.
Here's the comparison of the two distributions.
For the South and West, as we saw, it's slightly skewed to the right and unimodal.
But for the Northeast and Midwest, we have a strong left skew and a bimodal distribution.
So that's what the histogram tells us.
We saw that the South and West distribution appears to be centered around 97,
and that the Northeast and Midwest distribution appears to be centered around 100 or 105.
We also saw that there appears to be more spread in the Northeast and Midwest region,
than there was in the South and the West region.
So let's compare these two distributions using numerical summaries.
So we can assess the center and the spread of these distributions using summary statistics: we compare the measures of center and spread for each group.
Both distributions are skewed, so it's best to report the median and interquartile range for each.
For the South and the West region, we found that the median is 97.
The median for the Northeast and the Midwest region is 103.
For the South and the West region, the interquartile range is 6.5,
whereas for the Northeast and Midwest region, the interquartile range is 8.5.
So it turns out that our interpretations of the histograms were pretty accurate,
because the Northeast and Midwest region is more spread out than the South and West distribution.
And the centers of the two distributions are about where we thought they were based on the histograms.
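Here's a small sketch of that group-by-group comparison. The numbers below are made up for illustration; they are not the actual nest egg data.

```python
import numpy as np

def median_iqr(x):
    """Return (median, IQR), the center/spread pair suited to skewed data."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return med, q3 - q1

# Made-up index values for two hypothetical regions (not the real nest egg data).
south_west = np.array([90, 93, 95, 96, 97, 98, 99, 101, 104], dtype=float)
northeast_midwest = np.array([88, 95, 100, 102, 103, 105, 107, 110, 112], dtype=float)

print(median_iqr(south_west))         # center and spread for one group
print(median_iqr(northeast_midwest))  # center and spread for the other
```

Computing the same two numbers for each group, on the same scale, is what makes the comparison of centers and spreads meaningful.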
We can also compare these groups using boxplots.
We put these boxplots side-by-side, and sometimes this gives a better picture of the differences between the two distributions.
It can give some sense of how the shapes differ, in terms of symmetry and skewness.
But one problem with boxplots is that they hide modes,
so we won't be able to see from the boxplot that the Northeast and Midwest distribution is bimodal.
Still, the boxplot is extremely useful as a way to compare center and spread between the two distributions.
So we place them side-by-side in the same plot.
So here's the example of the side-by-side boxplots.
We have, for the South and West region, that the center is, again, around 97,
and that there's less spread in the South and the West region than there is in the Northeast and the Midwest region.
And these things can be easily ascertained from looking at a boxplot.
The boxplots show two of the key features that we got from the histograms:
the differences in the centers, and the differences in the spreads.
But we lose some of the information that we got from the histogram.
We can't see the two peaks in the Northeast and Midwest region.
But we can still see the left skew.
So the big picture here is that boxplots are great for determining whether there are differences in center and spread,
but not so reliable in terms of determining shapes.
So when you're comparing distributions, it's best to look at both side-by-side histograms, and side-by-side boxplots.
What about outliers?
What are they, and how do they affect our description of the data?
When we talk about outliers, what we're talking about are observations
that are far away from the other values in the data set.
Basically, extreme values.
And there are several ways these can occur: they can occur due to a data entry error, or other misreporting of the data values.
But they might also be important, they might highlight exceptional cases.
Like in the wind speed example: maybe on one day there's a tornado, and the wind speed is very high.
Well, that's important; that's not a data entry error.
That's important information, so that observation would highlight an exceptional case.
Boxplots help us see outliers: those are the dots plotted outside the fences when we have extreme observations.
Histograms can often show outliers too, usually as bars separated from the rest of the data by gaps in the histogram.
So here's an example of the South and West nest egg data with an outlier added.
We added an extra observation with a nest egg index of 150.
And so here's the original boxplot of the South and West region,
and then there's that dot way up at the 150 to symbolize the outlier.
Now, if you look at the histogram with the outlier, we see that there's that one observation way out on the right side,
far away from everything else in our data set.
So from the histogram, we can clearly see that this is an outlier.
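The fences used to flag those dots follow the usual boxplot convention: anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged. Here's a sketch with illustrative values (including an added 150, echoing the example above; not the actual nest egg data):

```python
import numpy as np

# Illustrative values with one suspicious observation (150) added,
# echoing the example above; these are not the actual nest egg data.
data = np.array([90, 93, 95, 96, 97, 98, 99, 101, 104, 150], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # points below this are flagged as low outliers
upper_fence = q3 + 1.5 * iqr  # points above this are flagged as high outliers

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)
```

Only the 150 falls outside the fences here, which matches the dot we see above the whisker in the boxplot.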
Sometimes, we wanna make our data more symmetric
so that we can use the mean and the standard deviation as our measures of center and spread,
as opposed to the median and the IQR.
And one way to do that is to transform our data set.
If the data are skewed, it can be difficult to summarize our distribution using center and spread.
And it can also be difficult to determine whether the most extreme values are outliers,
or if they're just part of the stretched out tail.
So transforming data is one way to get around this problem.
So suppose our data are X1 through Xn, and their distribution is strongly skewed.
One way to make the distribution a little more symmetric is to use a logarithmic transformation,
so that the values we work with are the natural logs of the data values we actually observed.
So again, here's the histogram of the South and West data as we had them before.
And now we're gonna look at the histogram of the natural log of the South and West nest egg indices,
and we see that we have a little more symmetry here than we had before.
What happened to the histograms?
We do have more symmetry after we do the transformation,
but transforming the data will not always fix the lack of symmetry.
If the data are extremely skewed, then the log transformation is unlikely to be very helpful.
But if you have a slight skew, then the log transformation can be useful in fixing that problem.
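As a rough illustration of this effect, with made-up right-skewed values, one simple check is the gap between the mean and the median, which shrinks after taking logs:

```python
import numpy as np

# Made-up, strongly right-skewed values (think: a few extremely windy days).
x = np.array([1, 2, 2, 3, 3, 4, 5, 8, 20, 55], dtype=float)

def mean_minus_median(v):
    # For right-skewed data the mean is pulled above the median,
    # so this gap is one rough indicator of right skew.
    return np.mean(v) - np.median(v)

log_x = np.log(x)  # natural log transformation

print(mean_minus_median(x))      # large gap: strong right skew
print(mean_minus_median(log_x))  # much smaller gap after the transformation
```

The gap doesn't vanish entirely; the log pulls in the long right tail but doesn't guarantee symmetry, which is the same caution made above.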
We have another common transformation, which is the square root transformation.
And so instead of working with our original data set,
what we're gonna work with is the square roots of our original observations.
What about equalizing the spread between two groups?
Maybe we don't want data where one group is more spread out than the other.
So there's a whole bunch of statistical procedures for comparing two groups that rely on the assumption
that the variance of the first group is the same as the variance of the second group.
And this is one reason that we might wanna equalize the spread between the two groups.
The logarithmic transformation can often also be used to equalize the spread between the two groups
in order to use the procedures that rely on that assumption.
Other common transformations that we use for this
are the square root transformation and the inverse sine transformation.
But the inverse sine transformation is far less common than the square root,
because they both kinda do the same thing in terms of equalizing variance.
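To see why the log transformation can equalize spread, suppose one hypothetical group is simply ten times the other. On the log scale, that multiplicative difference becomes an additive shift, which leaves the spread unchanged:

```python
import numpy as np

def iqr(v):
    q1, q3 = np.percentile(v, [25, 75])
    return q3 - q1

# Two hypothetical groups: b sits higher and is ten times as spread out as a.
a = np.array([10, 11, 12, 13, 14, 15], dtype=float)
b = a * 10  # same relative spread, ten times the absolute spread

# On the original scale the spreads differ by a factor of ten;
# after taking logs, multiplying by 10 just shifts every value by log(10),
# so the two spreads match.
print(iqr(a), iqr(b))
print(iqr(np.log(a)), iqr(np.log(b)))
```

Real groups won't differ by an exact multiple like this, which is why the log only roughly equalizes the spreads, as in the boxplot that follows.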
Okay, so here's a boxplot of log transformed data for the nest egg indices.
We have the South and West region on the left-hand side.
The Northeast and the Midwest region on the right-hand side.
And what do we see?
We see that the spread in the boxes is a lot more similar than what it was before.
The Northeast and Midwest region is still just a little bit more spread out
than the South and West, but that difference is not as great as it was before.
All right, so when we compare distributions, we have several issues that we can run into.
One of the biggest problems that we run into is using inconsistent scales.
We always wanna compare our distributions for variables on the same scale.
In other words, don't transform one group, and not the other.
We wanna make sure that our plots are clearly labeled:
if we don't know what information a plot is supposed to convey, then we might as well not have a plot at all.
Labeling the plot clearly fixes that problem.
This is very important: beware of outliers.
If the data have outliers and you can correct them, then do it.
If they're clearly errors, then remove them.
Otherwise, summarize the data twice: once with the outliers included, and once with them removed.
And that's the end of Lecture 4, Comparing Distributions. And we'll see you back here for Lecture 5.