Welcome to Lecture 3, where we're gonna discuss Summarizing Quantitative Variables.
We often use graphical displays to summarize the distribution of a quantitative variable,
but first, we need to recall what a quantitative variable is.
So remember that a quantitative variable is a variable whose values have some numerical meaning.
And we have several different types of graphs and charts that we use to display the distribution of a quantitative variable.
We use histograms, stem-and-leaf plots, box plots,
and now we're gonna look at some of these types of graphs.
The histogram gives us a good idea of what the shape of the distribution of a quantitative variable is.
So when we make a histogram, it's basically the quantitative counterpart to the categorical bar chart.
What we do is we slice up all the possible values, the whole range of possible values for a quantitative variable
into equal-width intervals, which we call bins.
We count the number of observations that fall into each bin.
And we make a bar whose height corresponds to the number of observations that fall into that bin.
And once we've done this, we've created what we call a histogram.
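As a rough sketch of those binning steps, here is one way we might count observations per bin in Python. The function name `histogram_counts`, its parameters, and the data values are all made up for illustration; they are not the data set from the slides.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each equal-width bin.

    Bin i covers start + i*width up to (but not including)
    start + (i+1)*width; the last bin also includes its right edge
    so that the maximum is not dropped.
    """
    counts = [0] * n_bins
    end = start + width * n_bins
    for v in values:
        if v < start or v > end:
            raise ValueError(f"{v} is outside the binned range")
        i = min(int((v - start) // width), n_bins - 1)
        counts[i] += 1
    return counts


# Hypothetical measurements, not the data set from the lecture slides.
data = [12, 17, 18, 21, 22, 22, 23, 24, 26, 29]
counts = histogram_counts(data, start=10, width=5, n_bins=4)

# Print a sideways text histogram: one row per bin, one '#' per observation.
for i, c in enumerate(counts):
    left = 10 + 5 * i
    print(f"{left}-{left + 5}: " + "#" * c)
```

Each row of the printout plays the role of one bar, and the rows sit directly next to each other, just like the bars of a histogram.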
So here's an example just for a fake data set.
Suppose we have the following measurements for a quantitative variable.
And there they are, in that first bullet. What we've done on the right-hand side of the screen,
is we've made a histogram of these data values.
So what does the histogram tell us?
Well, the histogram tells us that we have one value that lies between 10 and 14.
We have 6 values that are between 16 and 20, 14 that are between 21 and 25, and 4 that fall between 26 and 30.
And looking at the data, if we just try to check that out for ourselves, the data would verify that.
One important thing to notice that's different between a bar chart and a histogram,
is that there are no gaps between the bars of a histogram.
Sometimes we have quantitative data that can take any value in a particular range, say 25 to 30.
And so, we need to be able to include values such as 25.99 and 29.9745, for example.
So we have to have no gaps between the bars.
We can also represent quantitative variables using a relative frequency histogram.
And this is the same thing as a histogram, except now we're displaying counts as percentages of the whole.
This is the same idea we used before with the relative frequency bar chart.
So here's the relative frequency histogram for the data set that we just looked at,
and note that it looks the same except the values
on the Y-axis are all percentages or proportions, and not frequencies.
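To make the conversion concrete, here is a small sketch that turns the bin counts we read off the histogram (1, 6, 14, and 4) into proportions. The function name `relative_frequencies` is just an illustrative choice.

```python
def relative_frequencies(counts):
    """Convert bin counts into proportions of the whole sample."""
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one observation")
    return [c / total for c in counts]


# Bin counts as described for the lecture's histogram.
counts = [1, 6, 14, 4]
proportions = relative_frequencies(counts)

for c, p in zip(counts, proportions):
    print(f"count {c:>2} -> {p:.0%}")
```

The bars keep the same relative heights; only the labels on the vertical axis change, and the proportions add up to 1.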
Now, we look at stem-and-leaf plots, which are a lot like histograms, but they show all of the data values.
So how do histograms and stem-and-leaf plots differ?
Well, histograms summarize our quantitative data well, but they don't show the actual values of the data.
Stem-and-leaf plots do show the individual values.
So what we do is we put the first digit, the stem, on the left-hand side,
we draw a vertical line, and then we put the second digit, the leaf, on the right-hand side of the line.
So here's an example for our data. We have 1, 1, 2, 2 down the left side, which represent 10, 10, 20, 20.
And then we have the ones digits on the right-hand side of the line.
So 1 line 2 is 12, 1 line 6 is 16, 1 line 7 is 17, and so forth.
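Here is a minimal sketch of that construction for two-digit whole numbers. It assumes one row per stem, whereas the plot on the slide splits each stem across two rows. The function name and the data values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return text rows like '1 | 2 6 7' for non-negative whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit is the stem, ones digit the leaf
        leaves[stem].append(leaf)

    rows = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


# Hypothetical values, not the full data set from the slides.
for row in stem_and_leaf([12, 16, 17, 21, 22, 22, 25, 29]):
    print(row)
```

Reading a row back, the stem 1 with leaves 2, 6, and 7 stands for 12, 16, and 17.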
How are stem-and-leaf plots similar to histograms?
Well, if we look at the stem-and-leaf plot, and we rotate it 90 degrees to the left,
what we would get is something that looks a lot like our histogram.
Stem-and-leaf plots have two big advantages over histograms.
First off, they're easier to make by hand.
And second, they show the individual values while still showing the distribution of a quantitative variable.
But one problem that we run into in stem-and-leaf plots is that
if we have a large number of observations, the creation of a stem-and-leaf plot can take a lot of time.
Often, when we look at quantitative variables,
we look at certain characteristics of their distribution.
The first thing that we typically look for is the shape of the distribution.
For instance, is it symmetric?
And what that means is, if you made a histogram, could you fold the histogram
along the vertical line through the middle,
and have the sides match pretty closely? Is it skewed to the right?
In other words, is there a large cluster of data on the left-hand side with a long tail going out to the right?
Is it skewed left, where there's a large cluster of data out to the right,
and then there's a tail that comes out to the left?
When we talk about shape, we also look at the number of modes.
We ask how many peaks are in the histogram.
So if there's just one peak, we say that it's unimodal.
If there are two peaks, or two modes, we call it bimodal.
And then we get a little lazy after that, and if there are more than two modes, we just call it multimodal.
Alright, so let's look at the previous example, where we made the histogram,
and try to describe the shape of the distribution based on the new information that we have.
Is it symmetric? Well, it doesn't look like it is. In fact,
there appears to be a lot of data out to the right-hand side with a thin tail to the left.
So this distribution would be skewed to the left.
How many modes do we have?
Well, it appears that there is one peak, and it's in the bar between 20 and 25,
so the distribution does appear to be unimodal.
Here's just a picture of what a right-skewed distribution looks like.
Note that we have a whole bunch of data on the left-hand side,
and then a thin tail going out to the right.
Here's an example of what a symmetric distribution looks like.
We could draw a line right down the middle of the bar between 6 and 8,
and then fold it in half, and the sides would match exactly, so this is a symmetric distribution.
And here's what a bimodal distribution looks like.
It looks like we have one peak to the left and one peak to the right, and then a little valley in the middle.
So we have two modes there. And we would say that this distribution is bimodal.
So the second thing that we look for when we try to characterize the quantitative distribution,
is where it's centered. Where is this distribution located?
And one measure that we typically use just to answer this question is the median.
So how do we find the median?
Well, in order to find the median, we sort all of our data from lowest to highest,
and then we pick out the value that falls in the middle.
If we have an odd number of observations, then we just pick the one in the middle.
If we have a total of n observations, then we just take observation number (n + 1)/2 once they're all sorted.
If our sample size is even, then we just take the average of the two values in the middle.
Those are observation number n/2 and observation number n/2 + 1.
We take those two observations and average them together,
and that gives us our median when we have an even number of observations.
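The odd and even cases can be sketched in a few lines of Python. This is just one way to write it, with made-up example values; note that Python lists count from 0, so the positions shift by one compared with how we count observations.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)  # always sort first
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: observation number (n + 1)/2, which is index n // 2 counting from 0.
        return ordered[mid]
    # Even n: average observations number n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))     # odd sample size
print(median([4, 1, 3, 2]))  # even sample size
```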
So let's look at an example, where we use the same data set that we used to create our histogram.
We have 25 observations, so it's odd, and we just pick out the one in the middle.
So the one that we look at, once we sort everything, is the 13th observation, since 25 plus 1, over 2, is 13.
Our 13th ordered observation is 22. So the median of our data set is 22.
The third thing that we look for is the spread.
How spread out are the values in our sample? In other words, how much do our observations vary?
One way to do this is to look at the range
where we take the largest observation minus the smallest observation.
However, there's a problem with doing this, because if we have extreme values,
then we can have a really big range even though most of our data aren't actually spread out that far.
For instance, in our original data set, if we entered 299 instead of 29,
then our range would go from 17, which is 29 minus 12,
all the way up to 287, which is 299 minus 12. So the range is highly sensitive to extreme values.
So it's not necessarily the best measure of spread to use.
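Here is a quick sketch of that sensitivity. The data values are hypothetical; they are chosen only so that the minimum is 12 and the maximum is 29, matching the numbers in the example.

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


# Hypothetical data with minimum 12 and maximum 29.
data = [12, 17, 18, 21, 22, 22, 23, 24, 26, 29]

# The same data with the maximum mistyped as 299.
typo = [299 if v == 29 else v for v in data]

print(value_range(data))  # 29 - 12
print(value_range(typo))  # 299 - 12
```

One mistyped value changes the range from 17 to 287, even though the other nine values haven't moved at all.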
So we can improve on that, instead of looking at the extreme values of our sample,
we take the ones in the middle, and we look at the range between those middle values.
So what do we mean by that? Well, we divide the ordered data in half at the median,
and then we look at the median of the lower half and the median of the upper half.
And in finding the median for each half, we don't include the overall median.
So then, what we have is 1/4 of the data lying below the median of the lower half, which we call the lower quartile.
We have 1/4 of the data lying above the median of the upper half, which we call the upper quartile.
And that means that half of the data fall between those two values.
So when we define the interquartile range, what we mean is the upper quartile minus the lower quartile.
So it's basically the range of the middle 50% of our data.
So let's look at an example using the data set that we had from before.
If we look at the lower half of our data set, excluding the median, we have 12 observations.
So we look at the average of the 6th and 7th values.
So our lower quartile is 1/2 times the sum of 19 and 20, or 19.5.
If we look at the upper half of our data, again excluding the median,
then we also have 12 observations there, so the median of that half
is, again, the average of the 6th and 7th values.
And when we average those two together, we get an upper quartile of 25.
So our interquartile range then, is 25 minus 19.5, or 5.5.
Now note that if we had mis-entered the maximum value, say 299 instead of 29,
it wouldn't affect the interquartile range.
So this is a lot less sensitive to extreme values than what the range is.
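The quartile recipe we just used (split the sorted data at the median, leave the overall median out when the sample size is odd, then take the median of each half) can be sketched like this. Other textbooks and software use slightly different quartile rules, so treat this as one convention. The function names and data values are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Lower and upper quartiles as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the overall median when forming the upper half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)


def iqr(values):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical data (11 values), not the data set from the slides.
data = [12, 17, 18, 20, 21, 22, 23, 24, 25, 26, 29]
typo = [299 if v == 29 else v for v in data]  # maximum mistyped as 299

print(quartiles(data), iqr(data))
print(quartiles(typo), iqr(typo))
```

Both lines print the same quartiles and the same interquartile range, because the mistyped maximum never reaches the middle of either half.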
We use all the values that we've just found to calculate what's known as a five number summary,
and this is basically a concise description of the spread and center.
So we summarize our data set with five numbers.
The minimum or the lowest observation, the lower quartile or Q1,
the median, the upper quartile or Q3, and the maximum.
And these values are what's used to make what we call a box plot of our data set.
So this is another way to summarize data graphically.
And so we display the five-number summary with a box plot.
And here are the steps to do it.
Okay, so we draw a single vertical axis that spans the entire range of the data.
And then we draw horizontal lines at the lower and upper quartiles.
And then we draw vertical lines connecting those horizontal lines to make a box.
So there's a box whose height is exactly the interquartile range.
Inside that box, we draw a line to represent the median,
and then we make fences around the data.
Our upper fence is the upper quartile plus 1.5 times the interquartile range.
And the lower fence is the lower quartile minus 1.5 times the interquartile range.
But for now, we're not gonna include those fences in our boxplot, we'll talk about that a little bit later.
So once we have the box, we grow whiskers out of it, one going up and one going down.
So one whisker connects the upper edge of the box to the maximum value,
the other whisker connects the lower edge of the box to the minimum value.
If a data point falls below the lower fence that we calculated,
then the lower whisker runs from the lower quartile to a horizontal line that marks the lower fence.
And any points that fall below that are marked with dots.
If a data point falls above the upper fence, then we connect the upper whisker
from the upper quartile to a horizontal line that marks the upper fence,
and any points above that upper fence are also marked with dots.
So here's an example of a boxplot. We first give the five-number summary.
The minimum is 12, Q1, we found was 19.5. We found the median to be 22,
the third quartile to be 25, and the maximum to be 29.
The interquartile range is 5.5, we found that earlier.
And so the lower fence is 19.5 minus 1.5 times 5.5, which is 11.25.
So we don't have any data points below our lower fence.
For our upper fence, we get 25 plus 1.5 times 5.5, which is 33.25, and we don't have any values that fall above that upper fence.
So we don't have any extreme values that we need to mark with dots.
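The fence arithmetic can be checked with a short sketch. The quartiles 19.5 and 25, the minimum 12, and the maximum 29 come from the example; the function names and the extra value 40 are made up to show what a flagged point would look like.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 interquartile ranges beyond each quartile."""
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread


def outside_fences(values, q1, q3):
    """Values that would be drawn as individual dots on the box plot."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


lower, upper = fences(19.5, 25)
print(lower, upper)  # 11.25 33.25

# The example's minimum and maximum both sit inside the fences.
print(outside_fences([12, 29], 19.5, 25))

# A hypothetical extra observation of 40 would be flagged.
print(outside_fences([12, 29, 40], 19.5, 25))
```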
All right. Now sometimes, we have symmetric distributions,
and we can summarize those a little bit differently.
Now, the median is going to work well to summarize the center of a distribution,
regardless of the shape. But we have a couple of other measures with some nice properties
that we can use to summarize symmetric distributions.
If the distribution is symmetric, we can summarize the center using the mean,
or the average value of our observations.
Suppose our data values are Y1 through Yn, then the mean is given
by just adding up all of our data values, and dividing by the total sample size.
And this gives an idea of basically where the histogram would balance,
where there would be equal weight on both sides of the mean.
We don't wanna use the mean to summarize skewed distributions,
because the mean is really sensitive to extreme values, and to skewness.
The tail is gonna pull the mean in its direction.
We can also summarize the spread differently for a symmetric distribution.
And we summarize this using what's called the standard deviation.
So we start by finding what's called the variance.
And to find the variance, what we do is we take each observation, subtract the mean,
and square that difference. And once we do that we add them all up,
and we divide that sum by our sample size minus 1.
This gives us what's known as the variance of our sample.
And then more commonly, we use the standard deviation to summarize spread,
and we get the standard deviation simply by taking the square root of the variance that we just found.
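Written out as a sketch in Python, the mean, the sample variance, and the standard deviation follow those steps directly. The function names and the data values are hypothetical.

```python
from math import sqrt


def mean(values):
    """Add up all the data values and divide by the sample size."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    m = mean(values)
    return sum((y - m) ** 2 for y in values) / (len(values) - 1)


def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(values))


# Hypothetical data, not the data set from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))
print(sample_variance(data))
print(sample_sd(data))
```

For these values the mean is 5 and the squared differences add up to 32, so the variance is 32 divided by 7.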
So for example, we use the data set that we've already had, and we calculate the mean.
We take 1/25, which is our sample size, times the sum of all of our observations, and we get 22 as our mean.
To get our standard deviation, we take each observation minus 22 and square it, add those up,
divide by the sample size minus 1, and take the square root.
So we get the standard deviation of 3.9791,
and this basically represents the average distance of an observation from the mean.
And for symmetric distributions this provides a useful numerical summary
of the center and spread of the data.
So let's summarize how we describe quantitative variables.
The important things to remember are first, when describing a quantitative variable,
you wanna be sure to include a description of the shape, the center,
and the spread of the distribution of that variable.
For symmetric distributions, you can use the mean
and the standard deviation as your measures of center and spread.
But if your distribution is not symmetric, then you wanna go with the median,
and the interquartile range to summarize the center and spread.
If the distribution is skewed to the right,
then the mean will be larger than the median,
because, you wanna remember, the mean chases the tail,
so whichever direction the tail goes, that's the direction the mean is gonna go.
If the distribution is skewed left then, the mean will be smaller than the median,
because the tail is gonna pull the mean down towards it.
And in a symmetric distribution, the mean and the median should be pretty close to equal.
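A tiny sketch makes the "mean chases the tail" idea concrete. The three data sets are made up to be right-skewed, left-skewed, and symmetric, and the helper name `mean_vs_median` is just for illustration.

```python
from statistics import mean, median


def mean_vs_median(values):
    """Report which of the mean and median is larger."""
    m, med = mean(values), median(values)
    if m > med:
        return "mean > median"
    if m < med:
        return "mean < median"
    return "mean == median"


# Hypothetical data sets chosen to show each shape.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long tail to the right
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]  # long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, values in [("right-skewed", right_skewed),
                     ("left-skewed", left_skewed),
                     ("symmetric", symmetric)]:
    print(name, mean(values), median(values), mean_vs_median(values))
```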
So let's think about some common issues to avoid.
Some things that can go wrong, and some common mistakes
that are made in the description of the distribution of a quantitative variable.
We don't wanna make histograms of categorical variables.
We wanna remember to sort the data before we find the median and the quartiles.
We wanna remember three rules of summarizing quantitative data.
And these three rules are, one, make a picture, two, make a picture, and three, make a picture.
Graphs give us a lot of information,
a lot of useful information in one place about the distribution of a quantitative variable.
It's a useful way to get some sense of how the quantitative variable is distributed in the sample.
We wanna make sure that when we summarize center and spread,
we do it in the proper way, and by the proper way, we mean in a way that fits what we know about the shape.
So we need to know the shape, whether it's skewed, and if it is skewed,
remember to use the median and the interquartile range.
If it's symmetric, then remember to use the mean and the standard deviation
to summarize the center and the spread.
And those are the common issues that we run into
when we describe the distributions of quantitative variables.
And that's the end of Lecture 3, and we'll see you back here for Lecture 4.