Hello, welcome to Statistics 2:
Statistical Inference and Data Analysis.
I'm David Spade and I'll be instructing this course.
We're gonna begin with sampling
distributions for proportions and means.
So what are sampling distributions?
Well, let's start with the basics.
For each sample we'll draw,
we'll calculate some statistic.
Each of these statistics is itself a random
variable and, as such, has a probability distribution.
This comes from the fact that the statistic
calculated from different samples differs
and we have to be able to model this variability.
This variation is termed sampling
error or sampling variability.
We'll discuss two types of sampling
distributions in this lecture.
The sampling distribution of a proportion
and the sampling distribution of a mean.
So let's start with the sampling
distribution of a proportion.
Suppose we survey every possible sample of 1007 US adults
and ask them whether or not they believe in evolution.
What can happen?
On one survey, 43% might say
they believe in evolution.
While on another, we might have 47% and on
another, we might have 42% and so forth.
Each sample is going to produce
a different sample proportion.
So now the question becomes,
how can we describe the variation
from sample to sample?
We can make use of what's called the
sampling distribution of the proportion.
And this is designed to model what we would see,
if we could actually see the proportions
of all these samples of the same size.
The sampling distribution shows how the
proportion varies from sample to sample.
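To see that sample-to-sample variation in action, here is a minimal simulation sketch in Python. It assumes, purely for illustration, that the true population proportion is 0.45; the lecture does not state the true value for this survey. Each simulated survey asks 1007 independent respondents.

```python
import random

# Assumption for illustration only: the true population proportion is 0.45.
TRUE_P = 0.45
SAMPLE_SIZE = 1007  # survey size used in the lecture

def sample_proportion(p, n, rng):
    """Simulate one survey of n independent respondents and return p hat."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n

rng = random.Random(1)  # fixed seed so the run is repeatable
proportions = [sample_proportion(TRUE_P, SAMPLE_SIZE, rng) for _ in range(5)]
for i, p_hat in enumerate(proportions, start=1):
    print(f"survey {i}: p hat = {p_hat:.3f}")
```

Each survey lands near 0.45 but not exactly on it, and that spread is what the sampling distribution describes.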
We can use the normal model to describe
the distribution of a sample proportion.
How do we do it?
Well in any situation in which a normal model is
useful, the key pieces of information are the mean
and the standard deviation of the
variable that's being modeled.
Here, let's let p hat denote the sample proportion and
let p be the actual proportion in the population.
In other words, p hat is the relative
frequency of successes in our sample,
a group of independent Bernoulli trials,
and p is the relative
frequency of success in the population.
So let's let n denote the sample size.
Then what we can find is that the expected
value of the sample proportion is just p
and the standard deviation of the sample proportion
is the square root of p times 1 minus p over n.
Knowing this information, we can say that as
long as the sample size is reasonably large,
the sample proportion follows an approximately
normal distribution with mean p
and standard deviation square root of p
times 1 minus p divided by the sample size.
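The two formulas just stated can be sketched as a small Python helper. The function name and the inputs p = 0.5 and n = 1007 are hypothetical, chosen only to show the calculation.

```python
import math

def proportion_sampling_model(p, n):
    """Return (mean, standard deviation) of the sample proportion p hat."""
    mean = p                          # expected value of p hat is p
    sd = math.sqrt(p * (1 - p) / n)   # square root of p(1 - p) / n
    return mean, sd

# Hypothetical inputs: p = 0.5 and n = 1007 are chosen only for illustration.
mean, sd = proportion_sampling_model(0.5, 1007)
print(f"mean = {mean}, sd = {sd:.4f}")
```

With a sample of about a thousand people, the standard deviation of p hat comes out to roughly one and a half percentage points.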
When is it appropriate to use the normal model
for the sampling distribution of the proportion?
Well, we need to make sure that we
have three conditions satisfied
in order to use the normal
model for sample proportions.
The first condition is the independence assumption.
That states that the individuals in the
sample must be independent of each other.
Two, the 10% condition.
We need the sample size to be less
than 10% of the population size.
And three, the success/failure
condition, we've seen this before.
Just like in the normal approximation
to the binomial distribution,
we need n times p to be at least ten and
n times one minus p to be at least ten.
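The three conditions can be written down as a quick check. This is only a sketch: the function name is hypothetical, the population size of 1,000,000 is made up for illustration, and independence is something you judge from how the data were collected, so it is passed in as a flag rather than computed.

```python
def normal_model_ok_for_proportion(p, n, population_size, independent=True):
    """Check the three conditions from the lecture for using the normal model.

    Independence cannot be verified from numbers alone, so the caller
    states it through the `independent` flag (a judgment about the design).
    """
    ten_percent = n < 0.10 * population_size             # 10% condition
    success_failure = n * p >= 10 and n * (1 - p) >= 10  # success/failure
    return independent and ten_percent and success_failure

# Hypothetical population size of 1,000,000, used only for illustration.
ok_typical = normal_model_ok_for_proportion(p=0.45, n=100, population_size=1_000_000)
ok_rare = normal_model_ok_for_proportion(p=0.02, n=100, population_size=1_000_000)
print(ok_typical, ok_rare)
```

The second call fails the success/failure condition, because n times p is only 2.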
So let's look at an example.
Suppose we know that 45% of the
population believes in evolution.
We randomly sampled 100 people and asked
them if they believe in evolution.
What is the probability that at least 47%
of the respondents in this sample
say that they believe in evolution?
First of all, we need to ask ourselves:
Is the normal model appropriate?
Well, the respondents are randomly sampled, so
it seems reasonable to assume independence.
Two, the 10% condition.
100 is much less than 10%
of the total population.
And three, the success/failure condition.
We have n times p is one hundred times 0.45 or
45 which is bigger than 10, so we're good there.
n times 1 minus p is equal to 100 times
0.55, which is 55, which is bigger than 10,
so we're good there.
So we have all three conditions satisfied
and now we can use the normal
model for the sample proportion.
So let's continue.
We need to find the mean and we need
to find the standard deviation.
We can find that the expected
value of the sample proportion
is just the population proportion or 0.45.
We find that the standard deviation
of the sample proportion
is the square root of 0.45 times
0.55 divided by 100 or 0.0497.
So let's find the probability of observing
the sample proportion that's at least 0.47.
So what we're interested in is the
probability that p hat is at least 0.47.
So we can translate that
to a z-score
by subtracting off the mean and
dividing by the standard deviation.
So we're looking at the probability that z is
at least 0.47 minus 0.45 divided by 0.0497
or the probability that z is at least about 0.4.
So that's the probability that z is at least 0 minus
the probability that z is between 0 and 0.4.
And so we get 0.5 minus 0.1554 or 0.3446.
So that's what it looks like.
Here's the picture of the normal curve
with the area shaded in that we're
interested in, that's the green area.
And we see that that
probability is not real small.
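The same calculation can be reproduced in a few lines of Python using the standard library's NormalDist. This is a sketch of the arithmetic in the example, not a replacement for the table lookup. Because it does not round z to 0.4, the answer comes out near 0.344 rather than exactly 0.3446.

```python
import math
from statistics import NormalDist

p = 0.45        # population proportion from the example
n = 100         # sample size from the example
cutoff = 0.47   # we want P(p hat >= 0.47)

sd = math.sqrt(p * (1 - p) / n)   # standard deviation of p hat
z = (cutoff - p) / sd             # z-score of the cutoff
prob = 1 - NormalDist().cdf(z)    # upper-tail area of the standard normal
print(f"sd = {sd:.4f}, z = {z:.2f}, P(p hat >= {cutoff}) = {prob:.4f}")
```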
Now we can move on to the sampling distribution
of a mean and modeling sample averages.
The idea here is similar to the
idea for sample proportions.
Suppose for instance that we gathered one thousand
randomly selected apples and weighed them.
And suppose we get an average
weight of 0.35 pounds.
We gather 1000 more randomly selected apples
and we get an average weight of 0.31 pounds.
If we take all the possible samples of size 1000
of apples, we would get different average weights.
This is the sampling variability in the
mean and we need to be able to model that.
So let's look at the fundamental theorem of
statistics, also known as the central limit theorem.
What this states is that the sampling distribution of any
mean becomes nearly normal as the sample size grows.
All we need is for the observations to be
collected independently and with randomization.
We don't even care about the shape of the population distribution.
As long as the sample size is large enough, it doesn't matter
whether the population distribution is symmetric or skewed.
The sample mean has an approximately normal
distribution and we can use the normal model
to describe the distribution of the sample mean.
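Here is a small simulation sketch of that idea. It assumes an exponential population with rate 1, which is right-skewed with mean 1 and standard deviation 1. The population, the sample size of 50, and the 2000 repetitions are all illustrative choices, not part of the lecture.

```python
import math
import random
import statistics

rng = random.Random(2)   # fixed seed so the run is repeatable
n = 50                   # size of each sample (illustrative choice)
repeats = 2000           # number of samples drawn (illustrative choice)

# Exponential population with rate 1: right-skewed, mean 1, standard deviation 1.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(repeats)
]

se = 1 / math.sqrt(n)  # sigma over the square root of n
within_two = sum(abs(m - 1) <= 2 * se for m in sample_means) / repeats

print(f"mean of sample means: {statistics.fmean(sample_means):.3f}")
print(f"sd of sample means:   {statistics.stdev(sample_means):.3f}")
print(f"sigma / sqrt(n):      {se:.3f}")
print(f"share within 2 sd:    {within_two:.3f}")
```

Even though the population is skewed, the sample means center on the population mean, and roughly 95% of them fall within two standard deviations of it, which is what a normal model would predict.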
So when can we use the central limit theorem?
Well just like we had for the sampling
distribution of the proportion,
we have certain conditions that need to be satisfied
in order to use the central limit theorem
to find the sampling distribution of the mean.
First, we need independence,
so the sampled values must be
independent of each other.
Second, we need to have
a large enough sample.
In other words, we need the sample
size to be reasonably large,
typically that threshold is at least 30.
And again, we have a 10% condition where the sample size
has to be no larger than 10% of the population size.
If these conditions are satisfied, let the data come from a
distribution that has mean mu and standard deviation sigma.
Then if x bar is the sample mean
and n is the sample size,
x bar follows an approximately normal distribution with
mean mu and standard deviation sigma over the square root of n.
So let's do an example of it.
Suppose we take a random sample of 100 apples from
an orchard where there are over 10,000 apples
and the average weight of an apple there is 0.34
pounds with a standard deviation of 0.1 pounds.
What's the probability that the average weight
of our 100 apples is less than 0.32 pounds?
Before we do anything, we
need to check the conditions.
So let's start with independence.
The apples are randomly sampled, so independence
seems like a reasonable assumption here.
The sample size condition.
Well our sample size is 100 which is
much larger than 30, so we're okay here.
And the 10% condition.
One hundred apples is much less than 10% of
the population of apples in the orchard.
So we're okay there.
All these conditions are satisfied and
so we can use the central limit theorem.
Alright, so let's do it.
If x bar is the average weight of the apples
in our sample, then by the central limit theorem,
x bar follows an approximately normal
distribution with mean mu equals 0.34 pounds
and standard deviation sigma over the
square root of 100 or 0.01 pounds.
So what we want is the probability that x bar is less
than or equal to 0.32, so we translate that to a z-score
and we get the probability that z is
less than or equal to 0.32 minus 0.34,
observation minus mean, divided
by 0.01, the standard deviation.
Or in other words, the probability
that z is less than or equal to minus 2,
where z is a normal (0,1) random variable.
By symmetry, this is equal to the probability that
z is bigger than or equal to zero
minus the probability that zero is less than
or equal to z, is less than or equal to two,
so we get 0.5 minus 0.4772 or 0.0228.
So the following picture is gonna
illustrate the area under the normal curve.
See there's not very much of it.
So what that tells us is that observing an average
for a hundred apples of 0.32 pounds or less
is not that common when the
mean is actually 0.34 pounds.
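The apple calculation can be checked with a short Python sketch, again using the standard library's NormalDist. The variable names are just for illustration.

```python
import math
from statistics import NormalDist

mu = 0.34      # population mean weight in pounds
sigma = 0.1    # population standard deviation in pounds
n = 100        # number of apples sampled

sd_of_mean = sigma / math.sqrt(n)             # sigma over the square root of n
model = NormalDist(mu=mu, sigma=sd_of_mean)   # normal model for x bar
prob = model.cdf(0.32)                        # P(x bar <= 0.32)
print(f"sd of the mean = {sd_of_mean:.3f}, P(x bar <= 0.32) = {prob:.4f}")
```

The result agrees with the 0.0228 from the table to four decimal places.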
So let's talk a little bit more about variation.
Means vary less than individual values.
This is intuitive: as groups get larger,
their averages should be more stable.
Larger groups are typically more
representative of the population as a whole.
Their averages should be pretty stable
around the true population average.
It's far more likely that you're gonna
get a strange individual observation
than it is to get a strange average
of a thousand observations.
In other words, it's more likely to get
an apple that weighs less than 0.2 pounds
than it is to get a thousand apples whose
average weight is less than 0.2 pounds.
This is illustrated by the central limit
theorem, where as the sample size increases,
the standard deviation of
the sample mean goes down
because, remember, it's sigma divided by
the square root of the sample size.
So as the sample size increases, the
standard deviation of the mean gets smaller.
Let's look at a quick example.
What if we collected 400 apples from
the orchard instead of a hundred?
Our standard deviation goes from 0.01 to
0.1 over the square root of 400 or 0.005.
So our standard deviation got smaller
when we took a larger sample size.
And this is always going
to happen with means.
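A quick sketch of how the standard deviation of the mean shrinks as the sample grows. The sizes 100 and 400 are from the example; 1600 is an extra value added here only to show the pattern continuing.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean: sigma over the square root of n."""
    return sigma / math.sqrt(n)

sigma = 0.1  # standard deviation of individual apple weights, in pounds
for n in (100, 400, 1600):
    print(f"n = {n:5d}: sd of the mean = {sd_of_sample_mean(sigma, n):.4f}")
```

Notice that quadrupling the sample size only cuts the standard deviation in half, because of the square root.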
What can go wrong with sampling distributions?
Well first, don't confuse the sampling
distribution with the distribution of the sample.
The sampling distribution is the distribution
of the statistic that comes from the sample.
Beware of observations
that are not independent.
The central limit theorem does not apply
to observations that are not independent.
Be careful of small samples
especially from skewed distributions.
As the sample size increases, the central
limit theorem and the normal approximation
work well for data from any distribution.
However, if we have a population
distribution that is really far from normal
and a small sample from this population, then the
normal approximation is gonna perform very poorly.
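To illustrate the small-sample warning, here is a simulation sketch. It assumes an exponential population with rate 1 (skewed, mean 1, standard deviation 1) and samples of only 5 observations. Both are illustrative choices, not from the lecture. It compares the lower-tail probability the normal model predicts with what the simulation actually produces.

```python
import math
import random
import statistics
from statistics import NormalDist

rng = random.Random(3)        # fixed seed so the run is repeatable
n = 5                         # deliberately small sample
repeats = 20000               # number of simulated samples
mu, sigma = 1.0, 1.0          # mean and sd of the exponential(1) population
se = sigma / math.sqrt(n)     # standard deviation of the sample mean
cutoff = mu - 2 * se          # two standard deviations below the mean

normal_tail = NormalDist(mu, se).cdf(cutoff)   # what the normal model predicts
hits = sum(
    statistics.fmean(rng.expovariate(1.0) for _ in range(n)) <= cutoff
    for _ in range(repeats)
)
simulated_tail = hits / repeats                # what actually happens
print(f"normal model: {normal_tail:.4f}, simulated: {simulated_tail:.4f}")
```

The normal model predicts a tail of about 2.3%, but the simulated tail is far smaller, so with a sample this small from a skewed population the approximation is badly off.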
So, what have we done?
Well we talked about what a
sampling distribution is
and we talked about the sampling distributions of
both the sample proportion and the sample mean.
We did a couple of examples, one of each.
And then we described the
things that can go wrong
with the sampling distributions
of the proportion and the mean.
These sampling distributions are gonna be very
important throughout the rest of this course.
So make sure that you're familiar with them and
very comfortable using them as we move forward.
This is the end of lecture 1 and I look
forward to seeing you back for lecture 2.