Welcome back for lecture 2.
What we're going to discuss is confidence
intervals for proportions.
So let's start with how we estimate
proportions using intervals.
As we discussed previously, if we
have several samples of the same size
and computed proportions of
successes in each sample,
we're going to get several different answers.
So this is why we have sampling distributions
but the question is, which answer is right?
And the answer to that is,
probably none of them.
So what do we do?
Instead of using one single estimate
to summarize the entire population,
perhaps we could give a range of reasonable
values for the population proportion.
We can construct this interval in such a way
that we are guaranteed a certain probability
that the interval captures the true
value of the population parameter.
This interval is known as a confidence interval.
The primary benefit to using an
interval instead of a single estimate
is that we no longer have to rely on that one
single value to estimate the population proportion
We now have a range of values that
are reasonable given our data.
So how do we create this interval?
Well let's first recall the sampling
distribution of p hat - the sample proportion.
Suppose that p is the population
proportion, and that n is the sample size.
Provided the success/failure condition is satisfied
and all of our observations are independent,
we know the sampling distribution of p hat.
The distribution is approximately normal, with mean p and standard
deviation square root of p times 1 minus p, over n.
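If you want to see this for yourself, here is a small simulation sketch. It is not part of the lecture's example: the population proportion p = 0.5, the number of repeated samples, and the function name are all assumptions chosen just for illustration. It draws many samples of size n and compares the spread of the sample proportions with the square root of p times 1 minus p over n.

```python
import math
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions


# Assumed values for illustration only (p = 0.5 is not from the lecture).
p, n = 0.5, 100
p_hats = simulate_sample_proportions(p, n, num_samples=5000)

theoretical_sd = math.sqrt(p * (1 - p) / n)
print("mean of sample proportions:", statistics.mean(p_hats))
print("sd of sample proportions:  ", statistics.pstdev(p_hats))
print("sqrt(p(1-p)/n):            ", theoretical_sd)
```

The simulated mean should sit close to p and the simulated standard deviation close to the formula's value, which is what the sampling distribution result says.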
So let's look at an example.
Let's go back to the Belief
in Evolution example.
We asked a hundred randomly selected
people if they believe in evolution.
In this setting, we don't know
the population proportion.
47 of the people in our sample, say
that they do believe in evolution.
We know our sample
proportion, p hat equals 0.47
In order to create a confidence interval for p - the
true proportion of people who believe in evolution
we need the sampling distribution of p hat.
So what's the problem?
Remember that the sampling distribution
depends on the population proportion.
We need to know p to get
the standard deviation.
But we don't know it, so how do we
find the standard deviation of p hat?
The answer is quite simple, we don't.
We use the estimate p hat to find the quantity
known as the standard error of p hat.
So instead of using p in the standard deviation
calculation, we simply find the standard error of p hat
by plugging in p hat for p.
So we have standard error of the sample proportion is the
square root of p hat times 1 minus p hat over the sample size.
So the question is, what about the mean?
What is it that we're trying to estimate?
It turns out that we don't need it in
the confidence interval, so that's a cool thing.
In this example, we get the standard
error of the sample proportion
is the square root of .47 times .53
over 100 which comes out as 0.0499
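Here is that standard error calculation as a short Python sketch. The function name standard_error is just a label chosen for this illustration.

```python
import math


def standard_error(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


# Evolution example: 47 of 100 people, so p_hat = 0.47 and n = 100.
se = standard_error(0.47, 100)
print(round(se, 4))  # 0.0499
```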
So what do we know?
Well we change the
success/failure condition just a bit,
and now we check that n p hat is at least
10 and n times 1 minus p hat is at least 10.
Here we have that n p hat is 100 times 0.47 or 47 and
that n times 1 minus p hat is 100 times 0.53 or 53,
so the success/failure
condition here is satisfied.
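A check like this is easy to write down. The sketch below works from the observed counts, since n times p hat is just the number of successes; the function name and the threshold parameter are hypothetical names for this illustration.

```python
def success_failure_ok(successes, n, threshold=10):
    """Return True when both the successes and the failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# Evolution example: 47 successes and 53 failures out of 100.
print(success_failure_ok(47, 100))  # True
# A sample with only 4 successes out of 100 would fail the check.
print(success_failure_ok(4, 100))   # False
```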
This tells us that the sampling distribution of p hat is
approximately normal with mean p and standard error 0.0499
This tells us that about 68%
of all samples of size 100
will have a sample proportion within one standard
error, 0.0499, of the population proportion.
This tells us again by the empirical rule that
about 95% of all samples of size 100
will have a sample proportion within 2 standard
errors, or 0.0998, of the population proportion.
So how do we construct the interval?
Well we're trying to capture
the population proportion.
So from the view of the sample proportion,
we know that there's a 95% chance that
the population proportion is no more than 2 standard
errors away from the sample proportion.
We can use this to our advantage and we do this
by simply adding and subtracting from p hat
the value 2 times the
standard error of p hat
and we have the endpoints of an interval that has a 95%
chance of capturing the true population proportion.
So in our example, we'll have p hat
plus or minus 2 times 0.0499
which gives us 0.47 plus or minus 0.0998
or an interval from 0.3702 to 0.5698.
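The same arithmetic can be sketched in a few lines of Python. This uses the rough multiplier of 2 standard errors from this part of the lecture; the function name is only an illustrative label.

```python
import math


def two_se_interval(p_hat, n):
    """Approximate 95% interval: p_hat plus or minus 2 standard errors."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = 2 * se
    return p_hat - margin, p_hat + margin


# Evolution example: p_hat = 0.47, n = 100.
lower, upper = two_se_interval(0.47, 100)
print(round(lower, 4), round(upper, 4))  # 0.3702 0.5698
```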
We have to be very cautious here.
Even if the interval does capture
the population proportion,
we still don't know the value
of the population proportion.
We can't even be sure that the interval
contains the population proportion.
So what can we say about
the population proportion?
Well let's start with the things that we can't
accurately say based on the confidence interval.
What we cannot say is that 47 percent
of all US adults believe in evolution.
The sample proportion is almost certainly
not equal to the population proportion.
We also cannot say that it is probably true that 47% of all
US adults believe in evolution.
Again we can't say that,
it's probably not true
because the sample proportion is very unlikely
to be equal to the population proportion
We also can't say the following:
We don't know exactly what the proportion
of US adults is that believe in evolution
but we know that it is
between 37.02% and 56.98%.
No we don't.
We can't be sure that our interval
contains the true population proportion.
So what can we say?
Well one thing that we can say
is that we are 95% confident
that between 37.02% and 56.98% of US
adults believe in evolution
This is a statement that describes our confidence
interval and this is the best we can do.
This particular interval is known
as a one-proportion z-interval.
We'll see several other types of confidence
intervals a little bit later on in the course.
What do we mean by confidence?
Well what we mean is confidence in the
process and not necessarily the result.
So formally when we say 95% confidence,
we don't refer to a 95% chance
that our interval contains the
true population proportion.
The population proportion is a fixed quantity
and it's either in the interval or it's not,
but we don't know the
answer to that question.
What we mean is that 95%
of samples of this size
will produce confidence intervals that
capture the true population proportion.
So what we often say is we are 95% confident
that the true proportion lies in our interval.
The uncertainty comes in whether the particular
sample we have is one of the successful ones
or one of the 5% that don't produce the interval
that captures the true population proportion.
So what we can envision is for each sample, let's just
draw a vertical line where the population proportion is.
And then we take a whole bunch of samples and we
calculate a 95% confidence interval based on each sample
and then lay the interval horizontally.
So we have this vertical line and we might have
one interval that's right around the line,
one interval that has the
line in it but barely,
we might have one way over on one
side that doesn't have the line in it,
that would be one of the unsuccessful ones.
So basically, all the intervals that have the line going
through them at any point will be the successful ones
and the intervals that don't have the line
going through them will be the unsuccessful ones.
What we would expect over the long run would
be for 5% of the intervals we create
to be unsuccessful. In other words, 5% of
our intervals won't have the line going through them.
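That long-run picture can be imitated with a simulation. The sketch below is only an illustration under stated assumptions: it pretends the true population proportion is 0.47 (in reality we never know it), uses the rough multiplier of 2 standard errors, and the function name and number of repeated samples are made up for this example.

```python
import math
import random


def coverage_rate(p, n, num_samples, multiplier=2.0, seed=1):
    """Fraction of simulated intervals that contain the assumed true proportion p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hat = successes / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        # The interval is "successful" if the vertical line at p passes through it.
        if p_hat - multiplier * se <= p <= p_hat + multiplier * se:
            hits += 1
    return hits / num_samples


# Assumption for illustration: the true proportion is 0.47.
rate = coverage_rate(p=0.47, n=100, num_samples=2000)
print("fraction of successful intervals:", rate)
```

The printed fraction should land in the neighborhood of 0.95, with the remainder being the unsuccessful intervals. It will not be exactly 0.95, because the normal approximation and the multiplier of 2 are both approximate.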
Let's look at margin of error and the trade
off between confidence and precision.
The margin of error is simply the
half-width of our confidence interval.
It's the extent of the interval on
either side of the sample proportion
If we want a higher level of confidence,
we need a larger margin of error
Let's think about archery.
If you have a big target,
you're more confident that you're going to
hit that than you are with a small target.
It's the same idea with the confidence interval,
you are more confident that a bigger interval
is going to contain the true population
proportion than you are for a small interval.
So as a result, a smaller margin of error
is associated with less confidence.
So there is a trade-off between confidence and
precision in the sense that higher confidence
means less precision.
So how do we change the confidence level?
Well, we find critical values.
In order to change the confidence level, what we
need to do is change the number of standard errors
we want to extend the interval
away from the sample proportion.
This number of standard errors
is known as the critical value.
So how do we find them?
Once you've selected a confidence level, you
can use the z-table to find the critical value
which we're gonna denote as z star (z*)
For a 95% confidence interval, the precise
critical value is z* equals 1.96
For a 90% confidence interval, the
precise critical value is 1.645
How do we come up with these?
Well we just look in the normal table
So we're using the normal distribution
to find critical values.
In finding the critical values
for a 95% confidence interval,
the aim is to find two values
between which lie 95% of the values
This would mean that there's
2.5% left out in each tail
so using the z-table you'd look
up the probability 0.9750, why?
Well the z-table gives you the probability
that a normal (0,1) random variable
takes a value
less than or equal to z*.
So if the probability that the normal (0,1) random
variable is greater than or equal to z* is 0.025,
then the probability that that normal (0,1) random
variable is less than or equal to z* is 0.975.
So when we look at the body of the table, we
find .975 and we find that z* is equal to 1.96
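If you don't have a z-table handy, the same lookup can be sketched with Python's standard library, where NormalDist().inv_cdf plays the role of reading the table backwards. The function name critical_value is just a label for this illustration.

```python
from statistics import NormalDist


def critical_value(confidence_level):
    """z* that leaves (1 - confidence_level) / 2 in each tail of the normal (0,1)."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)


print(round(critical_value(0.95), 2))  # 1.96
print(round(critical_value(0.90), 3))  # 1.645
```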
So here's a picture of what the critical region looks like
for the normal distribution for this confidence interval
We just take the middle 95%
of the normal distribution
We do a similar thing for
a 90% confidence interval
where we just look at the middle
90% of the normal distribution.
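To tie the critical values back to the trade-off between confidence and precision, here is a sketch that builds the interval for the evolution example at both confidence levels. The function name and the choice to return the margin alongside the endpoints are assumptions made for this illustration.

```python
import math
from statistics import NormalDist


def one_proportion_z_interval(p_hat, n, confidence_level):
    """Return (lower, upper, margin_of_error) for a one-proportion z-interval."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# Evolution example (p_hat = 0.47, n = 100) at the two levels from the lecture.
lower_90, upper_90, margin_90 = one_proportion_z_interval(0.47, 100, 0.90)
lower_95, upper_95, margin_95 = one_proportion_z_interval(0.47, 100, 0.95)
print("90%:", round(lower_90, 4), round(upper_90, 4), "margin", round(margin_90, 4))
print("95%:", round(lower_95, 4), round(upper_95, 4), "margin", round(margin_95, 4))
```

The 95% interval comes out wider than the 90% interval from the same data, which is the archery idea again: more confidence costs precision.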
In order to use the one proportion z interval,
we have to have four important conditions
that need to be satisfied in order
for the process to work well.
First of all, we need all of our trials to be independent
of each other; this is the independence assumption.
Second, we need randomization.
In other words, the data need to
be sampled or generated at random.
This can help ensure independence.
Third, the 10% condition.
The sample size shouldn't be greater
than 10% of the population.
And finally, number four - the success/failure condition,
where we observe at least 10
successes and at least 10 failures.
In the evolution example, all these conditions are
satisfied since the adults were chosen randomly,
and 100 is far less than
10% of all US adults.
We verified the success/failure condition earlier.
How do we choose a sample size?
Remember that we need to balance the trade-off
between confidence and precision,
and there's only one way to increase confidence
while maintaining the same level of precision,
and that's to choose a larger sample size.
So maybe we want to choose a sample size that gives us a
certain confidence level with a specified precision.
The margin of error is given by: the critical z times
the square root of p hat 1 minus p hat over n
So we can use algebra to find the desired sample
size needed to obtain a particular margin of error
This is typically done before any analysis is
carried out so that at that point,
the sample proportion is unknown.
In order to be conservative, we need to have a
margin of error as large as possible.
This is done by substituting .5 for p hat.
And in doing this with the algebra,
we get to the desired sample size is
n equals z* times .5 over the
margin of error, that whole thing, squared.
It is conservative because the margin of error is
maximized when p hat equals one-half, or .5.
That way, the n we get will work in
giving us the desired margin of error
regardless of what the
value of p hat is.
This is a worst-case scenario approach.
So let's try one.
So let's go back to the evolution example and the
goal here is to find the sample size necessary
to obtain a margin of error
of 0.03 with 95% confidence.
So assume we haven't taken a sample yet.
So here's the calculation,
z* is still 1.96, the
margin of error is 0.03
so desired sample size is z* 1.96 times 0.5 all divided
by the margin of error .03, that whole quantity squared.
So that gives us the desired
sample size of 1067.11
But there's a problem, we can't
sample .11 people, so what do we do?
Well it's no big deal, all we do is in order to be
conservative, we round up to the next whole number
so we would choose 1068 people
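Here is that sample size calculation as a Python sketch. The function name and the parameter names are hypothetical; p_guess defaults to 0.5, which is the conservative worst-case value described above.

```python
import math


def required_sample_size(margin_of_error, z_star=1.96, p_guess=0.5):
    """Return (exact n, n rounded up) for the desired margin of error."""
    # With p_guess = 0.5 this reduces to (z_star * 0.5 / margin_of_error) squared.
    n_exact = (z_star * math.sqrt(p_guess * (1 - p_guess)) / margin_of_error) ** 2
    return n_exact, math.ceil(n_exact)  # always round up to be conservative


n_exact, n_needed = required_sample_size(0.03)
print(round(n_exact, 2), n_needed)  # 1067.11 1068
```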
So what can go wrong?
Well here's a group of some of the
common issues with confidence intervals,
and here are some pitfalls to avoid.
Do not suggest that the population
proportion varies, it does not.
It's a fixed quantity,
it doesn't move around.
Don't claim that other samples
will agree with yours.
Don't be certain about the parameter.
In statistics, we're "confident".
Statistics is all about
So in statistics we're never certain
about anything - we're confident.
Don't forget that the point is about
estimating the population proportion.
Don't make confidence statements about the
sample proportion; you know that one.
There's no need to estimate
something that you know.
Don't claim to know more than what your interval
tells you and treat the whole interval equally.
Values near the center of the interval are not
necessarily any more or any less plausible
than values near the edges.
Beware of a margin of error
that's too large to be useful
An interval of 10% to 90% for
instance, is not very helpful.
Watch out for biased sampling techniques, and think
about whether or not your trials are independent
Alright so those are the common
pitfalls of confidence intervals.
In this lecture, what we talked about was just
constructing confidence intervals for proportions,
So we described how we do it
based on the normal distribution,
we described why we do it, we talked
about the meaning of confidence
And then we talked about some of the
common issues with confidence intervals.
So this is the end of lecture 2 and I look
forward to seeing you back for lecture 3.