Welcome back for lecture 5 in which we're gonna
talk about how to compare two proportions.
So let's start with an example.
Psychologists suggest that men
take more risks than women.
They ask whether having a woman by
their side reduces men's risk-taking behavior.
So for example, they examine seatbelt usage.
And the question is whether or not male
drivers wear their seatbelts more frequently
when there's a woman in the car with them.
So they randomly selected 4208 male
drivers with female passengers
and 2777, or 66%, of them
were wearing their seatbelts.
They also looked at a random sample of 2763
male drivers without women by their side
and 1363 or 49.3% of them
were wearing their seatbelts.
So the question is, does this suggest that male
drivers are more likely to wear their seatbelts
when a female is present in the car?
So what should we look for?
The first thing we need to do is we need to find
the statistic to investigate the difference.
So we know that if x and y are
independent random variables,
then the variance of the difference between x and y
is equal to the variance of x plus the variance of y
This means that the standard deviation
of x minus y is equal to the square root
of the variance of x plus the variance of y.
But remember that this only applies
to independent random variables.
So now we're dealing with two proportions.
Let's let p1 and p2 be the true proportions
in the independent populations.
We're interested in the difference
between the two, p1 minus p2
So how can we compare them?
Let's look at the standard deviation
of the difference of proportions
So we look at the difference
of sample proportions.
We let p hat 1 and p hat 2 be the corresponding
sample proportions for samples of size n1 and n2.
And let's recall that the variance of
p hat 1 is p1 times 1 minus p1 over n1
and that the variance of p hat 2
is p2 times 1 minus p2 over n2.
Since the samples are drawn independently,
the sample proportions are independent.
Therefore, we can look at the
variance of p hat 1 minus p hat 2
as the sum of the variances
of the two sample proportions.
Therefore the standard deviation
of p hat 1 minus p hat 2
is just the square root of p1 times 1 minus
p1 over n1 plus p2 times 1 minus p2 over n2
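Written out in symbols, the two steps just described look like this. The notation (Var, SD, hats for sample proportions) is the usual textbook convention and is assumed here, since the lecture only states the formulas in words:

```latex
\operatorname{Var}(\hat{p}_1 - \hat{p}_2)
  = \operatorname{Var}(\hat{p}_1) + \operatorname{Var}(\hat{p}_2)
  = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}

\operatorname{SD}(\hat{p}_1 - \hat{p}_2)
  = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}
```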
Usually we don't know the true population proportions
so we have to estimate them as we did before.
This means calculating a standard error
as opposed to a standard deviation
So we have that the standard
error of p hat 1 minus p hat 2
is the square root of p hat 1 times 1 minus p hat 1
over n1 plus p hat 2 times 1 minus p hat 2 over n2
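As a small sketch of that standard error formula, here is one way it could be written in Python. The function name `se_diff` is made up for illustration; the numbers plugged in are the rounded seatbelt figures from the example.

```python
import math


def se_diff(p1_hat: float, n1: int, p2_hat: float, n2: int) -> float:
    """Standard error of p1_hat - p2_hat for two independent samples."""
    var1 = p1_hat * (1 - p1_hat) / n1  # estimated variance of p1_hat
    var2 = p2_hat * (1 - p2_hat) / n2  # estimated variance of p2_hat
    return math.sqrt(var1 + var2)      # variances add because the samples are independent


# Seatbelt example, using the rounded sample proportions
se = se_diff(0.66, 4208, 0.493, 2763)
print(f"standard error of the difference: {se:.5f}")
```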
In order to form confidence intervals
for the difference in proportions,
we have to have some conditions
satisfied, and there are four of them.
First of all, we need the
groups to be independent.
In other words, the two groups that we are
comparing are independent of each other.
The samples from each population
are collected independently.
Two, the independent/randomization condition
- this is not the same thing as one.
Each group's observations also have to
be independent or randomly selected.
Three, the 10% condition.
As before, the samples should not be larger
than 10% of their respective population sizes.
And four, the success/failure condition.
We observe at least 10 successes
and 10 failures in each group.
If these conditions are satisfied,
then we can form confidence intervals
for the difference between
the population proportions.
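Of the four conditions, only the success/failure condition can be checked directly from the counts; the other three depend on how the study was designed. A minimal sketch of that one check, with a hypothetical helper name `success_failure_ok`, might look like this:

```python
def success_failure_ok(successes: int, n: int, threshold: int = 10) -> bool:
    """True if a group has at least `threshold` successes and failures."""
    failures = n - successes
    return successes >= threshold and failures >= threshold


# (successes, sample size) for the two seatbelt groups
groups = {
    "female passenger": (2777, 4208),
    "no female passenger": (1363, 2763),
}

all_ok = all(success_failure_ok(x, n) for x, n in groups.values())
print("success/failure condition met in both groups:", all_ok)
```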
So let's think about how we do it
While the interval looks similar to the one from before,
we'll take p hat 1 minus p hat 2 this time,
plus or minus z* times the standard
error of p hat 1 minus p hat 2,
so we're taking a statistic from a
sample whose distribution we know,
plus or minus z* times the
standard error of that statistic.
This is something we've done before.
So writing it out in full, we'll
have p hat 1 minus p hat 2
plus or minus z* times the square root of
p hat 1 times 1 minus p hat 1 over n1
plus p hat 2 times 1 minus p hat 2 over n2.
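The interval formula can be sketched as a short Python function. This is an illustration, not code from the course: the name `two_prop_z_interval` and its arguments are assumptions, and z* is taken from the standard library's `statistics.NormalDist` rather than a table. It takes raw counts, so results can differ slightly from hand calculations with rounded proportions.

```python
import math
from statistics import NormalDist


def two_prop_z_interval(x1: int, n1: int, x2: int, n2: int,
                        confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for p1 - p2 from success counts x1, x2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Critical value z*: leaves (1 - confidence) / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se


# Seatbelt counts: with and without a female passenger
low, high = two_prop_z_interval(2777, 4208, 1363, 2763)
print(f"95% interval for p1 - p2: ({low:.4f}, {high:.4f})")
```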
We can use this to investigate the question of whether
men are more likely to wear their seatbelt
when a woman is in the car.
So let's do it, let's create a 95% confidence
interval for the difference in the proportions.
Let p1 be the percentage of men that wear
seatbelts when a woman is in the car
and let p2 be the percentage of men that wear
seatbelts when a woman is not in the car.
Let p hat 1 and p hat 2 be the
corresponding sample proportions.
So first we have to check the conditions.
Do we have independent groups?
Well the samples were taken independently
so this condition is satisfied.
Do we have independence/randomization?
The men were randomly selected in each
group so condition two is satisfied.
Condition three, the 10% condition.
Both samples are a very small percentage of a
large population so we're good on condition three.
And four, the success/failure condition.
We observed 2777 successes in the first group
and 1363 successes in the second group.
We also observed a sufficient number of failures in both groups,
1431 and 1400, so the success/failure condition is satisfied.
So since all the conditions are satisfied, we can
use what we call the two-proportion z-interval
So let's do the mechanics
We observed p hat 1 equals 0.66,
we observed n1 equals 4208,
p hat 2 equals 0.493,
and n2 equals 2763.
The critical value for a 95% confidence
interval we've seen before, is z* equals 1.96
So using the formula for the confidence
interval that we've seen before,
we get .66 minus .493 plus or minus 1.96
times the square root of .66 times .34
over 4208 plus .493 times .507 over 2763
So this leads to an
interval of about .1435 to .1905
So what we're saying here
is that we're 95% confident
that the proportion of men who wear seatbelts when a
woman is in the car is between .1435 and .1905
higher than the same proportion
when no women are in the car.
Note that zero's not in the interval
This is important.
What this indicates is if we were to conduct a
hypothesis test of the null hypothesis p1 minus p2 equals 0
against the alternative hypothesis
p1 minus p2 is not equal to 0,
we would reject the H0 at
the 5% significance level.
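To see the arithmetic and the "is zero in the interval" check together, here is a rough sketch that redoes the calculation step by step from the rounded values on the slide. The variable names are just illustrative.

```python
import math

# Rounded values from the seatbelt example
p1_hat, n1 = 0.66, 4208
p2_hat, n2 = 0.493, 2763
z_star = 1.96  # critical value for 95% confidence

diff = p1_hat - p2_hat
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
margin = z_star * se  # margin of error: z* times the standard error
low, high = diff - margin, diff + margin

# If 0 lies outside the interval, "no difference" is not a plausible value
zero_inside = low <= 0 <= high
print(f"difference: {diff:.3f}, margin of error: {margin:.4f}")
print(f"interval: ({low:.4f}, {high:.4f}); contains 0: {zero_inside}")
```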
Let's look at hypothesis testing for a difference in
proportions and we'll motivate this one with an example as well.
In a national sleep foundation study, 995
people were asked whether they snore.
They were broken up by age category, we had 184 people
under the age of 30, of which 26.1% of them snored.
We also had 811 in the over 30
age group of which 39.2% snored.
So the question is, is there a difference in the
percentage of people over and under 30 who snore?
and we'll need a hypothesis
test to answer this question.
So in this example, let's let p1 denote the
percentage of people under 30 that snore
and let's let p2 be the same
percentage in the over 30 age group.
We set the null hypothesis
H0 to be p1 minus p2 equals 0
So let's measure variability, which
means calculating a standard deviation
Under the null hypothesis, the
difference in proportions is zero.
But we don't know the individual
values so we need to estimate them.
We do this by what's called pooling the
proportions and using them to calculate
the standard error in the difference
in the two sample proportions.
This pooled estimator is named p hat pooled and it's
given by n1 p hat 1 plus n2 p hat 2 over n1 plus n2
In other words, it's the number of successes in the first
group plus the number of successes in the second group
divided by the total number of observations.
Then we calculate the standard
error in the usual way.
The pooled standard error of the p
hat 1 minus p hat 2 is given by
the square root of p hat pooled times 1 minus p hat pooled over
n1 plus p hat pooled times 1 minus p hat pooled over n2
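The pooled proportion and the pooled standard error can be sketched in a few lines of Python. The function names are hypothetical, and the inputs are the snoring figures quoted earlier (184 people at 26.1% and 811 people at 39.2%).

```python
import math


def pooled_proportion(p1_hat: float, n1: int, p2_hat: float, n2: int) -> float:
    """Total successes across both groups divided by total observations."""
    return (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)


def pooled_se(p1_hat: float, n1: int, p2_hat: float, n2: int) -> float:
    """Standard error of p1_hat - p2_hat using the pooled proportion."""
    p = pooled_proportion(p1_hat, n1, p2_hat, n2)
    return math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)


p_pooled = pooled_proportion(0.261, 184, 0.392, 811)
se_pooled = pooled_se(0.261, 184, 0.392, 811)
print(f"pooled proportion: {p_pooled:.4f}, pooled SE: {se_pooled:.4f}")
```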
So in order to carry out the test, we need
to satisfy the success/failure condition.
We rely on the pooled sample proportion now because we don't
know the hypothesized population proportions individually
In order to use the following procedure, what
we need is n1 p hat pooled to be at least 10.
n1 times 1 minus p hat
pooled is at least 10,
n2 times p hat pooled is at least 10 and n2 times
1 minus p hat pooled is also at least 10.
If this condition holds along with the independence/randomization,
the independent groups and the 10% condition,
we can use a z-test.
The test statistic for this
hypothesis test is a z-statistic
and is given by p hat 1 minus p hat 2 minus 0 divided
by the pooled standard error of p hat 1 minus p hat 2
And we use this statistic
to calculate p-values.
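Here is a minimal sketch of turning the z-statistic into a two-sided p-value. It assumes the standard normal distribution from Python's `statistics.NormalDist`; the function names and the `null_diff` parameter are illustrative, not part of the lecture.

```python
from statistics import NormalDist


def z_statistic(p1_hat: float, p2_hat: float, se_pooled: float,
                null_diff: float = 0.0) -> float:
    """(observed difference - hypothesized difference) / pooled standard error."""
    return (p1_hat - p2_hat - null_diff) / se_pooled


def two_sided_p_value(z: float) -> float:
    """Probability that a standard normal is at least as extreme as z, in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Passing the two sample proportions and the pooled standard error to `z_statistic`, and then the result to `two_sided_p_value`, gives the p-value for the two-sided alternative.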
So let's go back to the snoring example.
We wanna test the hypothesis p1 minus p2 is equal to 0
against the alternative that p1 minus p2 is not equal to 0
Alright, we've stated the hypothesis,
so now we have to check the conditions.
First, the independent groups assumption.
This is reasonable, since it's unlikely that what happens
in one group affects what happens in the other group.
The individuals were randomly sampled
in each group so we're okay here.
The 10% condition, the sample sizes are much less
than 10% of the population size in each group
so we're okay with three.
For the success/failure condition, we
need p hat pooled in order to do it.
So let's find it,
P hat pooled is 184 times
.261 plus 811 times .392
divided by 184 plus 811 which gives us
a pooled sample proportion of .3677
We have n1 p hat pooled equals 67.67,
n1 times 1 minus p hat pooled is 116.33.
We have n2 times p hat
pooled equal to 298.26
and we have n2 times 1 minus
p hat pooled equal to 512.74
so that all of these things are at least
10 so we have all the conditions satisfied
and we can use the procedures
described in the last slide.
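The success/failure check with the pooled proportion can be reproduced with a short script. This is only a sketch using the snoring figures from the example; the dictionary labels are made up for readability.

```python
# Snoring example: sample sizes and sample proportions
n1, p1_hat = 184, 0.261  # under 30
n2, p2_hat = 811, 0.392  # over 30

p_pooled = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)

# Expected successes and failures in each group under the pooled proportion
expected = {
    "n1 * p_pooled": n1 * p_pooled,
    "n1 * (1 - p_pooled)": n1 * (1 - p_pooled),
    "n2 * p_pooled": n2 * p_pooled,
    "n2 * (1 - p_pooled)": n2 * (1 - p_pooled),
}

condition_met = all(value >= 10 for value in expected.values())

for label, value in expected.items():
    print(f"{label}: {value:.2f}")
print("success/failure condition met:", condition_met)
```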
So let's do the mechanics.
First we need the standard error.
So we have the pooled standard
error of p hat 1 minus p hat 2;
using the formula stated before gives us the
square root of .3677 times 1 minus .3677 over 184
plus .3677 times 1 minus .3677 over 811
gives us a pooled standard error of 0.0394
So the test statistic then is z equals p hat 1 minus
p hat 2 divided by the pooled standard error,
or .261 minus .392 over .0394 which
gives us a z-statistic of minus 3.32
So the p value is 2 times the probability that a normal
(0,1) random variable is less than or equal to minus 3.32
which is the same thing as 2 times the probability
that a normal random variable takes a value
that's at least as large as positive 3.32
So we find this by taking 2 times 1 minus the
probability that a normal (0,1) random variable
takes a value less than or equal to
3.32 which gives us 2 times 0.00045 or 0.0009
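The symmetry step in that calculation can be checked numerically. The sketch below assumes the standard normal from Python's `statistics.NormalDist` and uses the rounded z-statistic of minus 3.32 from the example.

```python
from statistics import NormalDist

z = -3.32
std_normal = NormalDist()  # mean 0, standard deviation 1

lower_tail = std_normal.cdf(z)       # P(Z <= -3.32)
upper_tail = 1 - std_normal.cdf(-z)  # P(Z >= 3.32), equal by symmetry

p_value = 2 * lower_tail  # two-sided alternative, so count both tails
print(f"one tail: {lower_tail:.5f}, two-sided p-value: {p_value:.4f}")
```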
So what does that mean?
Well that's a very small probability
so if our null hypothesis were true,
what we observed would be very unlikely.
So we reject the null
hypothesis and we conclude that
there's a difference between the prevalence
of snoring between the two age groups.
So in the two-proportion
z-test, what can we do wrong?
Things that we wanna avoid:
We don't want to use the two-sample proportion methods
when the samples are not independent,
we don't want to apply these methods if
there's no randomization in each group,
and we don't want to interpret a significant
difference in proportions as evidence of cause.
Often proportions come from observational studies
which we know can't be used to determine cause
So what have we done in this lecture?
Well we basically discussed how to compare
proportions from two different populations.
So we talked about how to form confidence intervals
for the difference between two population proportions
and we talked about how to carry out a hypothesis test
for the difference between two population proportions.
We did a couple of examples, one with
the seatbelts and one with the snoring
and then at the end we talked about
the things that could go wrong.
So be sure to avoid the
things that can go wrong
and to follow all the mechanics that are described
in the previous sections of the chapter.
and this is the end of lecture 5.