Welcome back for lecture 8 in which
we'll discuss inference for paired data.
So let's start with an example to
motivate what we're gonna do here.
The question is: Do flexible work schedules
reduce the demand for resources?
The Lake County, Illinois, health department
experimented with the flexible four-day work week.
So for a year, the department recorded the mileage driven
by 11 field workers on an ordinary five-day work week.
Then it switched to a flexible four-day work
week and recorded the mileage for another year.
So here are the data for the
11 people that they looked at.
So we have the five-day mileage and
the four-day mileage for each person.
Now we wanna perform inference on the
differences in the mean mileage.
So the question is: Can we
use a two sample t-test?
No, we can't. Why is that?
Because each observation or each set of
measurements was taken on the same person,
so each person has two
observations taken on them.
This means that the two
groups are not independent.
So we violate assumption 1 for
the two sample procedures.
So what do we do?
We call these types of data,
paired data or matched pairs.
And one thing that we might think about
doing, is we look at the differences
in the four-day and the five-day
mileage for each individual
and then perform inference
on the differences.
Then what we have is we essentially have
one observation for each individual,
so we have one sample of differences.
So what we do then, is once we have the differences,
we analyze the differences in the same way
that we would apply the one-sample
t-procedures that we discussed before.
As long as all the conditions
for that are satisfied.
So let's take the five-day
minus the four-day mileage
and re-frame our data in such a way
that it just shows the differences.
So what we get is the data you see right here.
Each individual with a difference in
the five-day and the four-day mileage.
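The differencing step can be sketched in a few lines of Python. The lecture's data table is not reproduced in this transcript, so the mileages below are hypothetical placeholders for three workers, and the variable names are illustrative only.

```python
from statistics import mean, stdev

# Hypothetical mileages for three workers, NOT the lecture's data.
five_day = [2800, 3100, 2500]
four_day = [2600, 2900, 2550]

# Five-day minus four-day: one difference per worker.
differences = [a - b for a, b in zip(five_day, four_day)]

# Summary statistics of the single sample of differences.
d_bar = mean(differences)   # sample mean of the differences
s_d = stdev(differences)    # sample standard deviation (n - 1 denominator)
n = len(differences)

print(differences, d_bar, s_d, n)
```

From here on, only the differences are analyzed; the two original columns are no longer treated as separate samples.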
In order to carry out the paired t-procedures,
we have to have some conditions satisfied.
First of all, we need the
paired data condition.
Which simply says that our
data come in matched pairs.
Second, we need the independence condition.
So the differences have to be independent,
so they have to come from a random sample.
Third, the randomization condition.
The data must come from a random
sample or random assignment to groups.
So three and two often
take care of each other.
Four, the 10% condition.
The sample size has to be less
than 10% of the population size.
And five, the nearly normal condition.
The differences have to show near normality in
order to use the t-procedures for the paired data.
So how do we carry out a paired t-test?
Well, it looks a lot like the one-sample t-test.
Let's let mu D be the
population mean difference.
Then we hypothesize that mu D is
equal to some hypothesized value, mu 0,
versus one of the three alternatives:
that mu D is less than mu 0, mu D is greater
than mu 0, or mu D is not equal to mu 0.
Let's talk about the mechanics.
We'll let S D be the sample standard
deviation for the differences.
Then the test statistic is given by d bar minus mu 0
divided by the standard error of d bar,
where d bar is the sample mean of the differences
and the standard error of d bar is given by
SD over the square root of the sample size.
Under the null hypothesis, the test
statistic follows the t-distribution
with n minus 1 degrees of freedom, just like
it did in the one-sample t-procedures.
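Written out as a formula, the spoken description above corresponds to the following. The symbols s_d (for what the lecture calls SD) and SE are notation assumed here, since the slides are not part of the transcript.

```latex
t = \frac{\bar{d} - \mu_0}{SE(\bar{d})},
\qquad
SE(\bar{d}) = \frac{s_d}{\sqrt{n}},
\qquad
t \sim t_{n-1} \ \text{under } H_0 : \mu_D = \mu_0
```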
So let's do the example on the
mileage data that we just looked at.
Do we have paired data?
Yes, we do.
We have two observations on each individual
so we can look at the differences.
Do we have independence?
The individuals are
likely to be independent.
The randomization condition is not stated explicitly
in the problem, but we're going to assume it.
The 10% condition, the Lake County Health
Department has more than 110 field workers,
so we're good on the 10% condition.
Now to the right you see a
histogram of the differences.
And so for the nearly normal condition, we
have some problems with the normal assumption.
We have two peaks and a right skew.
But in order to get a feel for the mechanics
of the test, we're going to do it anyway.
So let's look at the mechanics.
Well first we need our summary
statistics for the differences.
We have the sample mean
difference is 982 miles.
The sample standard deviation is 1139.568
miles and our sample size is 11.
So in this test what we're assuming
initially is that there's no difference
between the mileage for the four-day
and the five-day work week.
So we're gonna assume that mu D is
zero; that's our null hypothesis.
Our test statistic then is d bar divided by
the quantity SD over the square root of n,
or 982 divided by the quantity 1139.568 over the
square root of 11, which gives us 2.858.
The significance level is 5%
and what we're looking to do
is to see if the five-day mileage on
average is greater than the four-day mileage.
So we reject the null hypothesis if our test statistic
takes a value greater than or equal to
t with 10 degrees of freedom at the .05 level, which is 1.812.
Our test statistic took a value of 2.858.
So we reject the null hypothesis and
conclude that there is evidence to suggest
that average mileage decreases during the
four-day work week versus the five-day work week.
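The arithmetic of this test can be checked with a short Python sketch. It uses only the summary statistics quoted in the lecture, and it takes the critical value 1.812 from the lecture's table rather than computing it; the variable names are illustrative.

```python
from math import sqrt

# Summary statistics for the differences, as quoted in the lecture.
d_bar = 982.0      # sample mean difference (miles)
s_d = 1139.568     # sample standard deviation of the differences (miles)
n = 11             # number of field workers
mu_0 = 0.0         # hypothesized mean difference under the null

# Standard error of d bar and the paired t statistic.
standard_error = s_d / sqrt(n)
t_stat = (d_bar - mu_0) / standard_error

# Upper-tail critical value for 10 degrees of freedom at the 5% level,
# taken from the lecture's table (not computed here).
t_critical = 1.812

reject_null = t_stat >= t_critical
print(round(standard_error, 3), round(t_stat, 3), reject_null)
```

The computed values agree with the lecture's standard error and test statistic up to rounding, and the one-sided rejection rule gives the same decision.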
What if we want a confidence
interval for the mean difference?
Well if the conditions for
the paired t-test are met,
then we can form a 100 times 1 minus alpha percent confidence
interval for the mean difference in the following way.
We take d bar plus or minus t* with n minus 1 degrees of
freedom times the standard error of the mean difference.
This is the same form as we had
in the one-sample t-interval.
For the mileage example, we wanna construct the
95% confidence interval for the mean difference.
So using the table, we
find that t* with 10 degrees of freedom is 2.228.
We found during the hypothesis test that d bar was
982 and the standard error of d bar was 343.593.
So when we form our confidence interval, we
take 982 plus or minus 2.228 times 343.593.
And what that gives us is an interval
of 216.475 up to 1,747.525 miles.
So what that tells us is that we are 95% confident
that the average mileage for the five-day work week
is between 216.475 and 1,747.525 miles
higher than that for the four-day work week.
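The interval calculation can be sketched as a small Python function. The function name paired_t_interval is hypothetical, and t* is passed in from the lecture's table rather than computed.

```python
from math import sqrt

def paired_t_interval(d_bar, s_d, n, t_star):
    """Return (lower, upper) for d_bar +/- t_star * s_d / sqrt(n)."""
    standard_error = s_d / sqrt(n)
    margin = t_star * standard_error
    return d_bar - margin, d_bar + margin

# Lecture values: d bar = 982, SD = 1139.568, n = 11, t* = 2.228 (10 df, 95%).
lower, upper = paired_t_interval(d_bar=982.0, s_d=1139.568, n=11, t_star=2.228)
print(f"95% CI for mu D: {lower:.2f} to {upper:.2f} miles")
```

The endpoints match the lecture's interval up to rounding of the standard error. Because the whole interval lies above zero, it is consistent with the hypothesis test's conclusion.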
With the paired t-test, there are a
bunch of things that can go wrong.
So here are some things that we want to avoid.
We don't want to use a two-sample
t-test when we have paired data
because we know that our groups are not
independent if we have paired data.
We don't want to use a paired t-procedure
when the data are not paired.
So those first two things kinda go together.
Don't forget to look out for outliers.
These can indicate problems with
the nearly normal assumption.
And do not use side-by-side boxplots or
histograms to look for the difference
between the means of the paired groups because
they're not from two independent groups.
So we're not doing this as we
would for a two-sample t-test.
So what have we done in this lecture?
Well, we examined the difference between
paired data and the type of data
that enables us to use a two-sample t-test.
We described how to carry out the paired t-test as well
as how to construct a paired t confidence interval
for the average difference for paired data.
We finished up by looking at
some things that can go wrong
and things that we wanna avoid when
we use the paired t-procedures.
This is the end of lecture 8 and I look
forward to seeing you back for lecture 9.