Welcome to Lecture 9, in which we'll address Randomness and Survey Sampling.
So randomness, you hear about this a lot but what’s the big deal about random sampling?
We often emphasize in statistics that the sample must be drawn randomly, but why?
Well, random sampling has two key advantages.
First of all, nobody can guess the outcome before it happens
and we want things to be fair, so we want outcomes to be equally likely.
By doing this, we ensure that separate outcomes do not affect each other,
and in other words, outcomes of separate draws are independent.
Random sampling is actually fairly hard to do,
but there are several software packages that do it for you
including Excel, SAS, R and several internet widgets.
Let’s look at sample surveys. There are three key ideas in survey sampling.
The first is to examine a part of the whole.
Next, we randomize. And the big key is the sample size:
the sample size is paramount in survey sampling.
In this section, what we’re gonna talk about are types of sampling,
bad samples and consequences, and we’ll also address how we can prevent bad sampling.
Examining part of a whole. Why do we sample?
Well, the goal is to learn about an entire population of individuals,
but the problem is it’s usually impractical or even impossible to examine all individuals.
The solution is to draw a sample from the population.
We do this every day.
For example, if you’re interested in knowing how a pot of soup that you’re making tastes,
you’re not gonna eat the whole pot, that'd be mean
because you’ve been making a big pot of soup,
you’re probably going to feed more than one person,
so you don’t wanna eat the whole thing.
You take a spoonful or two
and then you trust that these spoonfuls are representative of how the pot tastes.
Here the population would be the whole pot of soup
and the sample is just the two spoonfuls.
In sample surveys, what we do is we give them to a small group of people
in hopes that they will give an insight into the views of the entire population.
Examples of these are opinion polls, election surveys
that ask “how do you plan to vote?”, and exit polls that ask, once you leave the polls, “how did you vote?”
The “survey”, in the soup example, is asking, “how does the soup taste?”
And the hope is that the taste of the two spoonfuls of soup that you took
is representative of how the entire pot tastes.
Bias is a big problem in many surveys
and this happens when the sample is not representative of the population.
How do we avoid these problems?
Well, the main goal of survey sampling is to get a good sense
of what the views are in the entire population.
In order to do this, we try to get a sample that represents the population.
Sampling methods often over or under-emphasize some characteristics of the population,
and these types of samples are said to be biased.
For example, in 1936, a presidential election poll
for the election between Alf Landon and Franklin Delano Roosevelt was conducted.
Literary Digest used a sample of 2.4 million ballots from 10 million that they mailed out,
but some problems occurred with that.
The result of the poll was that Literary Digest predicted
that Alf Landon would win the election with 57% of the popular vote
to 43% for Franklin Delano Roosevelt,
but Roosevelt won the election with 62% of the popular vote to 37% for Landon.
But what Literary Digest did was they used the phone book to choose addresses to send the polls to,
but in 1936, in the middle of the Depression, phones were a luxury,
so the poll sampled more wealthy people than poor voters
and the result was that the sample was not representative of the population.
They totally undercovered the poor part of the population.
Modern polls get around this problem because they get representative samples
by selecting the individuals in the sample at random.
When we do random sampling, what we’re doing, if we go back to the soup analogy,
is we’re stirring the pot. How do we get a representative sample?
Well, suppose we add salt to the pot and then we just taste the soup from the top
without stirring the pot or without stirring the salt in first.
Well, if you just take the spoonfuls from the top,
then that might lead you to believe that the soup is salty
because the top part of the soup is where you put the salt,
and in order to mix the salt in and randomize the pot, you need to stir it up.
Stirring the pot in survey sampling is done by randomizing.
How do we get a representative sample?
Well, randomizing protects us from the influence of many features of the population
by ensuring that on average, the sample looks a lot like the population.
This is a lot easier than trying to match the sample to the population
because we can’t account for every characteristic of the population
and we can’t usually match the sample to the population for each of these characteristics.
In survey sampling, it’s all about the sample size and the question is often,
“is our sample large enough?”
How large of a random sample do we need for the sample
to be reasonably representative of the population?
A common thought is that we need a large percentage of the population.
But this isn’t really true, the population size doesn’t really matter.
All that matters is the number of individuals that are in our sample.
For example, if we take a sample of 100 students at a particular college,
that represents the student body about as well as a sample of 100 voters
represents the electorate in a particular state,
even though the population size is likely to be vastly different.
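To put some rough numbers on that, here is a minimal Python sketch. It assumes simple random sampling without replacement and uses the standard formula for the standard deviation of a sample proportion, including the finite population correction. The population sizes (20,000 students, 5,000,000 voters) and the true proportion of 0.5 are made-up values for illustration, not figures from this lecture.

```python
import math

def sd_of_sample_proportion(p, n, N):
    """Standard deviation of a sample proportion for a simple random
    sample of size n drawn without replacement from a population of size N."""
    fpc = (N - n) / (N - 1)  # finite population correction
    return math.sqrt(p * (1 - p) / n * fpc)

# Hypothetical values: true proportion 0.5 in both populations.
college = sd_of_sample_proportion(0.5, 100, 20_000)
state = sd_of_sample_proportion(0.5, 100, 5_000_000)
bigger_sample = sd_of_sample_proportion(0.5, 400, 5_000_000)

print(round(college, 4), round(state, 4), round(bigger_sample, 4))
```

Under these assumptions the first two values come out nearly identical, at about 0.05, even though one population is 250 times the size of the other. Only changing the sample size (100 to 400) moves the number much, roughly cutting it in half.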
One way to get information about the population is to take a census.
This is where we ask everyone.
Formally, a census is a sample that includes everyone in the population,
so we just look at everybody. There are a lot of problems with the census.
First of all, a census can be really impractical or even impossible to carry out.
Populations change, babies are born, people die
and so there are people that are entering and leaving the population all the time,
and it can be very complex to do.
Surprisingly maybe, a census may not give us information about the population
that is as accurate as a random sample would give, and this goes back to the second point.
By sampling everybody, we’re not accounting for changes in the population
but random samples remain representative of the population.
Let’s look at the population parameters and sample statistics.
In the example, we’re trying to figure out how the population feels about a particular issue
or what percentage of population feels a particular way about a certain issue.
This percentage is known as the population parameter.
This is an unknown quantity that pertains to the population.
We use the percentage from the sample to estimate the population parameter.
This is an example of what we call a sample statistic.
So to help you remember this, just remember that the p's go together and the s's go together:
parameters pertain to the population, statistics pertain to the sample.
A sample statistic, formally, is any summary or quantity calculated from the sample data.
Other parameters that we’ve dealt with before were the population mean
and the population standard deviation in the section where we talked about the normal distribution.
What’s the bottom line?
The bottom line is that we draw samples because we can’t work with the entire population,
but we want the sample to be representative of the population.
How do we select a representative sample? Well, we're gonna discuss four ways to do it.
The simplest is the simple random sample. We'll discuss stratified sampling.
We'll discuss cluster and multistage sampling, and we’ll discuss systematic sampling.
All these methods provide a mechanism for selection of a representative sample from the population.
Let’s start with the simple random sample. What is it and how do we get it?
Well, a simple random sample is a sampling scheme that has two key properties.
First, every possible sample of the same size has the same chance of being selected.
Secondly, every individual has the same chance of being selected.
In other words, every combination of individuals has the same chance of being selected.
There’s not one sample that’s more likely than another under simple random sampling.
In order to do it, we need to first define where the sample comes from,
and we call this a sampling frame.
The sampling frame is a list of individuals from which the sample is drawn,
and the easiest way to use this to draw a simple random sample
is to assign a number to each element in the sampling frame
and then randomly choose a set of numbers to comprise our sample,
where each number is equally likely.
Samples drawn at random are invariably going to differ from one another,
so you draw one sample, you draw another one, they’re going to be different.
This leads to different values of statistics for each sample that we draw.
And the sample-to-sample differences are known as sampling variability.
This variability is not a problem, it’s just the nature of the beast,
and it doesn’t mean our sample is not representative of the population.
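Here is a small Python sketch of what sampling variability looks like. The population is hypothetical (10,000 people, 40% of whom hold some opinion), and the names `population`, `parameter` and `sample_proportion` are just labels chosen for this illustration. The fixed seed is there only so the sketch is repeatable.

```python
import random

random.seed(1)  # fixed seed only so the sketch is repeatable

# Hypothetical population: 10,000 people, 40% of whom hold some opinion (1 = yes).
population = [1] * 4_000 + [0] * 6_000
parameter = sum(population) / len(population)  # the population proportion

def sample_proportion(pop, n):
    """Draw a simple random sample of size n and return its proportion of 1s."""
    sample = random.sample(pop, n)
    return sum(sample) / n

# Five separate samples of 100 give five sample statistics.
sample_stats = [sample_proportion(population, 100) for _ in range(5)]
print("parameter:", parameter)
print("sample statistics:", sample_stats)
```

The parameter stays fixed at 0.4, while the five sample statistics typically scatter around it. That scatter is the sampling variability.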
Let’s do an example of a simple random sample.
Suppose we have 80 students enrolled in an introductory statistics course,
and we wanna sample 5 of them.
We’ll assign each one a number, 1 to 80, no repeats, and we’re gonna draw 5 of them at random
using some sort of random number generator, or by placing numbers into a bowl
and shaking the bowl up, and the numbers we get out are 24, 13, 2, 74 and 54.
The students with these numbers are the individuals in the simple random sample.
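If you'd rather let software do the drawing, this example could be sketched in Python roughly as follows. It uses `random.sample` from the standard library; the variable names are made up for this illustration, and the five numbers you get will generally differ from the ones above.

```python
import random

# Sampling frame: the 80 enrolled students, numbered 1 to 80 with no repeats.
sampling_frame = list(range(1, 81))

# random.sample draws without replacement, so every set of 5 numbers
# is equally likely and no student can be picked twice.
srs = random.sample(sampling_frame, 5)
print(srs)
```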
The next type of sampling we’re gonna look at is stratified sampling,
and this boils down to sampling from different groups.
Sometimes we need a more complicated sampling scheme
in order to draw a representative sample from a large population,
and one way to handle this is to slice the population into homogeneous groups called strata,
and then we carry out simple random sampling in each stratum
and then the samples are combined, and this is what we know as a stratified sample.
Let’s look at an example of a stratified sample.
At a large university, there’s a question about how people feel about their football team.
The campus is 60% men, 40% women. We wanna sample a hundred people.
We might wanna do a stratified sample here.
If we do a simple random sample, we could end up with 80 men and 20 women.
This is not representative of the population, men are over-represented, women are under-represented.
We’re gonna divide the population into strata by sex
and then we’re gonna take a simple random sample of 60 from the men and a simple random sample of 40 from the women,
and then combine the results of these two simple random samples.
This reduces the sampling variability, and this is the most important benefit of stratified sampling.
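A rough Python sketch of this stratified scheme might look like the following. The campus of 10,000 students and the helper `stratified_sample` are assumptions made up for this illustration; only the 60/40 split and the sample of 100 come from the example.

```python
import random

# Hypothetical sampling frame of 10,000 students: 60% men, 40% women.
frame = [("M", i) for i in range(6_000)] + [("W", i) for i in range(4_000)]

def stratified_sample(frame, sizes):
    """Take a simple random sample of the requested size within each stratum."""
    sample = []
    for stratum, n in sizes.items():
        members = [person for person in frame if person[0] == stratum]
        sample.extend(random.sample(members, n))
    return sample

sample = stratified_sample(frame, {"M": 60, "W": 40})
counts = {s: sum(1 for person in sample if person[0] == s) for s in ("M", "W")}
print(counts)  # always 60 men and 40 women, by construction
```

Because the split is fixed ahead of time, the 80-and-20 kind of sample can't happen here. The randomness only decides which men and which women get picked.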
We also have cluster and multistage sampling, and we’ll do this one by example.
Suppose we wanna assess the reading level of a book based on the length of the sentences in that book.
How might we do it?
Well, simple random sampling might be awkward if we number each sentence
and then find, say, like the 242nd sentence and the 3965th sentence in the book.
We might consider picking a few pages at random
and then count the lengths of the sentences on those pages,
and that works if we believe that each page is representative of the entire book,
in terms of reading level.
Splitting the population into representative clusters can be more practical.
We choose a few clusters and then perform a census in each of them.
This is called cluster sampling.
To formally describe a cluster sample, let’s talk about how we collect it.
Clusters are usually selected for reasons of practicality, efficiency, or cost.
If each cluster represents the population fairly,
then cluster sampling will provide a representative sample.
We have to be very careful in how we choose the cluster, however,
because we wanna avoid introducing bias into our sample
and we’ll talk more about bias here in a few minutes.
This is not the same as stratified sampling;
stratified sampling is done by dividing the population into groups
based on one or more characteristics and then doing simple random sampling in each group.
Strata are not representative of the population. Clusters are.
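The page-sampling idea can be sketched in Python along these lines. The book here is entirely made up (300 pages with randomly generated sentence lengths), so treat it as an illustration of the mechanics under those assumptions, not as real data.

```python
import random

random.seed(2)  # fixed seed only so the sketch is repeatable

# Hypothetical book: 300 pages, each a list of sentence lengths in words.
book = {
    page: [random.randint(5, 30) for _ in range(random.randint(10, 20))]
    for page in range(1, 301)
}

# Cluster sampling: pick a few pages (clusters) at random,
# then take a census of every sentence on those pages.
chosen_pages = random.sample(sorted(book), 5)
lengths = [length for page in chosen_pages for length in book[page]]

mean_length = sum(lengths) / len(lengths)
print(chosen_pages, round(mean_length, 1))
```

Notice that the random step happens only once, at the page level. Within a chosen page nothing is sampled; every sentence is counted.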
Sometimes we do multistage sampling and this is where we combine sampling methods together.
For example, in the reading level example,
maybe we think that the reading level increases as the book goes on
because the concepts get harder.
We would wanna avoid samples selected heavily from the early parts of the book
or from the later parts of the book.
Suppose the book has 7 parts. We might randomly choose one chapter from each of the 7 parts,
and then randomly select a few pages from each of those chapters.
We may then randomly select a few sentences from each of those pages.
Let’s look a little bit closer at how this is working.
The sampling scheme is as follows: we first stratified by the part of the book.
We randomly chose a chapter from each part.
In each chapter, we chose pages as clusters, and then for each cluster,
we did a simple random sample of sentences in that cluster.
There were three stages of the sampling here, we have the stratified first,
then the cluster sampling, and then the simple random sample to finish it off,
and this is an example of a multistage sampling scheme.
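The three stages just described could be sketched in Python something like this. The shape of the book (4 chapters per part, 20 pages per chapter, 15 sentences per page) and the choice of 3 pages and 4 sentences are assumptions picked for the illustration; only the 7 parts come from the example.

```python
import random

random.seed(3)  # fixed seed only so the sketch is repeatable

def make_page():
    """A hypothetical page: a list of 15 sentence lengths in words."""
    return [random.randint(5, 30) for _ in range(15)]

# Hypothetical book: 7 parts, 4 chapters per part, 20 pages per chapter.
book = {
    part: {chapter: [make_page() for _ in range(20)] for chapter in range(1, 5)}
    for part in range(1, 8)
}

sampled_lengths = []
for part, chapters in book.items():               # stage 1: stratify by part
    chapter = random.choice(list(chapters))       # one random chapter per part
    pages = random.sample(chapters[chapter], 3)   # stage 2: pages as clusters
    for page in pages:
        sampled_lengths.extend(random.sample(page, 4))  # stage 3: SRS of sentences

print(len(sampled_lengths))  # 7 parts x 3 pages x 4 sentences = 84
```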
One more method of sampling that we can use to obtain a representative sample of the population
is known as systematic sampling, and the idea is, for instance,
we wanna survey every 20th person from an alphabetical list of students.
What we would have to do is to randomly choose the starting point
in order to get a representative sample.
This makes the order of the list unassociated in any way with the responses that we want
and this type of scheme is known as a systematic sample.
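As a minimal Python sketch of the every-20th-person idea: the list of 400 students and the helper `systematic_sample` are hypothetical, made up just to show the random starting point followed by a fixed step.

```python
import random

# Hypothetical alphabetical list of 400 students.
students = [f"student_{i:03d}" for i in range(400)]

def systematic_sample(items, step):
    """Pick a random start within the first `step` positions,
    then take every step-th item from there on."""
    start = random.randrange(step)
    return items[start::step]

sample = systematic_sample(students, 20)
print(len(sample), sample[:3])
```

With 400 students and a step of 20, this always gives 20 students. Which 20 depends only on the random starting point.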
Now, let’s look at the problems we might run into in surveys and the desirable properties of surveys.
We want valid surveys, which means we’re getting the information we hope to get.
A valid survey provides the information we’re hoping to obtain,
or in other words, it answers the questions that we’re trying to ask.
Four questions we need to ask before we set out to survey.
What do I wanna know? Am I asking the right people?
Am I asking the right questions?
What would I do with the answers if I had them, do they address the things that I wanna know?
Common pitfalls in surveys, you need to know what you wanna know, what do you hope to learn?
You need to use the right sampling frame.
In order to get a representative sample, the sampling frame has to match the population.
You don’t wanna ask questions that you don’t need to know the answer to,
and you wanna ask specific questions instead of general ones.
For instance, “what was your grade in a class?”
is a much better question to ask than “how did you do in that class?”
You wanna ask for quantitative results when possible.
You wanna be careful in how you phrase your questions.
Often, it’s a good idea to do a trial on the survey that you eventually plan to give to a larger group,
so that you can help identify problems before you survey too many people
and have to deal with all of that information.
This type of survey is known as a pilot study.
Let's look at what not to do: common errors in survey sampling.
Don’t sample volunteers, under any circumstances, don’t use volunteers as your sample.
A volunteer response sample is a sample in which a large group of people are invited to respond
and those who choose to respond are counted.
There are many problems with this, but the biggest ones are that,
people who have really strong feelings about the questions at hand
are the ones who are most likely to respond and this leads to bias in your survey.
Internet surveys are the biggest culprits of this.
The ones that go on a news website or on espn.com, for example,
are volunteer response samples because they just give the question
and if you visit the website, you can answer the question.
It’s really hard to define a sampling frame for these, so in general it’s best not to use them.
Don’t sample because of convenience.
When we talk about a convenience sample,
what we mean is a sample in which we include only individuals who are convenient for us to sample,
and these are usually biased as well.
For example, if we go to the grocery store
and we’re wondering about the percentage of the population that likes vegetables,
and we stand at the produce section of the grocery store
and we just ask people who come by whether or not they like vegetables.
Well, you’re probably not gonna get a representative sample
because the people who don’t like vegetables are probably not going to the produce section.
This is a convenience sample, because it’s convenient to sample the people
who come through the place where you're standing.
This is probably going to be a biased sample
because it leaves out the people who don’t buy produce at the grocery store
or people who don’t come through the produce section at all.
We don’t wanna use a bad sampling frame.
Again, this is very important -- you must make sure that your sampling frame matches your population.
For example, a simple random sample from an incomplete sampling frame
introduces bias since there are parts of the population that might not be represented,
such as people in prison, homeless people or students that may be missed.
We need to be careful of undercoverage.
Undercoverage is just the under-representation of a particular group of the population in the sample.
For example, a survey about the satisfaction with a particular product ordered online
that is only offered to the person who ordered it
may under-represent some groups of the population.
Often products are ordered online that other people use besides the person who ordered it,
sometimes given as gifts, so it’s gonna under-represent the population of people
who are actually using that product, and this is an example of undercoverage.
Several types of bias can affect our surveys. One is non-response bias.
Not everyone who's asked to complete a survey will do so. They might run out of time,
or they might just not feel like doing it.
Those who do not respond might differ from those who do,
possibly on the very variables of interest.
We also have response bias, and this comes from anything in the survey
that can influence the responses.
The biggest offender for response bias is the wording of the questions.
For example, questions about prior illegal or embarrassing activities,
those suffer response bias because people don’t typically wanna admit to that,
especially to somebody who’s just out there, giving them a survey
and never met this person before, so they don’t wanna answer the questions honestly.
How do we prevent bias? Well, there are ways to do it.
We look for biases in any survey that we encounter,
and there’s no way to come back from a biased sampling method.
You can’t take a large enough sample size to overcome a bad sampling scheme.
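To see why a bigger sample can't rescue a bad scheme, here is a rough Python simulation. Everything in it is invented for illustration: a population of 200,000 voters, 25% of whom own phones, with phone owners supporting candidate A 60% of the time and everyone else 30% of the time. It is loosely in the spirit of the phone book problem from earlier, not a reconstruction of any real poll.

```python
import random

random.seed(4)  # fixed seed only so the sketch is repeatable

# Hypothetical population of 200,000 voters: (has_phone, supports_a).
# Phone owners (about 25%) support A 60% of the time; everyone else 30%.
population = []
for _ in range(200_000):
    has_phone = random.random() < 0.25
    supports_a = random.random() < (0.60 if has_phone else 0.30)
    population.append((has_phone, supports_a))

true_share = sum(s for _, s in population) / len(population)
phone_frame = [person for person in population if person[0]]  # biased frame

def estimate(frame, n):
    """Share supporting A in a simple random sample of size n from the frame."""
    return sum(s for _, s in random.sample(frame, n)) / n

small_good = estimate(population, 1_000)   # small sample, good frame
large_bad = estimate(phone_frame, 20_000)  # large sample, bad frame
print(round(true_share, 3), round(small_good, 3), round(large_bad, 3))
```

Under these made-up numbers, the small sample from the full population lands close to the true share, while the sample that is 20 times larger but drawn only from phone owners stays far off. More data from the wrong frame just gives a more precise wrong answer.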
We need to spend time and resources reducing biases.
This should be our biggest focus when we construct our survey.
We need to think about members of the population who could have been excluded from the study.
If possible, you wanna do a pilot study, and you always wanna report your sampling methods in detail,
that way, if any questions come up about how the survey was conducted, they’re easily answered.
Right. What we’ve done, we’ve talked about different types of sampling,
why we randomize, how to prevent bad samples,
what makes a bad sample and problems in sampling that we can avoid if we do it correctly.
This is the end of Lecture 9, and we’ll see you back here for Lecture 10.