Hello. I’m David Spade.
Now, I would like to welcome you to Statistics 1: Linear Regression and Probability.
We’re going to begin this course with an introduction to statistics,
which is basically just definitions of terms that are commonly used in statistics.
First of all, what is Statistics?
Well, statistics is, simply stated, the science of collecting data and analyzing the variation in those data.
When we talk about data, what we mean is information pertaining to a group of individuals.
Individuals can be humans, plants, inanimate objects, websites, all kinds of different things.
We have several different types of data.
Data can be numerical which means they have a numerical quantity as their measure.
Data can be categorical which means that the measurement on the individual
represents what category that individual falls into.
Regardless of the type of data, it needs to be presented in some sort of context.
In statistics, we’re typically interested in answering 5 questions,
the 5 Ws: Who, What, When, Where, and Why.
We’re gonna focus on the who and what.
The who are the individuals on whom the data are collected.
Individuals who respond to a survey are called respondents,
people on whom experiments are conducted are known as subjects,
and animals, plants, websites, inanimate objects, or any other things
on which data are collected are known as experimental units.
These are all whos. These all answer the question of who?
The what simply refers to what is it that we’re measuring.
The what corresponds to the characteristics that we record about each individual.
These characteristics that we record are known as variables.
Often, the individuals and the corresponding variables are a sample from a larger population.
We are most concerned with finding answers to the who and what questions,
so that’s what we’re gonna focus on.
Let’s look at Populations and Samples.
We mentioned them briefly during the what section.
A group of individuals about which we wanna make a general statement is known as a population.
It is typically impractical to look at the whole population in order to do that, so we take a smaller group.
A group of individuals that are selected from a population is known as a sample.
The sample is then used to make generalizations about the population.
For example, suppose we collect a sample of a hundred students from a particular university
and we find their grade point average and we look up the high school they attended,
in an effort to determine how the high school that they attended affects their performance in college.
The questions we wanna answer are: What are the who and the what of the study?
What are the population and sample here?
The who. The who are the individuals that we sample,
the individuals on whom we are collecting information.
In this study, the 100 students are the individuals on which the data were collected.
The data are the GPA and the high school.
The what. We’ve briefly mentioned that already.
We’ve mentioned the GPA and the high school the student attended, these are the what.
What’s the population? Well, who are we trying to make general statements about?
The students at this university. That’s our population, all the students at this university.
The sample. The sample is the hundred students that we chose.
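To make the population and sample idea concrete, here is a minimal Python sketch of drawing a sample of a hundred students. The population size of 5,000 and the use of ID numbers to stand for students are assumptions made up for illustration; the lecture does not say how large the university is or how the sample was chosen.

```python
import random

# Hypothetical population: ID numbers standing in for every student at the
# university. The size 5000 is an assumption, not a figure from the lecture.
population = list(range(1, 5001))

# A seeded generator so the sketch gives the same sample every time it runs.
rng = random.Random(0)

# The sample: 100 distinct individuals selected from the population.
sample = rng.sample(population, 100)

print(len(sample), "students sampled out of", len(population))
```

Every individual in the sample comes from the population, and any general statement we make is about the population, not just these 100 students.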
We have several different types of variables.
One is an identifier variable and this is a variable for which each individual receives a unique value.
They can be useful but not so much for analysis. Examples include student IDs and social security numbers.
Categorical variables are variables that tell the group or category to which an individual belongs.
Gender, political party, and age group are all categorical variables.
A quantitative variable is a variable whose values are numbers with measurement units
or which otherwise have a meaningful numerical value.
Speed, age, and weight are examples of quantitative variables.
All right, so back to the GPA and high school example.
Suppose that for a group of university students,
we record their GPAs as well as the high school that they attended.
What are the variables in this study and which type of variable is each one?
Well, the variables, high school and GPA. Those are the things that we’re measuring.
Grade point average, this is a quantitative variable
because the value of the GPA has a meaningful numerical quantity.
The high school is a categorical variable.
It classifies the student based on which high school they attended.
This is a categorical variable because it’s putting students into categories.
Now, how do we summarize categorical data? Well, let’s look at how we do it graphically and tabularly.
We can do it by way of a frequency table which is simply a listing of each category
and how many times it occurs in the sample.
The bar chart shows counts for each category next to each other
and the count is represented by the height of the bar.
The bar chart can also be used to represent the relative frequency of each category.
The relative frequency is the proportion, often stated as a percentage, of the time that the category occurs in the sample.
Again, the relative frequency is the proportion of the time the category occurs in the sample.
We can also use a pie chart which shows the relative frequency of each category
as a proportional slice of a circle.
Before using any of these displays,
we need to be careful and make sure that the variable that we're representing is, in fact, categorical.
If it’s not then these displays are meaningless.
Let’s talk about some of these displays. Let’s look at our university example.
We have 14 from high school A, 16 from high school B, 20 from high school C, 20 from high school D,
13 from high school E, and 17 from high school F.
What we’re gonna do is we're gonna make a frequency table, a frequency bar chart,
a relative frequency bar chart, and a pie chart to describe the distribution of the high schools among our sample.
Here’s the frequency table. Across the top you see all the categories High School A, B, C, D, E, and F.
We have the frequencies 14, 16, 20, 20, 13, and 17
and the corresponding relative frequencies 0.14, 0.16, 0.2, 0.2, 0.13, and 0.17.
What we see is we basically made a frequency table and a relative frequency table at the same time.
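The table just described can be rebuilt with a short Python sketch. The counts are the ones from the example; the variable names `counts`, `n`, and `rel_freq` are simply chosen for illustration.

```python
# Counts per high school, taken from the example sample of 100 students.
counts = {"A": 14, "B": 16, "C": 20, "D": 20, "E": 13, "F": 17}

# Total sample size.
n = sum(counts.values())

# Relative frequency = count divided by the sample size.
rel_freq = {school: count / n for school, count in counts.items()}

# Print the frequency table and relative frequency table side by side.
print(f"{'High school':<12}{'Frequency':>10}{'Rel. freq.':>12}")
for school, count in counts.items():
    print(f"{school:<12}{count:>10}{rel_freq[school]:>12.2f}")
```

Note that the frequencies add up to the sample size and the relative frequencies add up to 1.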
Now, what about the bar chart?
Here’s the frequency bar chart with high school 1, 2, 3, 4, 5, and 6 as the labels.
One corresponds to A, two corresponds to B, and so on up to six, which corresponds to F.
Note that the height of each bar corresponds to the number of times that each high school appears in the sample.
The height of the first bar goes up to 14 because that’s how many students came from high school A.
The height of the second bar goes up to 16 because that’s how many students came from High School B, and so on.
We can also look at the relative frequency bar chart
which looks exactly the same as the frequency bar chart except that, if you look at the y-axis,
the relative frequencies now run along the side instead of the actual frequencies.
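Here is a rough Python sketch of why the two bar charts have the same shape. It draws the bars as text, lying on their sides, so the length of each bar plays the role of the height. The helper `bar_chart` and its `scale` parameter are made up for this illustration and are not from any plotting library.

```python
# Counts per high school, taken from the example sample.
counts = {"A": 14, "B": 16, "C": 20, "D": 20, "E": 13, "F": 17}
n = sum(counts.values())

def bar_chart(values, scale=1.0):
    """Return one text line per category; bar length is proportional to the value."""
    lines = []
    for label, value in values.items():
        bar = "#" * round(value * scale)
        lines.append(f"{label} | {bar} {value}")
    return lines

# Frequency bar chart: one '#' per student.
freq_lines = bar_chart(counts)

# Relative frequency bar chart: same bars, only the axis values change.
rel_freq = {school: count / n for school, count in counts.items()}
rel_lines = bar_chart(rel_freq, scale=n)

print("\n".join(freq_lines))
print()
print("\n".join(rel_lines))
```

The bars are identical in both charts; only the numbers printed beside them differ.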
Here’s the pie chart of the high schools in our sample
and what we see is that each high school corresponds to a different color
and the size of the slice corresponding to each high school
is representative of the relative frequency of that high school in our sample.
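The size of each slice can be worked out directly: a category's share of the 360 degrees in the circle equals its relative frequency. This is a small Python sketch of that arithmetic, with the name `angles` chosen for illustration.

```python
# Counts per high school, taken from the example sample.
counts = {"A": 14, "B": 16, "C": 20, "D": 20, "E": 13, "F": 17}
n = sum(counts.values())

# Each slice's angle is the relative frequency times the full circle.
angles = {school: 360 * count / n for school, count in counts.items()}

for school, angle in angles.items():
    print(f"High school {school}: {angle:.1f} degrees")
```

High schools C and D each take up a fifth of the circle, which matches their relative frequency of 0.2.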
Those are some examples of how we can summarize categorical data graphically.
There are some things that can make a chart bad.
The biggest problem we have in categorical data displays is violation of what’s known as the area principle
and what the area principle says is that the area occupied by a part of the graph
should correspond to the magnitude of the value it represents. Let’s look at some examples of bad graphs.
If we have bar charts with mismatched bar widths, then this is bad because it violates the area principle.
Say high school A and high school B in our example had the same number in them.
Say each had 14 students. If the bar for high school A is wider than the bar for high school B,
it’s going to appear that high school A makes up more of the sample than high school B
because the areas of the bars aren’t the same.
We don’t want that. We want all of our bars to have the same width.
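To put numbers on that, here is a small Python sketch comparing bar areas. The widths of 2.0 and 1.0 are made up for illustration; the count of 14 for both schools is the hypothetical case just described.

```python
def bar_area(height, width):
    """Area of a rectangular bar."""
    return height * width

# Both schools have 14 students, but bar A is drawn twice as wide (assumed widths).
area_a = bar_area(14, 2.0)
area_b = bar_area(14, 1.0)
print(area_a / area_b)  # bar A covers twice the area despite equal counts

# With equal widths, the ratio of areas equals the ratio of counts.
equal_a = bar_area(14, 1.0)
equal_b = bar_area(14, 1.0)
print(equal_a / equal_b)
```

When the widths match, the area of each bar is proportional to its count, which is what the area principle asks for.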
Three-dimensional pie charts are inherently bad because when you make a pie chart,
you usually divide things up by the area of the circle but once you get into three dimensions,
you have to worry about volume and people don’t typically account for the volume.
Don’t use three-dimensional pie charts as a way to represent categorical data.
Picture graphs are often the worst violators of the area principle
because the size of the picture, the area of the picture, is often not representative
of the relative frequency of the category’s appearance in the sample.
You want to stay away from those as well.
We have these variables that are kinda hybrids of quantitative and categorical variables.
One really common example is Likert scale data.
As an example, the question might be how often do you exercise
and the possible responses might be 1 for never, 2 for seldom,
3 for sometimes, 4 for often, and 5 for every day.
The responses do constitute categories but the numbers have a particular order to them.
It turns out that the order of the numbers does correspond to the order of the amount of exercise.
They do have a numerical meaning.
These responses, while categorical, do also have some properties of quantitative variables
and it’s important to be aware of these types of things when we look at what type of data we’re analyzing.
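Here is a short Python sketch of that hybrid behavior. The scale labels come from the exercise question; the list of seven responses is invented purely for illustration.

```python
from collections import Counter
from statistics import median_low

# The response scale from the exercise question.
scale = {1: "never", 2: "seldom", 3: "sometimes", 4: "often", 5: "every day"}

# Hypothetical responses from seven people (made up for this sketch).
responses = [3, 5, 1, 4, 4, 2, 3]

# Categorical side: we can count how often each category occurs.
category_counts = Counter(scale[r] for r in responses)

# Ordered side: sorting the codes sorts people by how much they exercise,
# so a middle response is meaningful.
middle = scale[median_low(responses)]

print(category_counts)
print("Middle response:", middle)
```

Counting treats the responses as categories, while taking a middle response relies on the order of the numbers.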
All right, so in summarizing data, there are several mistakes that we can make
and I’m gonna give you a few of the common ones right here.
We need to think about what question we wanna answer
when we decide whether to label a variable as a categorical or a quantitative variable.
We need to always be skeptical. We wanna be careful to think about how the data were collected.
As we move on in the course, we’ll learn more about what makes a good sample and what makes a bad sample.
Investigators will sometimes contrive data in such a way that it produces a desirable result.
While unethical, it sometimes happens and we need to be careful and look out for that.
We need to be careful to recognize that, even though a variable’s values might be numbers,
that doesn’t necessarily mean it’s a quantitative variable
because it doesn’t mean that the numbers have any real meaning to them.
They might just be codings.
Categorical variables are often coded with numbers but they don’t necessarily have a meaning.
For example, let’s say we code males as 1, females as 2.
These categories would mean the same thing if they were coded 1 for female and 2 for male,
so gender is a categorical variable.
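A quick Python sketch shows why arithmetic on such codes is meaningless. The five-person list is invented for illustration; the two codings are the ones just described.

```python
from collections import Counter
from statistics import mean

# Hypothetical group of five people (made up for this sketch).
people = ["male", "female", "female", "male", "female"]

# Two equally valid codings of the same categories.
coding_1 = {"male": 1, "female": 2}
coding_2 = {"female": 1, "male": 2}

codes_1 = [coding_1[p] for p in people]
codes_2 = [coding_2[p] for p in people]

# The "average code" changes when we relabel, so it carries no real meaning.
print(mean(codes_1), mean(codes_2))

# The category counts are the same under either coding.
decode_1 = {code: label for label, code in coding_1.items()}
decode_2 = {code: label for label, code in coding_2.items()}
counts_1 = Counter(decode_1[c] for c in codes_1)
counts_2 = Counter(decode_2[c] for c in codes_2)
print(counts_1 == counts_2)
```

The counts survive the relabeling but the mean does not, which is the sign that the numbers are only labels.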
We’ve listed the things that can go wrong.
These are the things we need to avoid when we’re summarizing categorical data.
We’ve talked about some basic terms in statistics, things that are commonly talked about.
We’ve talked about how to display and summarize categorical data.
This is the end of Lecture 1 and we’ll see you back here for Lecture 2.