
Categorical Data Analysis

by David Spade, PhD

    Transcript

    00:01 Welcome back for lecture 9. What we're going to discuss is categorical data analysis.

    00:06 Let's begin with an example to motivate what we're going to do in this lecture. Is your zodiac sign a useful predictor of how successful you will be later in life? Fortune magazine collected the zodiac signs of the heads of 256 of the largest 400 companies. Below is a table of the number of births under each sign.

    00:24 The question is, are the variations in the number of births per sign just due to chance, or are successful people more likely to be born under some signs than others? So now we need to analyze the data in the table, and first we need to figure out what won't work.

    00:39 Well, the one-sample z-test for proportions will not work here because there are twelve proportions we need to be concerned about - one for each zodiac sign.

    00:48 Similarly, the two-sample z-procedures that we talked about won't work here. So then the question becomes, what will work? Well, we're gonna do what's called a Goodness-of-Fit test, and let's see how we do it. We're aiming to determine whether successful people are more likely than others to be born under certain signs.

    01:07 We would initially hypothesize that the proportions of successful people born under each sign are equal.

    01:14 So let's let p_i be the true proportion of successful people born under sign i, where i goes from 1 to 12. Our null hypothesis will be that all 12 of these proportions are equal, and our alternative will be that at least one proportion differs from the others.

    01:33 Just like we have with any other test, if we wanna do the Goodness-of-Fit test, we have to have some conditions satisfied.

    01:40 First and foremost, we need to make sure that our data are counted data.

    01:44 So the values in each cell of the table must be counts of the number of observations in the category corresponding to that cell.

    01:52 Randomization.

    01:54 The individuals who have been counted are a random sample from a population of interest.

    01:59 Thirdly, the 10% condition.

    02:02 The sample again has to be less than 10 percent of the population. Fourth, the expected cell frequency condition.

    02:09 We should expect to see at least five individuals in each cell.

    02:14 So let's look at the expected cell frequency condition.

    02:17 For each cell, the expected count is n times p_0i, where n is the sample size and p_0i is the hypothesized proportion in category i. In this example, n is 256, the number of people that we sampled.

    02:34 We're hypothesizing that each proportion is equal.

    02:37 So for each one, p_0i is 1 over 12.

    02:42 Therefore, the expected count in each cell is 256 times 1 over 12, or 21 and one-third.

    02:49 So the expected cell frequency condition is easily satisfied here.
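As a quick sketch, the expected-count arithmetic above can be checked in a few lines of Python; the values n = 256 and k = 12 come straight from the example.

```python
# Expected count per cell under the null hypothesis of equal proportions.
n = 256             # executives sampled, from the lecture's example
k = 12              # zodiac signs, so each hypothesized proportion p_0i = 1/k
expected = n * (1 / k)

print(expected)     # 21.333..., i.e. 21 and one-third per sign
print(expected >= 5)  # True: the expected cell frequency condition is satisfied
```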

    02:54 Do we have counted data? Yes.

    02:56 The data in our astrology example are counts of people born under each sign.

    03:01 Do we have randomization? We don't have a random sample here, but births should be randomly distributed throughout the year, so we can assume independence.

    03:11 10% condition.

    03:12 This is definitely not satisfied, but it's still reasonable to think that the births are independent, so we're not too worried about this here.

    03:20 Have we verified the expected cell frequency condition? We already have.

    03:25 So what are the mechanics of the Goodness-of-Fit test? Our test statistic is given by this character, which we call chi-square: the sum over all cells of the observed minus the expected frequency, squared, divided by the expected frequency.

    03:39 Under the null hypothesis, our test statistic has what is called a chi-square distribution with k minus 1 degrees of freedom, where k is the number of cells in the table. For a level alpha test of our null hypothesis, we reject the null hypothesis for large values of the test statistic; in other words, we reject when the expected frequencies under the null hypothesis are far away from the observed frequencies.

    04:05 Formally, we reject our null hypothesis when our test statistic takes a value larger than the critical value for a chi-square k minus 1 distribution. These critical values can be found in the table that I'm gonna show you on the next slide.

    04:19 We match up the degrees of freedom, and the significance level to find the critical value, just like we did with the t-table.

    04:25 p-values are hard to find by hand and require the use of software, so we're gonna restrict ourselves here to just finding critical values for this test.

    04:34 So here's the table. Down the left margin of the table are degrees of freedom, and across the top are significance levels.

    04:40 So we match up the degrees of freedom with the significance level, and the cell corresponding to that match gives you the critical chi-squared value. So let's try one.

    04:50 Let's use the chi-squared table to find the critical value for a level alpha equals .05 test in our example.

    04:57 We know that since there are 12 cells, 12 counts, we have 11 degrees of freedom.

    05:02 So we find in the table, we match up 11 degrees of freedom with .05 from the top, and we get a critical value of 19.6751. Therefore, we're gonna reject our null hypothesis if our chi-squared statistic is larger than 19.6751.

    05:21 So let's try it.

    05:23 The test statistic is the chi-squared statistic, which is the sum of the observed minus the expected frequency, squared, divided by the expected frequency. When we carry out that calculation, we get a test statistic value of 5.09. This is smaller than the critical value of 19.6751, so we do not reject the null hypothesis.

    05:43 What does this tell us? Well, what it tells us is that there is no evidence at the 5% level that a person's zodiac sign is a predictor of their success later in life.
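The whole Goodness-of-Fit calculation can be sketched in plain Python. The per-sign counts below are illustrative, since the slide's table isn't reproduced in the transcript; they sum to the lecture's n = 256 and reproduce its reported statistic of 5.09, but treat them as an assumption.

```python
# Chi-square Goodness-of-Fit test sketch for the zodiac example.
# Observed counts are assumed/illustrative: they total 256 and give chi2 of about 5.09.
observed = [23, 20, 18, 23, 20, 19, 18, 21, 19, 22, 24, 29]

n = sum(observed)        # 256
k = len(observed)        # 12 cells
expected = n / k         # 21.333... under H0: all 12 proportions equal

# Sum over all cells of (observed - expected)^2 / expected.
chi2 = sum((o - expected) ** 2 / expected for o in observed)

critical = 19.6751       # from the chi-square table, df = k - 1 = 11, alpha = 0.05

print(round(chi2, 2))    # 5.09
print(chi2 > critical)   # False, so we fail to reject the null hypothesis
```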

    05:54 So that's the chi-square test for Goodness-of-Fit.

    05:57 Sometimes, we wanna compare several distributions and see if they're the same.

    06:01 For example, many universities survey their graduating classes to determine their plans after graduation.

    06:08 What we see here is a two-way table for a class of graduates from several colleges at the university. Each cell shows the number of graduates that made a particular choice.

    06:18 So down the left side are the plans, and across the top are the colleges and the school. So the hypotheses are as follows: the null hypothesis is that the students' post-graduation plans are distributed the same way for all four colleges.

    06:36 The alternative hypothesis is that the students' plans do not have the same distribution in each college.

    06:42 And mathematically, it's kind of a pain to write out, so we're gonna leave these hypotheses as they are for now. Just like before, we have certain conditions that need to be satisfied.

    06:53 First of all, we need to make sure that we have counted data. All of our cells represent counts of graduates in a category.

    07:00 We need independence or randomization. Again, this is not a random sample, but it can be reasonably assumed that the students' plans are largely independent of each other, so we're good there.

    07:10 Third, the expected cell frequency condition.

    07:14 We again need the expected frequency in each cell to be at least five. So let's see how we compute expected frequencies.

    07:23 For the agriculture students who were employed: overall, 685, or about 47%, of the 1456 students were employed. If the distributions were all the same, then 47% of the 448 agriculture graduates, or 210.769, would be employed.

    07:43 Similarly, 47% of the 374 engineering graduates, or 175.955, would be employed. So what we've done, to get the expected frequency in cell i j, is take the total in row i times the total in column j and divide by the total number observed. Here, for the agriculture-and-employed cell, that was 685 times 448 divided by the total of 1456, which gave us the 210.769. We can do that same thing for every other cell in the table. So here's the table of the expected frequencies.

    08:21 All of these are much larger than five so we're okay on the expected cell frequency condition.
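The row-total times column-total over grand-total rule can be sketched as a small helper. The margins used here, 685 employed, 448 agriculture graduates, 374 engineering graduates, and 1456 total, are the ones quoted in the lecture.

```python
# Expected cell frequency for a two-way table: (row total * column total) / n.
def expected_frequency(row_total, col_total, grand_total):
    return row_total * col_total / grand_total

# Margins quoted in the lecture's graduation-plans example.
employed, agriculture, engineering, n = 685, 448, 374, 1456

print(round(expected_frequency(employed, agriculture, n), 3))  # 210.769
print(round(expected_frequency(employed, engineering, n), 3))  # 175.955
```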

    08:27 The test statistic is calculated exactly the same way as before.

    08:31 It's the sum of the observed minus the expected frequency squared, divided by the expected frequency.

    08:38 Under the null hypothesis, if the table has r rows and c columns, then the test statistic has a chi-square distribution with r minus 1 times c minus 1 degrees of freedom.

    08:49 So in this example, we have three rows and four columns so we have 3 minus 1 times 4 minus 1 or six degrees of freedom.

    08:59 If we're testing at the alpha equals .05 level of significance, then we're gonna reject H0 if our chi-square statistic is larger than 12.5916. We get this from the table by matching up 6 degrees of freedom with the 5% significance level at the top. When we calculate the test statistic, plugging into that long formula that we've seen before, we get a chi-squared statistic of 93.66. So we reject the null hypothesis, and we conclude that there is evidence of a difference in the distributions of post-graduation plans among the graduates.

    09:32 Again, as a side note, looking at the table of expected frequencies shows that the expected cell frequency condition is satisfied, so we're good on that.

    09:42 Finally, we want to look at the chi-square test of independence.

    09:45 One question that we often ask is, is one variable related to changes in the other? So let's look at a study that involves hepatitis C and tattoo status.

    09:56 A study examines 626 people being treated for non-blood related diseases to see whether the risk of hepatitis C was related to whether people had tattoos and where they got them from.

    10:07 The data are summarized in this table that you see below.

    10:10 We have the tattoo status down the side: whether they got a tattoo in a parlor, got one elsewhere, or don't have one,

    10:15 and the hepatitis or no hepatitis across the top.

    10:20 So the natural question then is, is the chance of having hepatitis C independent of where they got the tattoo or whether they have one? This would mean that the distribution of hepatitis C is equal to the conditional distribution of hepatitis C given the tattoo status, for all tattoo statuses.

    10:38 So this is very much like the test for homogeneity, where we're testing for equality of distributions, and the mechanics are exactly the same.

    10:46 The only difference is that in the test for homogeneity, we look at two or more different populations, but here the categorical variables are measured on only one population. So how do we do it? Well, the first thing, just like with any other hypothesis test, is to set up our hypotheses.

    11:04 Our null hypothesis in this example will be that tattoo status and hepatitis C are independent.

    11:10 Our alternative then is that they're not independent.

    11:14 So again we have the conditions.

    11:16 Counted data? Yes, the data represented in the table are counts.

    11:21 Independence.

    11:22 The people in the study are likely to be independent of each other so we're good there.

    11:26 Randomization.

    11:28 The data are from a retrospective study of patients being treated for something other than hepatitis. They're not a random sample, but they're likely to be representative of the population, so the randomization condition is still okay.

    11:39 The 10% condition? Well, 626 people is fewer than 10% of all people with tattoos or with hepatitis C, so the 10% condition is fine. Now let's look at the expected cell frequencies.

    11:52 The calculation of expected cell frequencies is exactly the same as before, for cell i j, the expected frequency in that cell is equal to the number in row i times the number in column j divided by the total number observed.

    12:05 All of these need to be at least 5. So here's the table of the expected frequencies, and what we see is that a couple of cells, the parlor tattoo with hepatitis C and the elsewhere tattoo with hepatitis C, are both less than 5. So we don't quite have the expected cell count condition satisfied.

    12:22 Just so we can get a feel for the mechanics of the test, let's use this data and carry it out anyway.

    12:28 Under the null hypothesis, the test statistic has a chi-square distribution with r minus 1 times c minus 1 degrees of freedom, where again r is the number of rows and c is the number of columns. In our example, we have three rows and two columns, so the test statistic has a chi-square distribution with 3 minus 1 times 2 minus 1, or 2, degrees of freedom.

    12:50 Therefore, if we're looking to carry out the test at the alpha equals 0.05 level, then we reject the null hypothesis if our chi-squared statistic takes a value at least as large as 5.9915, and again, we find this in the table.

    13:05 What we observed is our chi-squared statistic is equal to 57.91.

    13:11 So we reject the null hypothesis and we conclude that there is evidence at the 5% level that hepatitis C and tattoo status are not independent.
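The full independence test can be sketched end to end in plain Python. The cell counts below are assumed, since the transcript doesn't reproduce the slide's table; they are consistent with the lecture's total of 626 people, its two expected cells below 5, and its reported statistic of 57.91, but treat them as illustrative.

```python
# Chi-square test of independence sketch for the tattoo / hepatitis C example.
# Rows: tattoo from parlor, tattoo from elsewhere, no tattoo.
# Columns: hepatitis C, no hepatitis C.
# Counts are assumed/illustrative: they total 626 and give chi2 of about 57.91.
table = [[17, 35],
         [8, 53],
         [22, 491]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)                                   # 626

# Expected frequency in cell (i, j): row i total * column j total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
print([round(e, 2) for e in (expected[0][0], expected[1][0])])  # the two cells below 5

chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(len(table)) for j in range(len(table[0])))

df = (len(table) - 1) * (len(table[0]) - 1)           # (3 - 1)(2 - 1) = 2
critical = 5.9915                                     # chi-square table, df = 2, alpha = 0.05

print(round(chi2, 2))                                 # 57.91
print(chi2 > critical)                                # True, so reject H0
```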

    13:22 In categorical data analysis, there are a bunch of things that can go wrong so let's look at some of the things that we want to avoid.

    13:28 First of all, do not use the chi-square methods unless your data are counts.

    13:33 We wanna beware of large samples.

    13:35 Your degrees of freedom do not increase with sample size here, so it seems strange to say that we want to beware of large samples, but we do, because our degrees of freedom are related to the number of categories and the number of cells, not the total sample size. So we wanna be careful with that.

    13:54 Finally, do not say that one variable depends on the other just because they're not independent. This statement implies causation, which we can't always infer.

    14:03 So what have we done in this lecture? Well we have talked about three tests for categorical data.

    14:08 We talked about the test for a particular distribution, which is the Goodness-of-Fit test.

    14:14 We talked about the chi-square test for homogeneity where we looked to see if a group of distributions are equal to each other, and we looked at the chi-square test of independence to see whether or not two variables are independent within one population.

    14:28 We finished up by looking at some of the pitfalls of categorical data analysis and the things that we want to avoid when trying to do it.

    14:35 This is the end of lecture 9 and I look forward to seeing you back again for lecture 10.


    About the Lecture

    The lecture Categorical Data Analysis by David Spade, PhD is from the course Statistics Part 2. It contains the following chapters:

    • Categorical Data Analysis
    • The Chi-Square Table
    • Astrology Example
    • Natural Question
    • Pitfalls to Avoid


    Author of lecture Categorical Data Analysis

     David Spade, PhD


