Image: “Data analysis” by Deedster. License: CC0 1.0
Introduction
Categorical data are data that can be classified into groups. They are also called qualitative data, or nominal data when the groups have no natural order. Gender, marital status, and income group are some examples of categorical data.
The figure below, for example, shows categorical variables: the sources of electricity generation. The data are classified into six groups:
- Natural gas
- Hydro-electric
- Biomass
- Nuclear
- Wind
- Other
Here, the source of electricity generation is the categorical variable, and each source is one of its categories.

Image: Pie chart showing sources of electricity generated in New York. by: Aflafla1. License: CC0 1.0
Different tools and techniques are used to analyze categorical data, depending on the role the categorical variables play in the analysis:
- Categorical variables stand alone.
- Only the dependent variable is categorical.
- Only the independent variable is categorical.
- Both the independent and dependent variables are categorical.
Tools and Techniques: Categorical Data on a Standalone Basis
The tools used to analyze categorical data include pie charts, bar charts, and 2×2 tables. Consider the following categorical data, population by income group:
- Low-income: 50
- Middle-income: 130
- High-income: 30
Then, consider the following data, sales of different kinds of fruit:
- Grapes: 385,000
- Apples: 874,585
- Bananas: 45,575
These categorical data can be analyzed using a graphical tool, the bar chart, as follows:
The bar chart shows which fruit has the highest sales and which has the lowest. Apples have the highest sales because they have the tallest bar, and bananas have the lowest sales because they have the shortest bar. The fruit seller should therefore procure more apples and fewer bananas.
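The comparison that the bar chart makes visually can also be sketched in a few lines of Python. This is only an illustration using the sales figures above; the text-based bars and their scale (one `#` per 25,000 units) are arbitrary choices, not part of the original analysis.

```python
# Fruit sales figures taken from the example above.
sales = {"Grapes": 385_000, "Apples": 874_585, "Bananas": 45_575}

total = sum(sales.values())
# Share of total sales for each category, as a fraction between 0 and 1.
shares = {fruit: count / total for fruit, count in sales.items()}

# Categories with the largest and smallest counts.
best_seller = max(sales, key=sales.get)
worst_seller = min(sales, key=sales.get)

# Crude text bar chart, sorted from highest to lowest sales.
for fruit, count in sorted(sales.items(), key=lambda kv: kv[1], reverse=True):
    bar = "#" * round(count / 25_000)
    print(f"{fruit:<8} {bar} {count:,} ({shares[fruit]:.1%})")

print("Highest:", best_seller, "| Lowest:", worst_seller)
```

The same dictionary could be passed to a plotting library to draw an actual bar chart.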
Moreover, consider the following categorical data:
There are 619 people in a class:
- 481 of them are male
- 138 of them are female
- 93 of them hire a home tutor
- 526 of them do not hire a home tutor
- 77 of them are male and hire a home tutor
- 16 of them are female and hire a home tutor
- 404 of them are male and do not hire a home tutor
- 122 of them are female and do not hire a home tutor
These categorical data can be summarized in a meaningful manner using a two-by-two table:
Gender | Hired home tutor: Yes | Hired home tutor: No | Total
Male | 77 | 404 | 481
Female | 16 | 122 | 138
Total | 93 | 526 | 619
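As a rough sketch of how such a table is assembled, the row, column, and grand totals can be derived from the four joint counts alone. The variable names below are illustrative assumptions.

```python
# Joint counts from the class example: (gender, hired a home tutor?) -> count.
joint = {
    ("Male", "Yes"): 77,
    ("Male", "No"): 404,
    ("Female", "Yes"): 16,
    ("Female", "No"): 122,
}
rows = ["Male", "Female"]
cols = ["Yes", "No"]

# Marginal totals are sums of the joint counts along each dimension.
row_totals = {r: sum(joint[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(joint[(r, c)] for r in rows) for c in cols}
grand_total = sum(joint.values())

# Print the two-by-two table with its margins.
print(f"{'':<8}{'Yes':>6}{'No':>6}{'Total':>7}")
for r in rows:
    print(f"{r:<8}{joint[(r, 'Yes')]:>6}{joint[(r, 'No')]:>6}{row_totals[r]:>7}")
print(f"{'Total':<8}{col_totals['Yes']:>6}{col_totals['No']:>6}{grand_total:>7}")
```

The computed margins should match the totals listed in the example (481, 138, 93, 526, and 619).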
Tools and Techniques: Categorical Dependent and Independent Variables
Chi-Square Test for Independence
One of the most common techniques used for analyzing the relationship between two categorical variables is the Chi-square test for independence. Chi is a Greek letter that looks like this: χ, so the test is sometimes referred to as the χ² test for independence.
Here is an example of a Chi-square test of independence using the two-by-two table from the previous example.
Gender | Hired home tutor: Yes | Hired home tutor: No | Total
Male | 77 | 404 | 481
Female | 16 | 122 | 138
Total | 93 | 526 | 619
Suppose we want to analyze the relationship between gender and the decision to hire a home tutor. We would conduct the Chi-square test of independence as follows to answer this question.
Step 1: Define the hypothesis
H0: Null Hypothesis: There is no relationship between the two categorical variables, ‘gender’ and ‘hired a home tutor,’ i.e., they are independent.
HA: Alternative Hypothesis: There is a relationship between the two categorical variables, ‘gender’ and ‘hired a home tutor,’ i.e., they are dependent.
Step 2: Compute the Chi-square (χ²) test statistic
Expected count = (Row total * Column total) / Grand total
Using the above formula, the expected counts for each cell are calculated as follows:
Expected counts
Gender | Hired home tutor: Yes | Hired home tutor: No | Total
Male | (93 * 481) / 619 = 72.3 | (526 * 481) / 619 = 408.7 | 481
Female | (93 * 138) / 619 = 20.7 | (526 * 138) / 619 = 117.3 | 138
Total | 93 | 526 | 619
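The expected-count formula can be applied to every cell at once. The following is a minimal sketch, assuming the observed counts are stored as a nested list with rows for gender and columns for the Yes/No answer.

```python
# Observed counts: rows are Male, Female; columns are Yes, No.
observed = [[77, 404],
            [16, 122]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total * column total) / grand total, for each cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(("Male", "Female"), expected):
    print(label, [round(x, 2) for x in row])
```

Note that each row and column of expected counts adds up to the same totals as the observed table.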
We already know the observed counts:
Gender | Hired home tutor: Yes | Hired home tutor: No | Total
Male | 77 | 404 | 481
Female | 16 | 122 | 138
Total | 93 | 526 | 619
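Thus, the χ² test statistic, which sums (Observed count − Expected count)² / Expected count over the four cells, is:
χ² = 0.310 (male, yes) + 0.055 (male, no) + 1.081 (female, yes) + 0.191 (female, no) = 1.637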
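The statistic adds up (observed − expected)² / expected over all four cells. A minimal Python sketch of that sum, using unrounded expected counts, could look like this:

```python
# Observed counts: rows are Male, Female; columns are Yes, No.
observed = [[77, 404],
            [16, 122]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count for this cell under independence.
        exp = row_totals[i] * col_totals[j] / grand_total
        # Each cell contributes (observed - expected)^2 / expected.
        chi_square += (obs - exp) ** 2 / exp

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_square:.3f}, DF = {dof}")
```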
Step 3: Find the p-value
The p-value for the chi-square test for independence is the probability of getting counts at least as far from the expected counts as those observed, assuming that the two variables are not related (which is what the null hypothesis states). The smaller the p-value, the more surprising it would be to get counts like ours if the null hypothesis were true.
Technically, the p-value is the probability, under the null hypothesis, of observing a χ² value at least as large as the one computed. Using statistical software, we find that the p-value for this test is 0.201.
Chi-Square Test: Yes, No
- Expected counts are printed below observed counts.
- Chi-square contributions are printed below expected counts.
Gender | Yes | No | Total
Male: observed count | 77 | 404 | 481
Male: expected count | 72.27 | 408.73 |
Male: chi-square contribution | 0.310 | 0.055 |
Female: observed count | 16 | 122 | 138
Female: expected count | 20.73 | 117.27 |
Female: chi-square contribution | 1.081 | 0.191 |
Total | 93 | 526 | 619
Chi-square = 1.637, DF = 1, P-Value = 0.201
The p-value is higher than 0.05, so we fail to reject the null hypothesis at the 5% significance level. This means the data do not provide sufficient evidence of a relationship between gender and hiring a home tutor; it does not prove that the two variables are independent.
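The reported p-value can be reproduced without specialized software. The sketch below relies on the fact that, for 1 degree of freedom only, the upper-tail probability of the chi-square distribution equals erfc(√(χ²/2)); it is not a general routine for other degrees of freedom.

```python
import math

chi_square = 1.637  # test statistic reported above (DF = 1)

# Valid for 1 degree of freedom only: P(X >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

alpha = 0.05  # 5% significance level
reject_null = p_value < alpha
print(f"p-value = {p_value:.3f}, reject H0 at the 5% level: {reject_null}")
```

If SciPy is available, `scipy.stats.chi2_contingency(observed, correction=False)` should give the same uncorrected result; by default that function applies a continuity correction to two-by-two tables, which would change the numbers slightly.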
Diagnostic Odds Ratio
This ratio is mostly used to assess the effectiveness of a medical diagnostic test.
The diagnostic odds ratio is best explained using an example. Consider the following two-by-two table:
Test outcome | Actual condition: Positive | Actual condition: Negative
Positive | 44 | 23
Negative | 6 | 96
That is, 44 patients had the disease and tested positive (true positives). Six patients had the disease but tested negative (false negatives). Twenty-three patients did not have the disease but tested positive (false positives). Finally, 96 patients did not have the disease and tested negative (true negatives).
We want to analyze the effectiveness of this diagnostic test with the diagnostic odds ratio test, as follows:
Step 1: Define the hypothesis
H0: Null Hypothesis: The diagnostic test is not effective/accurate.
HA: Alternative Hypothesis: The diagnostic test is effective/accurate.
Step 2: Compute the diagnostic odds ratio
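DOR = (True positives / False positives) / (False negatives / True negatives) = (44 / 23) / (6 / 96) = 30.6
Step 3: Make the conclusion
A DOR of 1 means the test does not discriminate between patients with and without the disease. The DOR of 30.6 is far greater than 1, which suggests that the diagnostic test is effective/accurate; a confidence interval for the DOR that excludes 1 would support this conclusion more formally.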
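The ratio can be checked with a short Python sketch. The confidence interval part is an addition that is not in the original example: it uses the common large-sample approximation on the log scale, with standard error √(1/TP + 1/FP + 1/FN + 1/TN), and should be read as a rough guide only.

```python
import math

# Counts from the two-by-two table above.
tp, fp = 44, 23  # test positive: disease present / disease absent
fn, tn = 6, 96   # test negative: disease present / disease absent

# Diagnostic odds ratio as written in the text.
dor = (tp / fp) / (fn / tn)
# Algebraically identical cross-product form.
dor_cross = (tp * tn) / (fp * fn)

# Approximate 95% confidence interval on the log scale (large-sample assumption).
se_log_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
z = 1.96
ci_low = math.exp(math.log(dor) - z * se_log_dor)
ci_high = math.exp(math.log(dor) + z * se_log_dor)

print(f"DOR = {dor:.1f}, approx. 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```

If the lower bound of the interval is above 1, the data are consistent with the test discriminating better than chance.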
Tools and Techniques: Only the Dependent Variable Is Categorical
OLS Regression
When ordinary least squares (OLS) regression is applied to a binary dependent variable (an approach known as the linear probability model), the fitted values estimate the probability of selecting an option. For example, suppose that the dependent variable is ‘loan default’; that is, a categorical variable that can take only two values: ‘Yes,’ if the person defaults, and ‘No,’ if the person does not default. The independent variables would be the person’s income level, job security, etc.
Now, suppose we ran an OLS regression on the data and found that the coefficient of the independent variable ‘income level’ is 0.45. This would imply that if the income level rises by 1 unit, the estimated probability of the person defaulting on the loan would increase by 0.45. A negative coefficient would imply a decrease instead.
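To make the idea concrete, here is a minimal sketch of an OLS fit with a 0/1 dependent variable and a single predictor. The six data points are hypothetical and invented purely for illustration, so the resulting slope is unrelated to the 0.45 in the text.

```python
# Hypothetical data (not from the article): income level and default (1 = Yes, 0 = No).
income = [1, 2, 3, 4, 5, 6]
default = [1, 1, 0, 1, 0, 0]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(default) / n

# OLS slope and intercept for one predictor: slope = Sxy / Sxx.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, default))
s_xx = sum((x - mean_x) ** 2 for x in income)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predicted_probability(x):
    """Fitted value of the linear probability model at income level x."""
    return intercept + slope * x


print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
for x in (1, 6, 7):
    print(f"income {x}: estimated P(default) = {predicted_probability(x):.2f}")
```

With these made-up numbers, the fitted value at an income level of 7 is negative, which illustrates a known weakness of this approach: estimated probabilities can fall outside the range 0 to 1.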
Logistic Regression
Logistic regression models the log of the odds of selecting an option. This is explained using the same example that was used for OLS regression.
Now, suppose we ran a logistic regression on the data and found that the coefficient of the independent variable ‘income level’ is 0.30. This would imply that if the income level rises by 1 unit, the log of the odds of the person defaulting on the loan would increase by 0.30; equivalently, the odds of default would be multiplied by e^0.30, or about 1.35.
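The arithmetic behind this interpretation can be sketched as follows. The coefficient 0.30 comes from the example; the baseline log odds of −1.0 is a hypothetical value chosen only to show how a change in log odds translates into probabilities.

```python
import math

beta_income = 0.30  # coefficient from the example above

# A 1-unit rise in income multiplies the odds of default by exp(beta).
odds_multiplier = math.exp(beta_income)


def logistic(log_odds):
    """Convert log odds into a probability."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical baseline log odds of default (an assumption for illustration).
baseline_log_odds = -1.0
p_before = logistic(baseline_log_odds)
p_after = logistic(baseline_log_odds + beta_income)

print(f"odds multiplier = {odds_multiplier:.3f}")
print(f"P(default) before = {p_before:.3f}, after = {p_after:.3f}")
```

Unlike the OLS case, the change in probability depends on the starting point, while the multiplicative change in odds stays constant.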
The logistic distribution is shown as follows:

Image: “Log-logistic distribution function.” by Qwfp. License: CC BY-SA 3.0
The logistic model assumes that the data are case-specific; that is, each independent variable has a single value for each case. The logistic model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case.
As with other types of regression, there is no need for the independent variables to be statistically independent of each other (unlike, for example, in a naive Bayes classifier). However, collinearity is assumed to be relatively low because it becomes difficult to separate the effects of different variables otherwise.
Modeling choices with the logistic model relies on the assumption of independence of irrelevant alternatives (IIA). This assumption, which is not always desirable, states that the odds of preferring one class over another do not depend on the presence or absence of other “irrelevant” alternatives.
For instance, the relative probabilities of taking a car or a bus to work do not change if a bicycle is added as an extra possibility. This allows the choice among K alternatives to be modeled as a set of K-1 independent binary choices, in which one alternative is chosen as a “pivot” and the other K-1 are compared against it, one at a time.
The IIA hypothesis is a core hypothesis in rational choice theory; however, numerous studies in psychology show that people often violate this assumption when making choices.
A classic problem case is a choice between a car and a blue bus to which a red bus is then added: a traveler who is indifferent between the two buses will split the bus share between them, so the odds of choosing the car over the blue bus change. Here the red bus option was not in fact irrelevant because the red bus was a perfect substitute for the blue bus. If the logistic model is used to model choices, it may, in some situations, impose too much constraint on the relative preferences between the different alternatives.
This point is especially relevant if the analyst wants to predict how choices would change if one alternative were to disappear (for example, if one political candidate drops out of a three-candidate race). Other variants of the logistic model may be used in such cases because they allow for violations of IIA.
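The red bus and blue bus problem can be reproduced numerically. The sketch below assumes a multinomial logit in which choice probabilities are proportional to the exponential of each option's utility, and it assigns every option a hypothetical utility of 0; both the function name and the utilities are illustrative assumptions.

```python
import math


def choice_probabilities(utilities):
    """Multinomial logit (softmax) probabilities from a dict of utilities."""
    weights = {option: math.exp(u) for option, u in utilities.items()}
    total = sum(weights.values())
    return {option: w / total for option, w in weights.items()}


# Hypothetical utilities: all options are equally attractive.
before = choice_probabilities({"car": 0.0, "blue bus": 0.0})
after = choice_probabilities({"car": 0.0, "blue bus": 0.0, "red bus": 0.0})

# Under IIA the car-to-blue-bus odds are unchanged by adding the red bus.
ratio_before = before["car"] / before["blue bus"]
ratio_after = after["car"] / after["blue bus"]

print("Before:", {k: round(v, 3) for k, v in before.items()})
print("After: ", {k: round(v, 3) for k, v in after.items()})
print("car : blue bus odds before and after:", ratio_before, ratio_after)
```

The model keeps the car and blue bus odds equal and so drops the car's share from one half to one third, whereas a traveler who sees the two buses as perfect substitutes would keep choosing the car half of the time.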
Probit Regression
Probit regression is similar to logistic regression; the difference lies in the distribution each method uses. The logistic method uses the logistic distribution, while the probit method uses the standard normal (z) distribution, as shown in the figure below.

Image: “Normal distribution pdf” by D.328. License: CC BY-SA 3.0
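To compare the two distributions, the following sketch evaluates the logistic and standard normal cumulative distribution functions at a few points. These functions map a linear predictor to a probability in logistic and probit regression respectively; the function names are illustrative.

```python
import math


def logistic_cdf(x):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1 / (1 + math.exp(-x))


def normal_cdf(x):
    """Cumulative distribution function of the standard normal (z) distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


print(f"{'x':>4} {'logistic':>10} {'normal':>10}")
for x in (-2, -1, 0, 1, 2):
    print(f"{x:>4} {logistic_cdf(x):>10.4f} {normal_cdf(x):>10.4f}")
```

Both curves are S-shaped and equal 0.5 at zero, but the logistic distribution has heavier tails, so its curve approaches 0 and 1 more slowly.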
Tools and Techniques: Only the Independent Variable Is Categorical
Simple Regression
Suppose the independent variable of a particular study is gender (a categorical variable), and the dependent variable is earnings (a continuous variable). We could run a simple regression to analyze the impact of gender on earnings.
Suppose that the categorical variable ‘gender’ is coded as 1 if the person is male and 0 if the person is female. If the coefficient turns out to be 63.48, it would imply that the earnings of a male are, on average, 63.48 units higher than those of a female.
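With a single 0/1 predictor, the regression coefficient is simply the difference between the two group means, and the intercept is the mean of the group coded 0. The sketch below checks this on hypothetical earnings figures that are invented for illustration and are not the data behind the 63.48 in the text.

```python
# Hypothetical data (not from the article): gender coded 1 = male, 0 = female.
gender = [1, 1, 1, 1, 0, 0, 0, 0]
earnings = [520, 480, 510, 490, 450, 430, 440, 420]

n = len(gender)
mean_x = sum(gender) / n
mean_y = sum(earnings) / n

# Simple regression: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gender, earnings))
s_xx = sum((x - mean_x) ** 2 for x in gender)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Group means for comparison.
male = [y for x, y in zip(gender, earnings) if x == 1]
female = [y for x, y in zip(gender, earnings) if x == 0]
mean_male = sum(male) / len(male)
mean_female = sum(female) / len(female)

print(f"slope = {slope:.2f}, difference in means = {mean_male - mean_female:.2f}")
print(f"intercept = {intercept:.2f}, female mean = {mean_female:.2f}")
```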