## Introduction

Hypothesis testing is used to assess the plausibility of a hypothesis by analyzing study data.

For example, a company creates a new Drug X that is intended to treat hypertension. The company wants to know whether Drug X does in fact work to lower BP, so they need to do hypothesis testing.

**Steps for testing a hypothesis:**

- Formulate the hypothesis.
- Choose which statistical test you are going to use.
- Set the significance level.
- Calculate the test statistic from your data using the chosen test.
- Conclusions:
  - A decision is made to reject or not reject the null hypothesis from step 1.
  - This decision is based on the predetermined level of significance from step 3.
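The steps above can be sketched in code. This is an illustrative example only, applying a two-sample z-test (appropriate for large samples) to the Drug X scenario; all numbers are hypothetical, chosen just to make the workflow concrete.

```python
import math

# Step 1: formulate the hypothesis.
#   H0: mean BP reduction is equal in the Drug X and placebo groups.
# Step 2: choose the test: two-sample z-test on the difference in means.
# Step 3: set the significance level.
alpha = 0.05

# Hypothetical summary data: mean systolic BP reduction (mm Hg), SD, n per group.
mean_drug, sd_drug, n_drug = 12.0, 15.0, 200
mean_placebo, sd_placebo, n_placebo = 8.0, 15.0, 200

# Step 4: calculate the test statistic.
se = math.sqrt(sd_drug**2 / n_drug + sd_placebo**2 / n_placebo)
z = (mean_drug - mean_placebo) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Step 5: conclusion — reject or fail to reject H0.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p_value:.4f}: {decision}")
```

With these invented numbers the p-value falls below 0.05, so the sketch ends by rejecting the null hypothesis; with a smaller difference in means it would fail to reject instead.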

## Formulating a Hypothesis

A hypothesis is a preliminary answer to a research question (i.e., a “guess” about what the results will be). There are 2 types of hypotheses: the null hypothesis and the alternative hypothesis.

### Null hypothesis

- The null hypothesis (H_{0}) states that there is no difference between the populations being studied (or, put another way, there is no relationship between the variables being tested).
- Written as a formula, H_{0}: µ_{1} = µ_{2}, where µ_{1} and µ_{2} represent the means (or average measurements) of groups 1 and 2, respectively.
- Example: Drug X has no effect on BP (i.e., mean BP is the same in the treated and untreated groups).

### Alternative hypothesis

- The alternative hypothesis (H_{1}) states that there is a difference between the populations being studied.
- Written as a formula, H_{1}: µ_{1} ≠ µ_{2}
- Example: Drug X lowers BP (i.e., mean BP differs between the treated and untreated groups).
- H_{1} is the statement that researchers think is true.

### What is the study really testing?

- A hypothesis can never be conclusively confirmed, but it can be conclusively rejected.
- Therefore, the alternative hypothesis cannot be directly confirmed or rejected.
- Instead, a research study will reject or fail to reject the null hypothesis.

### Examples

**Example 1: rejecting the null hypothesis**

In the example above, if the findings of the trial show that Drug X does in fact significantly lower BP, then the null hypothesis (postulating that there is no difference between the groups) is rejected. Note that these findings do not confirm the alternative hypothesis; because many alternative hypotheses are also possible, the findings only reject the null hypothesis.

**Example 2: failing to reject the null hypothesis**

In the example above, if the findings of the trial show that Drug X did not significantly lower BP, then the study failed to reject the null hypothesis. Again, note that the findings do not confirm the null hypothesis.

### Types of errors and power

**Type I error:**

- The null hypothesis is true, but is rejected.
- The chance of committing a type I error is represented as α.

**Type II error:**

- The null hypothesis is false, but is not rejected.
- The chance of committing a type II error is represented as β.

**Power:**

- The probability that a test will correctly reject a false null hypothesis
- Power = 1 – β
- Power depends on:
  - Sample size (e.g., larger sample size → ↑ power)
  - Size of expected effect (e.g., larger expected effect → ↑ power)
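These two dependencies can be demonstrated with a rough sketch using the normal-approximation power formula for a two-sample test of means. The effect sizes, SD, and sample sizes below are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(effect, sd, n, z_crit=1.96):
    """Approximate power to detect a true mean difference `effect` with
    `n` subjects per group, common SD `sd`, two-sided alpha = 0.05."""
    # Expected value of the z statistic under the alternative hypothesis.
    ncp = effect / (sd * math.sqrt(2.0 / n))
    # Probability of exceeding the critical value (opposite tail ignored).
    return 1 - normal_cdf(z_crit - ncp)

base = power_two_sample(effect=4, sd=15, n=100)
more_n = power_two_sample(effect=4, sd=15, n=400)   # larger sample size
bigger = power_two_sample(effect=8, sd=15, n=100)   # larger expected effect

print(f"{base:.2f} {more_n:.2f} {bigger:.2f}")
```

Quadrupling the sample size, or doubling the expected effect, each raises the power well above the baseline value, matching the two bullet points above.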

## Determining Statistical Significance

Statistical significance is the idea that an observed result is unlikely to have been produced simply by chance. To determine statistical significance, you need to set an α-level and calculate a p-value.

### P-values

A graph can be created in which possible study results are plotted on the x-axis and the probability of observing each result is plotted on the y-axis. The area under the curve at and beyond the observed result represents the p-value.

- The p-value is the probability of obtaining a given result, assuming the null hypothesis is true.
- In other words, the p-value is the probability that you would get this result if there was no relationship between the variables and that the results occurred simply by chance.
- Like all probabilities, the p-value is between 0 and 1.

- Higher p-values (larger areas under the curve):
  - Indicate a higher likelihood that the null hypothesis is true
  - Suggest that there is no relationship between your variables
  - Example: In the example above, a p-value of 0.6 would mean it is unlikely that Drug X is associated with lower BP.
- Lower p-values (smaller areas under the curve):
  - Indicate a lower likelihood that the null hypothesis is true
  - Suggest that an observed correlation between your variables is unlikely to be due simply to chance and that a true relationship likely exists
  - Example: In the example above, a p-value of 0.02 suggests that Drug X is associated with lower BP.
- If the p-value is lower than your predetermined level of significance (α-level), you can reject the null hypothesis, because there likely is a real relationship between your variables.
- The lower the p-value, the more confident you can be that the relationship between your variables is real (and not due to chance).

**Mnemonic:**

“If the p is low, the null (hypothesis) must go.”

### α-level

- The α-level is a p-value threshold that represents an arbitrarily determined “significance level.”
- The α-level should be chosen prior to conducting a study.
- By convention, the α-level is typically set at 0.05 or 0.01.
- The α-level is the risk you are willing to take of making a wrong decision, in which you incorrectly reject the null hypothesis (when it is in fact true).
**Example:**

- An α-level of 0.05 means you will conclude that a relationship between your variables exists if the p-value is < 0.05.
- This means you are willing to accept up to a 5% chance of committing a type I error.

- In the Drug X BP example, if the p-value was 0.03, then you would conclude that:
  - Drug X is associated with lower BP → this is a rejection of the null hypothesis.
  - If the null hypothesis were in fact true (i.e., Drug X is not actually associated with lower BP), a result at least this extreme would be seen only 3% of the time.

### Confidence intervals

- A CI is a range of values that is likely to contain the true value of the parameter being estimated.
- CIs measure the degree of uncertainty in sampling.
- The CI reflects the range of means you would get from repeatedly sampling the same population.
- CIs are calculated using the sample size, the sample’s mean, and the standard deviation (online calculators and standard tables are typically used).

- The confidence level for CIs is the probability that the CI contains the true result
- Most commonly, a 95% confidence level is used (though the confidence level often ranges from 90% to 99%)
- A 95% CI is a range of values that are 95% certain to contain the true mean of the population.
- Like the α-level, the CI confidence level is chosen prior to testing the data.
- The higher the confidence needed, the larger the interval will be.

- Example:
- A mean height of 70 inches is found.
- The 95% CI is calculated to be between 68 and 72 inches.
- This means that if the researchers take 100 random samples from that same population, 95% of the time, the mean will fall between 68 and 72 inches. (It does not mean that 95% of the data in that 1 sample are between 68 and 72 inches.)
- If a higher level of confidence is desired, the range will widen; for example, a 99% CI may result in a CI of 66 to 74 inches.
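A minimal sketch of the height example, using the normal approximation (z = 1.96 for 95% confidence, z = 2.58 for 99%). The sample SD of 10.2 and n = 100 are hypothetical values chosen so that the 95% interval matches the 68–72 range in the text.

```python
import math

mean, sd, n = 70.0, 10.2, 100
se = sd / math.sqrt(n)  # standard error of the mean

for conf, z in [("95%", 1.96), ("99%", 2.58)]:
    margin = z * se
    print(f"{conf} CI: ({mean - margin:.1f}, {mean + margin:.1f})")
```

As the bullet above notes, raising the confidence level from 95% to 99% widens the interval, because a larger critical value multiplies the same standard error.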

### Pitfalls in hypothesis testing

- Do not base your hypothesis on what you see in the data.
- Do not make your H_{0} what you want to show to be true.
- Check the conditions.
- Do not accept the H_{0}; instead, fail to reject it.
- Do not confuse practical significance with statistical significance (e.g., with a large enough sample size, you may find that Drug X lowers systolic BP by 2 mm Hg; even if this is statistically significant, is it clinically significant for your patient?).
- If you fail to reject the H_{0}, do not assume that a larger sample size will lead to rejection.
- Be sure to think about whether it is reasonable to assume that events are independent.
- Do not interpret p-values as the probability that the H_{0} is true.
- Even a test carried out perfectly can be wrong.

## Statistical Tests

### Choosing the right test

Your choice of test is based on:

- The types of variables you are testing (both your test “exposure” and your “outcome”)
- Quantitative: continuous (age, weight, height) versus discrete (number of patients)
- Categorical: ordinal (rankings; e.g., grades, clothing size), nominal (groups with names; e.g., marital status), or binary (data with only a “yes/no” answer; e.g., alive or dead)

- Whether or not your data meet certain criteria known as assumptions; common assumptions include:
- Data points are all independent of one another.
- Variance within a single group is similar among all groups.
- Data follow a normal distribution (bell curve).

The reasonability of the model should always be questioned. If the model is wrong, so is everything else.

Be careful of variables that are not truly independent.

### Types of tests

The 3 primary categories of statistical tests are:

- Regression tests: assess how ≥ 1 predictor variables affect an outcome variable (often framed as cause and effect)
- Comparison tests: compare the means of different groups (require quantitative outcome data)
- Correlation tests: look for associations between different variables

| Test name | What the test is testing | Types of variables/data | Example |
|---|---|---|---|
| **Regression tests** | | | |
| Simple linear regression | How a change in the predictor/input variable affects the outcome variable | Predictor: continuous; Outcome: continuous | How does weight (predictor) affect life expectancy (outcome)? |
| Multiple linear regression | How changes in the combinations of ≥ 2 predictor variables can predict changes in the outcome | Predictors: continuous; Outcome: continuous | How do weight and socioeconomic status (predictors) affect life expectancy (outcome)? |
| Logistic regression | How ≥ 1 predictor variables can affect a binary outcome | Predictor: continuous; Outcome: binary | What is the effect of weight (predictor) on survival (binary outcome: dead or alive)? |
| **Comparison tests** | | | |
| Paired t-test | Compares the means of 2 groups from the same population | Predictor: categorical; Outcome: quantitative | Compare the weights of infants (outcome) before and after feeding (predictor). |
| Independent t-test | Compares the means of 2 groups from different populations | Predictor: categorical; Outcome: quantitative | What is the difference in average height (outcome) between 2 different basketball teams (predictor)? |
| Analysis of variance (ANOVA) | Compares the means from > 2 groups | Predictor: categorical; Outcome: quantitative | What is the difference in blood glucose levels (outcome) 1, 2, and 3 hours after a meal (predictors)? |
| **Correlation tests** | | | |
| Chi-square test | Tests the strength of association between 2 categorical variables with a larger sample size | Variable 1: categorical; Variable 2: categorical | Compare whether acceptance into medical school (variable 1) is more likely if the applicant was born in the United Kingdom (variable 2). |
| Fisher’s exact test | Tests the strength of association between 2 categorical variables with a smaller sample size | Variable 1: categorical; Variable 2: categorical | Same as chi-square, but with smaller sample sizes |
| Pearson r test | Tests the strength of association between 2 continuous variables | Variable 1: continuous; Variable 2: continuous | Compare how plasma HbA_{1c} level (variable 1) is related to plasma triglyceride levels (variable 2) in diabetic patients. |
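The selection logic of the table above can be sketched as a toy lookup that maps (predictor type, outcome type) to a candidate test. This is only an illustration: real test selection also depends on the assumptions listed earlier (independence, variance, normality), which this lookup ignores.

```python
# Toy mapping from variable types to candidate tests, mirroring the table.
TEST_CHOICES = {
    ("continuous", "continuous"): "simple linear regression (or Pearson r)",
    ("continuous", "binary"): "logistic regression",
    ("categorical (2 paired groups)", "quantitative"): "paired t-test",
    ("categorical (2 groups)", "quantitative"): "independent t-test",
    ("categorical (>2 groups)", "quantitative"): "ANOVA",
    ("categorical", "categorical"): "chi-square (large n) or Fisher's exact test (small n)",
}

def choose_test(predictor, outcome):
    """Return a candidate test for the given variable types."""
    return TEST_CHOICES.get((predictor, outcome), "not covered by this toy table")

print(choose_test("categorical", "categorical"))
print(choose_test("continuous", "binary"))
```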

### Chi-square test (χ^{2})

Chi-square tests are commonly used to analyze categorical data and determine whether 2 categorical variables are related.

- What chi-square tests can assess:
  - Whether or not a statistically significant association is present between 2 variables
  - Analyzed data: typically “counted” categorical data, meaning you have a number of named categories, and your data points are the counted values for each category
  - More accurate than Fisher’s exact test on large samples

- What chi-square tests cannot assess:
  - The strength of that association
  - Whether the relationship is causal

In order to perform a chi-square test, 2 pieces of information are needed: the degrees of freedom (number of categories minus 1), and the α-level (which is chosen by the researcher and usually set at 0.05). In addition, the data should be organized in a table.

**Example:** If you wanted to see whether jugglers were more likely to be born during a particular season, the data could be recorded in the following table:

| Category (i): season of birth | Observed frequency of jugglers in each birth season |
|---|---|
| Spring | 66 |
| Summer | 82 |
| Fall | 74 |
| Winter | 78 |

To begin, the expected frequencies for each cell in the table above need to be determined using the equation:

$$ \text{Expected frequency} = np_{0i} $$

where *n* = the sample size and *p*_{0i} = the hypothesized proportion in category *i*.

In the above example, *n* = 300 and *p*_{0i} = ¼, so the expected frequency is 300 × 0.25 = 75 in each cell.

The test statistic is then calculated using the standard chi-square formula:

$$ \chi^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}} $$

where 𝝌^{2} is the test statistic being calculated, O_{i} is the observed frequency, and E_{i} is the expected frequency for category i. For each “cell” or category, the expected frequency is subtracted from the observed frequency; this value is squared and then divided by the expected frequency. After this number is calculated for each category, the numbers are added together.

**Example 𝝌^{2} calculation:** Using the example above, the expected frequency in each cell is 75, so the 𝝌^{2} test statistic can be calculated as follows:

| Category (i): season of birth | Observed frequency of jugglers with each birth season | (Observed ‒ expected)^{2}/expected |
|---|---|---|
| Spring | 66 | (66 ‒ 75)^{2} / 75 = 1.08 |
| Summer | 82 | (82 ‒ 75)^{2} / 75 = 0.653 |
| Fall | 74 | (74 ‒ 75)^{2} / 75 = 0.013 |
| Winter | 78 | (78 ‒ 75)^{2} / 75 = 0.12 |

**𝝌^{2} = 1.08 + 0.653 + 0.013 + 0.12 = 1.866**
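The same calculation can be done directly from the chi-square formula; the snippet below reproduces the juggler example (the small difference from 1.866 comes from rounding each cell in the hand calculation).

```python
# Chi-square statistic for the juggler example: 4 seasons, 300 jugglers.
observed = {"Spring": 66, "Summer": 82, "Fall": 74, "Winter": 78}
n = sum(observed.values())        # 300
expected = n / len(observed)      # 75 expected per season under H0

chi2 = sum((obs - expected) ** 2 / expected for obs in observed.values())
print(round(chi2, 3))

critical = 7.81  # df = 3, alpha = 0.05, from a chi-square table
print("reject H0" if chi2 > critical else "fail to reject H0")
```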

**Determining whether or not the test statistic is statistically significant:**

To determine whether this test statistic is statistically significant, the chi-square table is used to obtain the chi-square critical number.

- The table has degrees of freedom (number of categories minus 1) listed along the rows and the α-level along the columns.
- Using the degrees of freedom and α-level from the study, you find the critical number on the chart (see example chart below).
- The critical number is used to determine statistical significance by comparing it to the test statistic.

**If the test statistic > critical value:**

- The observed frequencies are far from the expected frequencies.
- Reject the null hypothesis in favor of the alternative hypothesis based on this α-level.

**If the test statistic < critical value:**

- The observed frequencies are close to the expected frequencies.
- Do not reject the null hypothesis based on this α-level.

**Example 𝝌^{2} test:** Are jugglers more likely to be born in a particular season at a 0.05 significance level?

- There are 4 different seasons, so there are 3 degrees of freedom.
- α-level = 0.05
- Using the chi-square table, the critical number is 7.81.
- Therefore, we will reject our null hypothesis if the test statistic is > 7.81.


Since 1.866 is < 7.81 (our critical value), we fail to reject the null hypothesis and conclude that season of birth is not shown to be associated with juggling.

**Common pitfalls:**

- Do not use chi-square unless the data are counted.
- Beware of large sample sizes: the test statistic grows with the sample size while the degrees of freedom do not, so even trivial deviations can become statistically significant.

### Fisher’s exact test

Similar to the 𝝌^{2} test, Fisher’s exact test is a statistical test used to determine whether there are nonrandom associations between 2 categorical variables.

- Used to analyze data found in contingency tables and determine the deviation of data from the null hypothesis (i.e., the p-value)
- For example: comparing 2 possible “exposures” (smoking versus not smoking) with 2 possible outcomes (develops lung cancer versus healthy)
- Contingency tables may have > 2 “exposures” or > 2 outcomes

- More accurate for small data sets
- Fisher’s test gives exact p-values based on the table.
- Complicated formula to calculate the test statistic, so typically calculated with software.

A 2 × 2 contingency table is set up like this:

| | Y | Z | Row total |
|---|---|---|---|
| W | A | B | A + B |
| X | C | D | C + D |
| Column total | A + C | B + D | A + B + C + D (= n) |

The probability of the observed table, *p*, is calculated from this table using the following formula:

$$ p = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{A!\,B!\,C!\,D!\,n!} $$

where *p* = the probability of the table under the null hypothesis; A, B, C, and D are the numbers from the cells of a basic 2 × 2 contingency table; and *n* = the total of A + B + C + D. The p-value is obtained by summing these probabilities over all tables that are at least as extreme as the observed one.
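A sketch of Fisher's exact test built from this formula. The binomial-coefficient form below is algebraically identical to the factorial formula, and the example 2 × 2 table (smoking vs. lung cancer, as in the text) uses hypothetical counts.

```python
from math import comb

def table_prob(a, b, c, d):
    """Probability of this exact 2 x 2 table, given its fixed margins
    (the factorial formula rewritten with binomial coefficients)."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_exact_p(a, b, c, d):
    """Two-sided p-value: sum the probabilities of all tables with the
    same margins that are no more probable than the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p_obs = table_prob(a, b, c, d)
    p_value = 0.0
    for x in range(max(0, col1 - (n - row1)), min(row1, col1) + 1):
        p_x = table_prob(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p_x <= p_obs + 1e-12:
            p_value += p_x
    return p_value

# Hypothetical data: 8 of 10 smokers vs. 1 of 10 nonsmokers develop disease.
print(round(fisher_exact_p(8, 2, 1, 9), 4))
```

In practice this enumeration is exactly what statistical software does for you, which is why the text notes the test is typically calculated with software.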


## Graphical Representation of Data

### Purpose

Before any calculations are made, data should be presented in a simple graphical format (e.g., bar graph, scatter plot, histogram).

- The characteristics of the distribution of data will indicate the statistical tools that will be needed for analysis.
- Graphs are the 1st step in data analysis, allowing for the immediate visualization of distributions and patterns, which will determine the next steps of statistical analysis.
- Outliers can be an indication of mathematical or experimental errors.
- There are many ways to graphically represent data.
- After calculations are completed, visual presentation can assist the reader in conceptualizing the results.

### Displaying a relationship between variables

**Contingency tables:**

- Tables showing the relative frequencies of different combinations of variables
- Example: Comparing the results of a screening test (positive or negative) with whether or not people actually have a disease. (Note: This specific type of contingency table can be used to calculate the sensitivity and specificity of a screening test.)

**Scatter diagram or dispersion diagrams:**

- A method commonly used to display the relationship between 2 numerical variables or 1 numerical variable and 1 categorical variable
- The dots represent the values of individual data points.
- Allows for calculation of a “best fit line” representing the data as a whole
- Allows for easy visualization of the entire data set
- Example: scatter diagram showing the relationship between 2 numerical variables

**Box plots:**

- Shows the spread and center of the data set
- Visually expresses a 5-number summary:
  - The minimum value is shown at the end of the left whisker.
  - The first quartile (Q_{1}) is at the far left of the box.
  - The median is shown as the line in the center of the box.
  - The third quartile (Q_{3}) is at the far right of the box.
  - The maximum value is shown at the end of the right whisker.

- Typically used when comparing means and distributions between 2 populations
- Example: The following box plot compares the average incubation periods between different variants of the novel coronavirus (nCoV), SARS, and Middle East respiratory syndrome (MERS).

**Kaplan-Meier survival curves:**

- A type of statistical analysis used to estimate time-to-event data—typically, survival data.
- Commonly used in medical studies showing how a particular treatment can affect/prolong survival.
- The line represents the number of patients surviving (or who have not yet achieved a certain end point) at a given point in time.
- Example: The survival curve below shows how 2 different gene signatures affect survival. The study begins at time point 0, with 100% of the 2 groups surviving. Each drop-off in the line represents people dying in each group, decreasing the percentage of people who remain living. After 3 years, approximately 50% of people with the Gene A signature are still alive, compared with only 5% who have the Gene B signature.

### Presentation of numerical variables

**Tables** (a frequency table is 1 example):

- The simplest form of displaying data
- Data are displayed in columns and rows.

**Histograms:**

- Good for demonstrating the results of continuous data, such as:
- Weights
- Heights
- Lengths of time

- Similar to, but not the same as, bar graphs (which display categorical data)
- A histogram display divides the continuous data into intervals or ranges.
- The height of each bar represents the number of data points that fall into that range.
- Because histograms represent continuous data, they are drawn with no gaps between the bars.
- Example: A histogram showing how many people lost or gained weight over a 2-week study period. In this example, 1 person lost between 2.5 and 3 pounds, 27 people gained between 0 and 0.5 pounds, and 5 people gained between 1 and 1.5 pounds.
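The binning step behind a histogram can be sketched in a few lines: continuous values are floored to the lower edge of their interval and counted. The weight changes below are hypothetical and do not reproduce the exact counts in the example above.

```python
from collections import Counter

# Hypothetical weight changes (pounds) over a 2-week study period.
weight_changes = [-2.7, 0.1, 0.3, 0.4, 1.2, 0.2, -0.4, 1.4, 0.45]
bin_width = 0.5

# Floor each value to the lower edge of its 0.5-pound interval.
counts = Counter((wc // bin_width) * bin_width for wc in weight_changes)

for edge in sorted(counts):
    print(f"[{edge:+.2f}, {edge + bin_width:+.2f}): {counts[edge]} people")
```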

**Frequency polygon charts:**

- A frequency polygon graph plots the frequencies of each data point (or range in a histogram) and connects them with a line.
- Good for understanding the shape of a distribution

### Presentation of categorical variables

Frequency tables, bar charts/histograms, and pie charts are 3 of the most common ways to present categorical data.

**Frequency tables:**

- Display numbers and/or percentages for each value of a variable
- Example: Pull up to 100 different stoplights and record whether the light was red, yellow, or green upon your arrival.

| Stoplight color | Frequency |
|---|---|
| Red | 65 |
| Yellow | 5 |
| Green | 30 |

**Bar graph:**

- The length of each bar indicates the number or frequency of that variable in the data set; bars can be plotted vertically or horizontally
- Example: A bar graph showing the breakdown of race/ethnicity in Texas in 2015.

**Pie charts:**

- Demonstrates relative proportions between different categorical variables
- Example: The following pie chart shows the results of the European Parliament election in 2004, with each color representing a different political party and the percentage of votes they received.
