Chi-square in R

By Synnøve Yndestad in R Statistics Clinical science

August 7, 2021

The Chi-square test is used to compare differences between two or more categorical variables. All variables must be ordinal or nominal and summarized as a frequency table. It is a non-parametric test, meaning that it is suitable also for data that is not normally distributed.

Some of the assumptions for performing a Chi-square test are:

  • Each observation is independent of all the others (one observation per subject), and the categories must be mutually exclusive so that a subject fits into only one of the categories. I.e Small/Medium/Large, or Mutated/Wild-type.
  • The value of the expected count should be >=5 in minimum 80% of the cells, and no cells should have 0 or negative counts.

For more details, see PMID: 23894860

To run a Chi-square test i R, we first need some data. The “esoph” dataset in R´s datasets-package contains data from a case control study of oesophaeal cancer, where records for 88 age-group/alcohol-consumption/tobacco-consumption combinations are listed with the frequency of number of cases (ncases) and number of controls (ncontrols).

Fetch the esoph-dataset, take a look of the data-structure and data.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data(esoph)
head(esoph)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7
str(esoph)
## 'data.frame':    88 obs. of  5 variables:
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...

The data structure tells us that the age-group/alcohol-consumption/tobacco-consumption columns are ordered factors with 6, 4 and 4 levels respectively, while the cases and control columns are numerical.

DT::datatable(esoph, options = list(searching = FALSE))

Some data transformation is necessary. Lets make a frequency table for alcohol consumption. Then we need to group by alcohol consumption level, and summarize cases and controls within each group.

alcohol <- esoph %>% select(c(-"tobgp", -"agegp")) %>%        # Remove unnecessary columns
                     group_by(alcgp) %>%                      # Group by alcohol consumption levels
                     mutate(ncases = sum(ncases)) %>%         # Summarize cases of cancer within alcohol consumption level-groups
                     mutate(ncontrols = sum(ncontrols)) %>%   # Summarize cases of controls within alcohol consumption level-groups
                     distinct()                               # Remove duplicate values
alcohol <- as.data.frame(alcohol)

Turn the alcgp-column into column names:

alcgp <- alcohol$alcgp
row.names(alcohol) <- alcgp
alcohol <- alcohol[,-1]
alcohol
##           ncases ncontrols
## 0-39g/day     29       386
## 40-79         75       280
## 80-119        51        87
## 120+          45        22
DT::datatable(alcohol, options = list(searching = FALSE))

Now, perform the Chi-square test on the frequency table and save test results as “alcoholChi”:

alcoholChi <- chisq.test(alcohol)
alcoholChi
## 
##  Pearson's Chi-squared test
## 
## data:  alcohol
## X-squared = 158.95, df = 3, p-value < 2.2e-16

The p-value is 2.2e-16, and considered significant. We can therefore assume that there is an association between alcohol consumption and oesophaeal cancer.

If the chi squared test is significant we need to take a look at the residuals to figure out the nature of the association.

Print the observed and expected counts from the test, and the residuals:

# observed counts 
alcoholChi$observed  
##           ncases ncontrols
## 0-39g/day     29       386
## 40-79         75       280
## 80-119        51        87
## 120+          45        22
# expected counts under the null hypothesis
alcoholChi$expected  
##             ncases ncontrols
## 0-39g/day 85.12821 329.87179
## 40-79     72.82051 282.17949
## 80-119    28.30769 109.69231
## 120+      13.74359  53.25641
# Pearson residuals  (observed - expected) / sqrt(expected).
alcoholChi$residuals  
##               ncases  ncontrols
## 0-39g/day -6.0833726  3.0903564
## 40-79      0.2554039 -0.1297453
## 80-119     4.2650726 -2.1666591
## 120+       8.4311925 -4.2830501

When the Pearson residuals is above 2, the cells observed frequency is more than the expected frequency. When the Pearson residuals is less than -2, the cells observed frequency is less than the expected frequency. This enables us to see which cells are contributing for the association, and in which direction (negative or positive).

To visualize associations, we can use the nifty assocplot() function. It plots a “Cohen–Friendly association plot and it takes a contingency table as input, and displays the deviations from independence as a bar chart.

See ??assocplot() for details.

assocplot read data as matrix. Therefore, transform the data.frame into matrix inside the call:

assocplot(as.matrix(alcohol), xlab = "Alcohol consuption", main="Association between alcohol consumption and oesophaeal cancer")

The height of the box represents the contribution of each cell into the total Chi-squared.
The area of the box is proportional to the difference in observed and expected frequencies. Thus, a wide and tall box indicates a stronger association.

According to these results, there is a clear association between high alcohol consumption and oesophaeal cancer. Does this mean that we should stop drinking wine completely?

According to Av og til, 1 unit of alcohol is 12g. That is in more commonly known units;

  • One bottle of beer (33 cl) 4.5 vol %
  • A small glass of wine (12.5 cl) 12 vol %
  • A very small glass of liquor (4 cl) 40 vol %

The association was significant for cases with an alcohol-consumption of more than 80g/day. This equals more than 6 units a day. 6 units is approximately 2.2 L beer, or 83.25 cl wine (i.e more than a bottle (75cl)), or 26.64 cl liquor every day. That amount would be unhealthy also beyond increased cancer risk. So if you enjoy your Friday evening glass, have no worries, as long it is in moderation.

When not to use Chi-square

In some cases it is inappropriate to use the Chi-square test.

Using data listed in table two we could try to test if patients that have tumors with HR mutations responds better to treatment than wild-type tumors that is able to repair DNA by HR.

First, create a contingancy table from vectors.

HR_mutation <- c(10, 1, 0)  
Wt <- c(8, 10, 3)
df = rbind(HR_mutation, Wt)             # Bind vector by rows to a data frame
colnames(df) <- c("CR+PR", "SD", "PD")  # Add response values as column names
df                                      # print data frame
##             CR+PR SD PD
## HR_mutation    10  1  0
## Wt              8 10  3
chisq.test(df)
## Warning in chisq.test(df): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  df
## X-squared = 8.2683, df = 2, p-value = 0.01602

As we see from the output warning, the Chi-squared approximation may be incorrect. This is because this test is sensitive to sample size. A common rule of thumb is to have no cells with an expected count of zero, 5 or more in all cells in 2x2 tables or that 80% of cells have more than 5 in larger tables. In this case a Fisher test (after transforming the data to a 2x2 table) is more appropriate due to small sample size, or a test for trends in proportions, as the values are ordinal.

When the number of categories for each variable becomes very large (more than 20) it becomes difficult to interpret the results. Also, when the sample size is very large (>500) any small difference can produce significant results causing Type 1 errors or false positives. Just because a difference is significant, the difference may be so small that it is biologically irrelevant. So interpret your results with care.

Posted on:
August 7, 2021
Length:
7 minute read, 1335 words
Categories:
R Statistics Clinical science
Tags:
Chi-square Categoricals Crosstab Non-parametric assocplot() chisq.test()
See Also:
Plotting bar charts in R, geom_bar vs geom_col
For loop for Multiple Trend in Proportions
How to 'Pivot Wider' when you have only character values