Learn about Chi-square using R


Chi-square is typically used to test the relationship between two categorical variables. For example, whether a car has an automatic transmission is a categorical variable, while the age of an individual is a continuous variable. Chi-square is for testing relationships between variables like the former. Thus our null hypothesis (H0) would be: there is no relationship between the two categorical variables. And our alternative hypothesis (H1) would be: there is a relationship between the two categorical variables.

Typically your variables can be visualized in a bivariate table that looks like:

mytable = table(mtcars$am, mtcars$vs)
mytable
##      0  1
##   0 12  7
##   1  6  7

Here, “0” and “1” are the categories of each variable: the rows are am and the columns are vs. In R, you don’t need a table like this to run Chi-square; it’s just a helpful data visualization. You can simply pass the two variables in question, as I demonstrate below.
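If the bare 0/1 labels are hard to read, here is a minimal sketch of a labelled version of the same table. The label text follows the mtcars help page (am: 0 = automatic, 1 = manual; vs: 0 = V-shaped, 1 = straight); the object names `am_f`, `vs_f` and `labelled` are just illustrative.

```r
# Recode the 0/1 columns as labelled factors (labels taken from ?mtcars)
am_f <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
vs_f <- factor(mtcars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))

# Same contingency table as before, now with named rows and columns
labelled <- table(transmission = am_f, engine = vs_f)

# addmargins() appends row and column totals, handy for eyeballing the counts
print(addmargins(labelled))
```

The counts are identical to the table above; only the labels change.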

Let’s take a look at the Chi-square equation. $$X^2 = \sum_{ij} \frac{(o_{ij}-e_{ij})^2}{e_{ij}}$$ For each cell of the table, take the observed count minus the count expected if no relationship existed, square it, and divide by the expected count; then sum over all cells.
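To make the formula concrete, here is a rough sketch that computes it by hand for the am and vs table. It assumes the usual expected count for a cell, row total times column total divided by the grand total. The comparison uses `chisq.test()` with `correct = FALSE`, because for 2x2 tables the function applies Yates’ continuity correction by default, which the plain formula above does not include. The variable names are illustrative.

```r
# Observed counts: rows are am (transmission), columns are vs (engine shape)
observed <- table(am = mtcars$am, vs = mtcars$vs)

# Expected counts if there were no relationship:
# (row total * column total) / grand total, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum over all cells of (observed - expected)^2 / expected
x2_by_hand <- sum((observed - expected)^2 / expected)

# The same statistic from chisq.test() with the continuity correction off
x2_builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)

print(expected)
print(c(by_hand = x2_by_hand, builtin = x2_builtin))
```

Both numbers should agree, which is a quick check that the formula and the function are describing the same thing.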

This Chi-square statistic is compared against a critical value from the chi-square distribution with $df = (r - 1)(c-1)$ degrees of freedom, where r is the number of rows in the contingency table and c is the number of columns. Just like t and F statistics, this $X^2$ statistic is looked up in a table to see if it’s significant or not.

We’ll be using the mtcars dataset again. It’s really an all-around great dataset for exploring different models and data visualization. I’ll trim down the dataset to the variables of interest and then perform the Chi-square test.
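Instead of a printed table, R can do the lookup directly. The sketch below assumes a significance level of 0.05 (my choice for illustration, not something fixed by the test) and uses `qchisq()` for the critical value and `pchisq()` for the p-value.

```r
# Degrees of freedom for the 2 x 2 am-by-vs table
n_rows <- 2
n_cols <- 2
df <- (n_rows - 1) * (n_cols - 1)

# Assumed significance level
alpha <- 0.05

# Critical value: the statistic must exceed this to be significant at alpha
critical_value <- qchisq(1 - alpha, df = df)

# Statistic as reported by chisq.test() (includes Yates' correction for 2x2)
x2 <- unname(chisq.test(mtcars$am, mtcars$vs)$statistic)

# Upper-tail probability of seeing a statistic at least this large under H0
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

print(c(df = df, critical = critical_value, x2 = x2, p = p_value))
```

With one degree of freedom the critical value is about 3.84, so a statistic below that is not significant at the 0.05 level.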

library(dplyr)  # provides %>% and select()

mtcars = mtcars %>% 
  select("am", "vs")
chisq.test(mtcars$am, mtcars$vs)
##  Pearson's Chi-squared test with Yates' continuity correction
## data:  mtcars$am and mtcars$vs
## X-squared = 0.34754, df = 1, p-value = 0.5555
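The output header mentions Yates’ continuity correction. `chisq.test()` applies it by default to 2x2 tables, and it can be switched off with `correct = FALSE`. A small sketch comparing the two, with illustrative object names:

```r
# Default call: Yates' continuity correction is applied for 2 x 2 tables
with_yates <- chisq.test(mtcars$am, mtcars$vs)

# Same test with the correction turned off (plain Pearson statistic)
without_yates <- chisq.test(mtcars$am, mtcars$vs, correct = FALSE)

# Compare the statistics and p-values side by side
comparison <- rbind(
  statistic = c(yates = unname(with_yates$statistic),
                plain = unname(without_yates$statistic)),
  p_value   = c(yates = with_yates$p.value,
                plain = without_yates$p.value)
)
print(comparison)
```

The corrected statistic is the smaller of the two, so the corrected test is the more conservative one. Neither version is significant at the 0.05 level for these data.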

We can conclude that there is no statistically significant relationship between transmission type and engine shape, so we fail to reject H0.

Things to know about Chi-square:

  1. Large sample sizes (about 500) will almost always reveal a significant effect, even when the relationship is very weak.
  2. Low expected counts in a cell (e.g. <5) are a problem, because the chi-square approximation becomes unreliable.
  3. Chi-square is a non-parametric test, meaning it makes almost no assumptions about how the data are distributed. There are still assumptions about the data, e.g. the variables are categorical, the observations are independent, and each participant contributes to only one cell.
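As a rough check on the second point, the expected counts are stored in the object that `chisq.test()` returns. The sketch below uses the common rule-of-thumb threshold of 5; the names `result` and `min_expected` are illustrative.

```r
# Run the test on the am-by-vs table and keep the result object
result <- chisq.test(table(am = mtcars$am, vs = mtcars$vs))

# Expected counts under H0, one per cell
print(result$expected)

# Rule-of-thumb check: is any expected count below 5?
min_expected <- min(result$expected)
any_low <- any(result$expected < 5)
print(c(min_expected = min_expected, any_low = any_low))
```

For this table every expected count is above 5, so the low-count concern does not apply here.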

In the next post, I’ll be moving on from basic statistics and will start on Bayesian modelling. The next post will be about Markov chains.


Mohan Gupta
Postdoctoral Scholar

My research interests include what the best ways to learn are, why those are the best ways, and whether I can build computational models to predict what people will learn in both motor and declarative learning.