Refresher - Chi-Square Test
Have you ever wondered if lottery numbers were evenly distributed or if some numbers occurred with a greater frequency? How about if the types of movies people preferred were different across different age groups? What about if a coffee machine was dispensing approximately the same amount of coffee each time? You could answer these questions by conducting a hypothesis test.
A chi-square test (pronounced "kai-square") is used to determine whether categorical data are independent (i.e., whether there is an association between two categories of data). It is known by several other names: the Pearson chi-square test, cross tabulation, and the contingency table test.
Note: There are other tests that use a chi-square distribution. One tests a claim about a single standard deviation (or variance); another is the goodness-of-fit test.
However, the major use of a chi-square test is to determine whether a statistically significant difference exists between the observed (or actual) frequencies and the expected frequencies (those hypothesized, given the null hypothesis that no association exists) of variables presented in a cross-tabulation or contingency table. The larger the difference between the observed and expected frequencies, the larger the chi-square statistic, and the more likely the difference is significant.
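As a sketch of that computation, the following builds the expected frequency for each cell of a small contingency table (row total times column total, divided by the grand total) and sums the squared, scaled differences. The counts, for movie preference by age group, are made up for illustration:

```python
# Minimal sketch: chi-square statistic for a 2x2 contingency table.
# The counts (movie preference by age group) are invented for illustration.
observed = [
    [30, 20],   # age group A: action, drama
    [10, 40],   # age group B: action, drama
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under the null hypothesis of no association:
# E = (row total * column total) / grand total
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi_sq += (o - e) ** 2 / e
```

Libraries such as SciPy (`scipy.stats.chi2_contingency`) carry out this same computation and also return the p-value and degrees of freedom.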
Notation
The notation for the chi-square distribution is: χ² ~ χ²_df
where df = degrees of freedom, which depend on how the chi-square is being used. (If you want to practice calculating chi-square probabilities, use df = n − 1. The degrees of freedom for the three major uses are each calculated differently.)
For the χ² distribution, the population mean is μ = df and the population standard deviation is σ = √(2·df).
The random variable is shown as χ² but may be any uppercase letter.
The random variable for a chi-square distribution with k degrees of freedom is the sum of k independent, squared normal variables.
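A quick simulation can illustrate both this definition and the mean and standard deviation facts above: sum k squared standard normal draws many times, then compare the sample mean and standard deviation to df and √(2·df). The choices of k = 10 and 20,000 trials below are arbitrary:

```python
import math
import random

# Sketch: empirically check that the sum of k squared standard normal
# variables has mean close to k and standard deviation close to sqrt(2k).
random.seed(0)
k = 10          # degrees of freedom
trials = 20000

samples = []
for _ in range(trials):
    s = sum(random.gauss(0, 1) ** 2 for _ in range(k))
    samples.append(s)

mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
sd = math.sqrt(var)
# mean should be near k = 10; sd should be near sqrt(2 * 10) ~ 4.47
```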
Facts about the Chi-Square Distribution
- The curve is nonsymmetrical and skewed to the right.
- There is a different chi-square curve for each df.
- The test statistic for any chi-square test is always greater than or equal to zero.
- When df > 90, the chi-square curve approximates the normal distribution.
- The mean, μ, is located just to the right of the peak.
Goodness-of-Fit Test
In this type of hypothesis test, you determine whether the data "fit" a particular distribution or not. For example, you may suspect your unknown data fit a binomial distribution. You use a chi-square test (meaning the distribution for the hypothesis test is chi-square) to determine if there is a fit or not. The null and the alternate hypotheses for this test may be written in sentences or may be stated as equations or inequalities.
The test statistic for a goodness-of-fit test is: ∑ (O − E)²/E, summed over the n cells,
where:
- O = observed values (data)
- E = expected values (from theory)
- n = the number of different data cells or categories
The observed values are the data values and the expected values are the values you would expect to get if the null hypothesis were true. There are n terms of the form (O − E)²/E.
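As a minimal sketch, suppose a die is rolled 60 times and we test the fit to a fair-die (uniform) distribution; the observed counts below are invented for illustration:

```python
# Sketch: goodness-of-fit statistic for a die suspected of being fair.
# Observed counts are made up; expected counts assume each of the six
# faces is equally likely over 60 rolls.
observed = [8, 12, 9, 11, 7, 13]
expected = [10] * 6          # 60 rolls / 6 faces

# Sum the n = 6 terms of the form (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi_sq works out to 2.8 for these counts
```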
The degrees of freedom are df = n − 1, one less than the number of categories. (The formula (number of columns − 1)(number of rows − 1) applies instead to the test of independence.)
The goodness-of-fit test is almost always right-tailed. If the observed values and the corresponding expected values are not close to each other, the test statistic becomes very large, landing far out in the right tail of the chi-square curve.
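One way to see the right-tail logic without a chi-square table is to simulate the null distribution, drawing many chi-square values (as sums of df squared standard normals, per the definition above) and counting the fraction that exceed the observed statistic. The df = 5 and statistic = 2.8 below continue the fair-die illustration; both values are assumptions of this sketch:

```python
import random

# Sketch: estimate the right-tail p-value for a goodness-of-fit statistic
# by simulation. df = 5 (six die faces minus one) and statistic = 2.8 are
# illustrative values from the fair-die example.
random.seed(1)
df = 5
stat = 2.8
trials = 50000

# Simulate the chi-square distribution as a sum of df squared standard
# normals, and count how often a simulated value exceeds the statistic.
exceed = sum(
    sum(random.gauss(0, 1) ** 2 for _ in range(df)) > stat
    for _ in range(trials)
)
p_value = exceed / trials
# A large p-value here means 2.8 is not far out in the right tail,
# so the observed counts are consistent with a fair die.
```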