Finding Relationship Between Variables – ANOVA (Part 1)
ANOVA Test in Python
Finding the relationship between variables is a very important step in any statistical modeling. For example, suppose you are working with a dataset that contains hundreds of variables but very few observations. You cannot simply include all those hundreds of variables in your model. Otherwise you will run into statistical concepts like the Curse of Dimensionality (loosely, the number of observations you need grows exponentially with the number of predictors; as a rule of thumb, n predictors call for about 2^{n} observations) or Occam’s Razor (which says that if two models perform equally well, you should choose the one with fewer predictor variables; this concept is also known as the Law of Parsimony). So, in this scenario, how will you choose which variables are important and which are not? There are many feature selection packages and statistical tests available in R, Python and other statistical software that will do this for you automatically. But understanding what’s going on under the hood in those packages or tests will help you choose one technique over another. In this article, I will explain what the ANOVA test is and when it is used.
In general, we come across only two types of variables in any dataset: quantitative and categorical. Suppose you want to check the relationship between two variables, one being the predictor and the other the target. Several combinations of predictor and target types are possible. The following table shows which test can be used to find a relationship for each combination of predictor and target variable.
Predictor Variable | Target Variable: Quantitative | Target Variable: Categorical |
Quantitative | Pearson Correlation | Chi-Square Test* |
Categorical | ANOVA Test | Chi-Square Test |
*In this case, you first need to categorize your quantitative predictor into two levels; then you can apply the Chi-Square test of independence.
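The table above can also be written as a small lookup, which is handy when you are screening many variables. This is only an illustrative sketch: the names `TESTS` and `suggest_test` are made up for this article and do not come from any library.

```python
# Lookup built from the table above: (predictor type, target type) -> test.
# TESTS and suggest_test are hypothetical names used only for illustration.
TESTS = {
    ("quantitative", "quantitative"): "Pearson correlation",
    ("quantitative", "categorical"): "Chi-Square test (after binning the predictor)",
    ("categorical", "quantitative"): "ANOVA test",
    ("categorical", "categorical"): "Chi-Square test",
}


def suggest_test(predictor_type, target_type):
    """Return the test the table suggests for a predictor/target type pair."""
    key = (predictor_type.lower(), target_type.lower())
    if key not in TESTS:
        raise ValueError(f"unknown variable types: {key}")
    return TESTS[key]


# A categorical predictor with a quantitative target is the ANOVA case.
print(suggest_test("categorical", "quantitative"))
```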
Now let’s get back to the agenda of this article, which is the ANOVA test. From the table above you can see that the ANOVA test is used when you want to find the relationship between a categorical predictor variable and a quantitative target variable. But how does ANOVA work? It compares the means and variations of the different groups (or levels) of the categorical variable and tests whether there is any significant difference between them. If yes, then we can say that there is an association (relationship) between the categorical predictor variable and the quantitative target variable; otherwise not.
Let me explain with an example. I have taken the Auto MPG dataset from the UCI Machine Learning Repository. This dataset contains information about different aspects of automobile design and performance, such as mileage, number of cylinders and engine displacement, for different car models. The research question I am trying to answer is: is the number of cylinders in a car (categorical predictor variable) associated with the mileage of the car (quantitative target variable)? Let’s now load this dataset and do some EDA.
EDA (Exploratory Data Analysis)
# Loading the libraries
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Loading the data (columns are separated by whitespace; only mpg and cylinders are read)
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions
data = pd.read_csv("auto-mpg.data", sep=r'\s+', header=None, usecols=[0, 1],
                   names=['mpg', 'cylinders'],
                   dtype={'mpg': np.float64, 'cylinders': 'category'})

# View the data
data.head(n=3)

# Summary of the variable mpg
data.describe()  # summarizes only the numerical variables in the dataset

# Summary of the variable cylinders
data['cylinders'].value_counts()
So, we can see that the categorical variable cylinders has five different levels: 3, 4, 5, 6 and 8 cylinders. Now let’s set up the hypotheses for this research question.
Null Hypothesis: There is nothing going on between the variables; there is no relationship between the two variables cylinders and mpg. In other words, knowing how many cylinders a car has does not help to predict its mileage, and the mean mpg is the same for all levels of the cylinders variable. In mathematical terms, with \mu_k denoting the mean mpg of cars with k cylinders:
H_0: \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_8
Alternate Hypothesis: There is something going on between the predictor and target variable; there is a relationship between the two. In other words, the number of cylinders in a car is related to its mileage, and at least one group mean of mpg differs from the other group means. But we don’t know which group mean is different; it might be the group with 3 cylinders, or 4 cylinders, or even 8 cylinders. In mathematical terms:
H_a: \mu_i \neq \mu_j for at least one pair of groups i, j
After we have set the hypotheses, we will now find the means for all the different groups of the categorical variable cylinders. We also check how far apart the group means are from each other; in other words, we measure the variation in the group means. If we observe that the group means are not close to each other, we have evidence against the null hypothesis, and vice versa. Let’s try to answer this question through data visualization.
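Before plotting, the group means can also be computed directly. The sketch below uses a tiny made-up data frame (the mpg values are hypothetical, not taken from the Auto MPG dataset) so that it runs on its own; with the real data you would apply the same `groupby` to `data`.

```python
import pandas as pd

# Hypothetical toy data: 3 cars each with 4, 6 and 8 cylinders
toy = pd.DataFrame({
    "cylinders": ["4", "4", "4", "6", "6", "6", "8", "8", "8"],
    "mpg": [28.0, 30.0, 32.0, 19.0, 20.0, 21.0, 13.0, 15.0, 14.0],
})

# Mean mpg for each level of the categorical variable
group_means = toy.groupby("cylinders")["mpg"].mean()

# Mean mpg over all cars, ignoring the groups
grand_mean = toy["mpg"].mean()

# How far each group mean lies from the overall mean
spread = group_means - grand_mean

print(group_means)
print(spread)
```

Large values in `spread` indicate group means that sit far from the overall mean, which is the kind of variation among group means the test looks at.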
Data visualization
# Visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=data['cylinders'], y=data['mpg'], showmeans=True)
plt.show()
From the above plot, it is clearly visible that the mean (the triangular marker) of the group with 8 cylinders is well separated from the other group means. This looks like evidence against the null hypothesis, suggesting that the variables are related to each other. But there is an important question to answer: “Are the differences among the group means due to true differences between the group means of the population, or are they merely due to sampling variability, that is, chance?”
To answer this question, we need to measure not only the variation among the group means but also the variation among the group means relative to the variation within the groups. (You can visualize the within-group variation by the length of the box of each group in the above boxplots: the longer the box, the greater the variability in that group, and vice versa.) This leads to the formula for the F-statistic:
F = \frac{\text{variation among the sample means}}{\text{variation within the groups}}
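One common way to make this ratio concrete is to divide the between-group mean square by the within-group mean square. The sketch below does that by hand with only the standard library. The function name `f_statistic` and the mpg values are made up for illustration.

```python
from statistics import mean


def f_statistic(groups):
    """Return the one-way ANOVA F-statistic for a list of numeric groups."""
    k = len(groups)                                # number of groups
    n_total = sum(len(g) for g in groups)          # total number of observations
    grand_mean = mean([x for g in groups for x in g])

    # Variation among the group means (between-group sum of squares)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within the groups (within-group sum of squares)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    # Divide each sum of squares by its degrees of freedom
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical mpg values for cars with 4, 6 and 8 cylinders
groups = [[28, 30, 32], [19, 20, 21], [13, 15, 14]]
f_value = f_statistic(groups)
print(round(f_value, 2))
```

For these toy groups the means (30, 20 and 14) are far apart while the values inside each group are close together, so the ratio comes out large (about 98).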
When the variation within the groups is large, the variation among the sample means becomes negligible in comparison. In that case, the data provide less evidence against the null hypothesis. Similarly, when the variation within the groups is small, the variation among the group means dominates. The data then provide strong evidence for a relationship between the predictor and target variables, and we can reject the null hypothesis. So a higher F-statistic means stronger evidence of a relationship between the variables.
But how high should the F-statistic be? The answer lies in the concept of the p-value. The p-value is defined as the probability of getting the observed F-statistic (from the ANOVA test), or a more extreme value, provided the null hypothesis is true. So, if we get a very small p-value, it means that such an F-statistic would be extremely rare if the null hypothesis were true, and we can reject the null hypothesis. (Occasionally this happens even though the null hypothesis is true; rejecting a true null hypothesis is known as a Type I error.) Generally, we take the cutoff for the p-value as 0.05 (a 5% significance level, which corresponds to a 95% confidence level). Let’s now do this in Python.
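To see how the p-value follows from the F-statistic, here is a small sketch that assumes SciPy is installed. It runs `scipy.stats.f_oneway` on made-up mpg values (not the real dataset) and then recomputes the p-value as the upper-tail probability of the F distribution with k - 1 and N - k degrees of freedom.

```python
from scipy import stats

# Hypothetical mpg values for cars with 4, 6 and 8 cylinders
four = [28, 30, 32]
six = [19, 20, 21]
eight = [13, 15, 14]

# One-way ANOVA: returns the F-statistic and its p-value
f_stat, p_value = stats.f_oneway(four, six, eight)

# The same p-value from its definition: probability of an F at least this
# large under the null hypothesis, with (k - 1, N - k) degrees of freedom
k, n_total = 3, 9
p_from_tail = stats.f.sf(f_stat, k - 1, n_total - k)

alpha = 0.05  # cutoff for the p-value
reject_null = p_value < alpha
print(f"F = {f_stat:.2f}, p = {p_value:.2e}, reject H0: {reject_null}")
```

The two p-values agree, which shows that the test is simply reading the observed F-statistic off the tail of the appropriate F distribution.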
ANOVA F-Test
# ANOVA F-test
model = smf.ols(formula='mpg ~ cylinders', data=data)
results = model.fit()
print(results.summary())
Don’t get overwhelmed by the above result. You only need to check the F-statistic and Prob (F-statistic) values. The F-statistic is very high at 172.6, with a very low p-value. So we can reject our null hypothesis and conclude that there is a relationship between the categorical predictor variable cylinders (number of cylinders in the car) and the quantitative target variable mpg (mileage of the car).
We now have strong evidence of a relationship between the two variables, but the analysis is not over yet. We have shown that the group means of mpg are not all equal across the levels of cylinders, but we have not analyzed which group means differ from which other group means. To find this out, we will need to do a post-hoc analysis. This is the agenda for my next article; stay tuned till then. Happy learning.
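If you prefer to read the two numbers programmatically instead of from the printed summary, the fitted statsmodels results object exposes them as the attributes `fvalue` and `f_pvalue`. The sketch below uses a small made-up data frame (hypothetical mpg values) so that it is self-contained; on the real data the same attributes are available on `results`.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: 3 cars each with 4, 6 and 8 cylinders
toy = pd.DataFrame({
    "cylinders": ["4", "4", "4", "6", "6", "6", "8", "8", "8"],
    "mpg": [28.0, 30.0, 32.0, 19.0, 20.0, 21.0, 13.0, 15.0, 14.0],
})

# C(...) tells the formula interface to treat cylinders as categorical
toy_results = smf.ols(formula="mpg ~ C(cylinders)", data=toy).fit()

f_stat = float(toy_results.fvalue)     # the F-statistic shown in the summary
p_value = float(toy_results.f_pvalue)  # the Prob (F-statistic) shown in the summary

print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```

With a single categorical predictor, the overall F-test of the regression is the same as the one-way ANOVA F-test, which is why fitting an OLS model works here.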