Normality by Skewness and Kurtosis
The normal distribution plays an important role in many statistical and machine learning tools. Many of these tools assume that variables are sampled from a normal distribution, so it is important to test whether our dataset fits a normal distribution before using them. If it does not, we can decide whether to transform the data toward normality or switch to a different tool. In this post I share my recent experience with testing normality based on the skewness and kurtosis of my datasets. Along the way I review the following skills:
- Printing publication quality density plots of my feature variables with their best fitting normal distribution in R
- Getting skewness and kurtosis values for my feature variables in R
- Two normality tests based on skewness and kurtosis
Part 1. The Dataset
As a good practice, let's first take a look at our data. I am NOT going to do Exploratory Data Analysis (EDA) here; I am just plotting the density of each variable together with its best normal fit. The data frame contains the percentages of two specific land cover types (class 2 and class 3) and 8 socioeconomic variables, each expressed as the percentage of the population in that condition.

It is also good practice to look at the summary of our data.
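One way to get such a summary is base R's summary(). The sketch below runs on a small simulated stand-in, because the real rural_HVI_withworkers data frame is not reproduced here. The name toy_df, its two feature columns, and the Beta-distributed values are assumptions made purely for illustration.

```r
# Hypothetical stand-in for rural_HVI_withworkers: an ID column followed by
# proportion-valued features (simulated values, NOT the real data).
set.seed(42)
toy_df <- data.frame(
  id     = 1:200,
  class2 = rbeta(200, 2, 8),    # right-skewed proportions
  Over60 = rbeta(200, 20, 20)   # roughly symmetric proportions
)

# Min, quartiles, median, mean and max for every feature column
summary(toy_df[, -1])

# A mean clearly above the median is a first hint of right skew
mean_minus_median <- sapply(toy_df[, -1], function(x) mean(x) - median(x))
mean_minus_median
```

Comparing the mean with the median is only a rough indicator; the skewness values and tests later in the post quantify it properly.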

Part 2. Density Plots
We can now get plots of our data with their best fitting normal distribution.
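library(ggpubr)
library(gridExtra)
# Feature columns: every column except the first
cols_names <- colnames(rural_HVI_withworkers)[2:length(rural_HVI_withworkers)]
# Build one density plot per feature, overlaid with its best-fitting normal curve
a <- list()
for (i in cols_names) {
  a[[i]] <- ggdensity(rural_HVI_withworkers, x = i, fill = "lightgray") +
    scale_x_continuous(limits = c(-0.2, 1.2)) +
    stat_overlay_normal_density(color = "red", linetype = "dashed")
  print(i)  # show progress
}
# Arrange all plots in a single grid
do.call(grid.arrange, a)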

OK, it seems we have a mixed situation. class2 and class3 show some skewness, but is it acceptable for our purpose, or should we treat it? Education, language, Race_noWhite, and outdoor_worker show sharper peaks. These shapes map onto the standard kurtosis classes: leptokurtic (sharper peak and heavier tails than the normal), mesokurtic (normal-like), and platykurtic (flatter than the normal).

Part 3. Skewness and Kurtosis Values
Now let’s check the skewness and kurtosis for each variable.

And the values for kurtosis:
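Both sets of values can be computed in one pass with skewness() and kurtosis() from the moments package. This is a minimal sketch on simulated data, not the original analysis: toy_df and its columns are hypothetical stand-ins for rural_HVI_withworkers. Note that kurtosis() in moments returns Pearson's kurtosis, so a normal distribution scores about 3 rather than 0.

```r
library(moments)  # provides skewness() and kurtosis()

# Hypothetical stand-in for rural_HVI_withworkers (simulated, NOT the real data)
set.seed(42)
toy_df <- data.frame(
  id     = 1:200,
  class2 = rbeta(200, 2, 8),    # right-skewed proportions
  Over60 = rbeta(200, 20, 20)   # roughly symmetric proportions
)

features <- toy_df[, -1]  # drop the ID column

# Apply each moment function column by column
skew_vals <- sapply(features, skewness)
kurt_vals <- sapply(features, kurtosis)  # Pearson's kurtosis: normal is 3

# Collect both measures in one table, one row per variable
moment_table <- round(data.frame(skewness = skew_vals, kurtosis = kurt_vals), 3)
moment_table
```

Positive skewness means a longer right tail, and kurtosis above 3 means heavier tails than the normal distribution.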

Part 4. Normality Test based on Skewness and Kurtosis
There is no golden rule for the acceptable ranges of skewness and kurtosis. A better approach is to use formal statistical tests.
I used the Jarque–Bera test, jarque.test(x) in the moments package, and got the following results:

The null hypothesis of normality cannot be rejected for only two variables, Over60 and Over60_alone. The other 8 variables deviate significantly from a normal distribution.
There is also another test, the D’Agostino test for skewness. Let’s try that, too. I used agostino.test(x) in the moments package for this.
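For readers who want to run the same kind of test, here is a sketch of applying jarque.test() to every feature column and tabulating the outcome. The Jarque–Bera statistic combines both moments as JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the skewness and K the kurtosis, and is compared against a chi-squared distribution with 2 degrees of freedom. The data frame toy_df is a simulated, hypothetical stand-in for rural_HVI_withworkers, and the 0.05 threshold is an assumed significance level, not one stated in this post.

```r
library(moments)  # provides jarque.test()

# Hypothetical stand-in for rural_HVI_withworkers (simulated, NOT the real data)
set.seed(42)
toy_df <- data.frame(
  id     = 1:200,
  class2 = rbeta(200, 2, 8),    # right-skewed proportions
  Over60 = rbeta(200, 20, 20)   # roughly symmetric proportions
)

features <- toy_df[, -1]  # drop the ID column

# Run the Jarque-Bera test on each column; each result is an "htest" object
jb <- lapply(features, jarque.test)

# Pull the statistic and p-value out of each test result
jb_table <- data.frame(
  JB      = sapply(jb, function(t) unname(t$statistic)),
  p_value = sapply(jb, function(t) t$p.value)
)

# Assumed significance level of 0.05: TRUE means normality is rejected
jb_table$reject_normality <- jb_table$p_value < 0.05
jb_table
```

A small p-value rejects the null hypothesis that the variable is normally distributed; a large one only means the test found no evidence against normality.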

Now the outdoor_worker variable is also acceptable with respect to its skewness, which is what we would expect from the density plots. Still, seven other variables have skewness that differs significantly from that of a normal distribution.
Based on this analysis, I decided to use another model that does not require normality!
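The skewness-only check can be sketched the same way with agostino.test(). As before, toy_df is a simulated, hypothetical stand-in for rural_HVI_withworkers, and the 0.05 threshold is an assumed significance level. The test statistic is read by position (skewness first, then the z score) to avoid relying on element names.

```r
library(moments)  # provides agostino.test()

# Hypothetical stand-in for rural_HVI_withworkers (simulated, NOT the real data)
set.seed(42)
toy_df <- data.frame(
  id     = 1:200,
  class2 = rbeta(200, 2, 8),    # right-skewed proportions
  Over60 = rbeta(200, 20, 20)   # roughly symmetric proportions
)

features <- toy_df[, -1]  # drop the ID column

# D'Agostino skewness test per column (two-sided by default)
ag <- lapply(features, agostino.test)

# statistic[1] is the sample skewness, statistic[2] the transformed z score
ag_table <- data.frame(
  skew    = sapply(ag, function(t) unname(t$statistic[1])),
  z       = sapply(ag, function(t) unname(t$statistic[2])),
  p_value = sapply(ag, function(t) t$p.value)
)

# Assumed significance level of 0.05: TRUE means significant skewness
ag_table$skewed <- ag_table$p_value < 0.05
ag_table
```

Because this test looks at skewness alone, a variable can pass it and still fail the Jarque–Bera test on account of its kurtosis.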
That’s it. I hope this post has been useful for you. I would appreciate your thoughts, suggestions and any comments you might have for this post.
Babak