Normality by Skewness and Kurtosis

Babak Fard
4 min read · Nov 21, 2020


The normal distribution plays an important role in many statistical and machine learning tools. Many of these tools assume that the variables are samples from a normal distribution, so it is important to test whether a dataset fits a normal distribution before using them. If it does not, we can either transform the data toward normality or switch to a different tool. In this post I share my recent experience testing normality based on the skewness and kurtosis of my datasets. Along the way I review the following skills:

  • Printing publication-quality density plots of my feature variables, with their best-fitting normal distribution, in R
  • Getting skewness and kurtosis values for my feature variables in R
  • Two normality tests based on skewness and kurtosis

Part 1. The Dataset

As good practice, let’s first take a look at our data. I am NOT going to do Exploratory Data Analysis (EDA) here; I am just plotting the density of each variable together with its best normal fit. The data frame contains the percentages of specific land cover types (class 2 and class 3) and 8 socioeconomic variables, each expressed as the percentage of the population in that condition.

A scheme of our data frame sized 261x10

It is also good practice to take a look at the summary of our data.

Summary of data, all standardized in range [0,1)
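
Since the real data frame is not included in this post, here is a minimal sketch of the same inspection step on a simulated stand-in. The data frame `toy` and all its values are made up for illustration; only the column names are borrowed from the variables discussed in this post.

```r
# Toy stand-in for the real data frame: every value here is simulated.
set.seed(42)
toy <- data.frame(
  id             = seq_len(261),
  class2         = rbeta(261, 2, 5),     # right-skewed share in (0, 1)
  Over60         = rbeta(261, 20, 20),   # roughly symmetric share in (0, 1)
  outdoor_worker = rbeta(261, 1.5, 8)    # strongly right-skewed share
)

str(toy)          # column types and the first few values
summary(toy[-1])  # min, quartiles, mean and max of each feature column
```

Dropping the first column with `toy[-1]` mirrors the plotting code in Part 2, which also skips the first column of the data frame.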

Part 2. Density Plots

We can now get plots of our data with their best fitting normal distribution.

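library(ggpubr)     # ggdensity(), stat_overlay_normal_density()
library(gridExtra)  # grid.arrange()

# All feature columns (the first column is skipped)
cols_names <- colnames(rural_HVI_withworkers)[2:length(rural_HVI_withworkers)]

# One density plot per variable, overlaid with its best-fitting normal curve
a <- list()
for (i in cols_names) {
  a[[i]] <- ggdensity(rural_HVI_withworkers, x = i, fill = "lightgray") +
    scale_x_continuous(limits = c(-0.2, 1.2)) +
    stat_overlay_normal_density(color = "red", linetype = "dashed")
  print(i)
}

# Arrange all plots on a single grid
do.call(grid.arrange, a)

Density plots of the variables with their best normal fit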

OK, it seems we have a mixed situation. class2 and class3 show some skewness, but is it acceptable for our purpose, or should we treat it? Education, language, Race_noWhite, and outdoor_worker show sharper peaks. These shapes map well onto the known kurtosis classes shown below.

Kurtosis distribution classes. source here

Part 3. Skewness and Kurtosis Values

Now let’s check the skewness and kurtosis of each variable.

Skewness values

and the values for Kurtosis:

Kurtosis values
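
For readers who want to see what these numbers measure, here is a small base-R sketch of the moment-based definitions. The helper names `skew` and `kurt` and the simulated data are my own assumptions for illustration; as far as I know, skewness() and kurtosis() in the moments package use the same definitions, with kurtosis reported so that a normal distribution scores 3 (not 0).

```r
# Moment-based skewness: third central moment over variance^(3/2).
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based kurtosis: fourth central moment over variance^2.
# A normal distribution gives about 3 (this is not "excess" kurtosis).
kurt <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Simulated example: one symmetric and one right-skewed variable.
set.seed(1)
toy <- data.frame(symmetric = rnorm(261), right_skewed = rexp(261))

sapply(toy, skew)  # one skewness value per column
sapply(toy, kurt)  # one kurtosis value per column
```

Applying the helpers with sapply() gives one value per column, which is the shape of the tables shown above.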

Part 4. Normality Test based on Skewness and Kurtosis

There is NO golden rule about the acceptable ranges of skewness and kurtosis. A more reliable approach is to use statistical tests:

I used the Jarque–Bera test, which jointly checks whether the skewness and kurtosis match those of a normal distribution, via jarque.test(x) in the moments package. The results are:

Results of Jarque-Bera test
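
To make the test less of a black box, here is a base-R sketch of the Jarque–Bera statistic computed by hand: JB = n/6 × (S² + (K − 3)²/4), compared against a chi-squared distribution with 2 degrees of freedom. The function name `jb_by_hand` and the simulated inputs are assumptions for illustration; this is not the code of jarque.test() itself.

```r
# Jarque-Bera by hand: combines skewness S and kurtosis K into one statistic.
jb_by_hand <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  d <- x - mean(x)
  S <- mean(d^3) / mean(d^2)^1.5          # sample skewness
  K <- mean(d^4) / mean(d^2)^2            # sample kurtosis (normal = 3)
  JB <- n / 6 * (S^2 + (K - 3)^2 / 4)
  # Under the null hypothesis of normality, JB is asymptotically chi-squared(2)
  p <- pchisq(JB, df = 2, lower.tail = FALSE)
  c(JB = JB, p.value = p)
}

set.seed(1)
jb_by_hand(rexp(261))   # clearly non-normal: expect a very small p-value
jb_by_hand(rnorm(261))  # simulated normal data for comparison
```

A small p-value means the null hypothesis of normality is rejected, which is how the results above should be read.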

Only for two variables, Over60 and Over60_alone, can the null hypothesis of normality not be rejected. The other 8 variables therefore deviate significantly from a normal distribution.

There is also another test, the D’Agostino test for skewness. Let’s try that, too. I used agostino.test(x) in the moments package for this.

Results of the D’Agostino test of skewness
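
For the curious, here is a base-R sketch of the idea behind the D’Agostino skewness test: the sample skewness is transformed into a score that is approximately standard normal when the data are normal. I wrote it from the textbook formulas, the function name `dagostino_skew` is hypothetical, and I have not verified that it reproduces agostino.test() digit for digit, so treat it as an illustration only.

```r
# Sketch of D'Agostino's skewness test (two-sided), from textbook formulas.
# Not guaranteed to match moments::agostino.test() exactly.
dagostino_skew <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  stopifnot(n >= 8)                        # the approximation needs n >= 8
  d  <- x - mean(x)
  g1 <- mean(d^3) / mean(d^2)^1.5          # sample skewness
  # Variance and excess kurtosis of g1 under normality
  mu2    <- 6 * (n - 2) / ((n + 1) * (n + 3))
  gamma2 <- 36 * (n - 7) * (n^2 + 2 * n - 5) /
    ((n - 2) * (n + 5) * (n + 7) * (n + 9))
  W2    <- sqrt(2 * gamma2 + 4) - 1
  delta <- 1 / sqrt(log(sqrt(W2)))
  alpha <- sqrt(2 / (W2 - 1))
  z <- delta * asinh(g1 / (alpha * sqrt(mu2)))  # approx. N(0, 1) under H0
  c(skew = g1, z = z, p.value = 2 * pnorm(-abs(z)))
}

set.seed(1)
dagostino_skew(rexp(261))   # right-skewed data: expect a very small p-value
```

Unlike the Jarque–Bera test, this one looks at skewness only, so a variable with a sharp peak but a symmetric shape can pass it.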

Now the outdoor_worker variable is also acceptable with respect to its skewness, which is what we would expect from its density plot. Still, the seven other variables are skewed too strongly to be accepted as normal.

Based on this analysis, I decided to use another model that does not require normality!

That’s it. I hope this post has been useful for you. I would appreciate your thoughts, suggestions and any comments you might have for this post.

Babak
