# Introduction

Before we get deep into transforming skewed data, let’s quickly talk around the normal distribution and skewness coefficient, or have a look at my previous post to have a detailed insight on different types of distribution of data here.

The normal distribution is a statistical concept that denotes the probability distribution of data which has a bell-shaped curve. That is, the normal distribution is symmetrical on both sides where mean, median, and mode are equal. The skewness coefficient of a normal distribution is 0 that can be used as a reference to measure the extent and direction of deviation of the distribution of a given data from the normal distribution.

A positively skewed data has a skewness of greater than 0, whereas the negatively skewed data has a skewness of lower than 0. In other words, the data points tend to concentrate around the lower values in a positively skewed data and the mean is greater than the median, where the opposite is true in a negatively skewed data.

Oftentimes the real-life data do not follow the normal distribution which undermines the validity of the parametric statistical models and inferences. Thus, some sort of modification is needed. That is where the data transformation comes in to make the distribution of the data as normal as possible so that the statistical inferences drawn from the data become more valid by reducing the skewness of the original data.

In this post, I try to cover the most common methods of transforming a skewed distribution into a normal distribution, and the foundational step you must consider before deciding which method to apply.

  library(data.table) # Enhanced data.frame
library(dplyr) # Data manipulation
library(tidyr) # Reshape the data between wide and long formats
library(ggplot2) # Data visualization
library(patchwork) # Re-arrange ggplot2 plots
library(moments) # Calculating skewness coefficient
library(gt) # Creating an object of class "gt" table

# Dealing with the positive skewness

The common methods for transforming positively skewed data to normal distribution are, but not limited to, square root transformation, cube root transformation, and log transformation.

I use the GDP per capita as an example. You can download the data from here.

Country NameIndicator Namevalue
1AfghanistanGDP per capita (current US$)502.1 2AngolaGDP per capita (current US$)2,973.6
3AlbaniaGDP per capita (current US$)5,352.9 4AndorraGDP per capita (current US$)40,886.4
5Arab WorldGDP per capita (current US$)6,580.1 6United Arab EmiratesGDP per capita (current US$)43,103.3
7ArgentinaGDP per capita (current US$)10,006.1 8..214 215VietnamGDP per capita (current US$)2,715.3
216VanuatuGDP per capita (current US$)3,058.1 217SamoaGDP per capita (current US$)4,315.9
218KosovoGDP per capita (current US$)4,417.5 219South AfricaGDP per capita (current US$)6,001.4
220ZambiaGDP per capita (current US$)1,291.3 221ZimbabweGDP per capita (current US$)1,464.0
Source: The World Bank
Last update: 2020-07-01.

Let’s illustrate the distribution of the above data to see what we are dealing with.

df <- select(gdp_pc, gdp_pc = value)

ggplot(df, aes(x = gdp_pc)) +
geom_histogram(aes(y = ..density..), fill = "coral", color = "gray", alpha = 0.8) +
geom_density() +
theme_bw() + labs(x = NULL, title = 'GDP Per Capita')

It certainly does not follow the normal distribution and is positively skewed and most of the data points tend to concentrate around the lower values. That is, most countries have a lower GDP per capita.
The extent of the skewness, however, is not known yet. To measure the skewness you can use skewness() function from the moments package.

skewness(df\$gdp_pc)
## [1] 2.096777

Since the skewness coefficient is greater than one, it denotes the data is highly skewed. Let’s transform the data into normal distribution using different methods.

1) Square root transformation: The square root transformation involves converting each value to its square root. That is, converting $${(x)}$$ to $${({x}^{1/2})}$$.
2) Cube root transformation: This method involves converting $${(x)}$$ to $${(x^{1/3})}$$.
3) Log transformation: Different types of logarithmic transformation can be applied, including $$x$$ to $${log_{10}x}$$; $$x$$ to $${log_{2}x}$$, $$x$$ to $${log_{e}x}$$ or $${ln}\ {x}$$.

df <- df %>%
mutate(sqrt_gdp_pc = sqrt(gdp_pc),
cbrt_gdp_pc = (gdp_pc)^(1/3),
log_gdp_pc = log(gdp_pc)
)

do.call(rbind, (lapply(df, skewness))) %>%
data.table(keep.rownames = T) %>%
mutate(Method = factor(case_when(rn %in% "gdp_pc" ~ 1, rn %in% "sqrt_gdp_pc" ~ 2, rn %in% "cbrt_gdp_pc" ~ 3, rn %in% "log_gdp_pc" ~ 4),
labels = c("Original data", "Square Root Transformation", "Cube Root Transformation", "Log Transformation"),
ordered = T)) %>%
gt() %>%
cols_hide(columns = vars(rn)) %>%
cols_move_to_start(columns = vars(Method)) %>%
# tab_row_group(group = "Methods to reduce skewness", rows = 2:4) %>%
# tab_row_group(group = "Original Dat", rows = 1) %>%
fmt_number(columns = vars(V1),decimals = 3) %>%
cols_label(V1 = "Skewness Coefficient") %>%
tab_style(style = cell_fill(color = "maroon", alpha = 1), locations = cells_body(columns = vars(V1), rows = V1 > 2)) %>%
tab_style(style = cell_fill(color = "maroon", alpha = 0.7), locations = cells_body(columns = vars(V1), rows = V1 > 1 & V1 < 2)) %>%
tab_style(style = cell_fill(color = "maroon", alpha = 0.5), locations = cells_body(columns = vars(V1), rows = V1 > 0.5 & V1 < 1)) %>%
tab_style(style = cell_fill(color = "maroon", alpha = 0.3), locations = cells_body(columns = vars(V1), rows = V1 < 0.5))
MethodSkewness Coefficient
Original data2.097
Square Root Transformation1.092
Cube Root Transformation0.737
Log Transformation−0.032

The square root transformation reduces the skewness coefficient remarkably from 2.1 to 1.09, however, it is still greater than 1. It is reduced to 0.74 by applying the cube root transformation, whereas the log transformation reduces the skewness to -0.032.

Let’s explore more the variability of the data.

do.call(cbind, (lapply(df, summary))) %>%
data.table(keep.rownames = T) %>%
gt(rowname_col = "rn") %>%
tab_spanner(label = "Methods to Reduce Skewness", columns = vars("sqrt_gdp_pc", "cbrt_gdp_pc", "log_gdp_pc")) %>%
tab_spanner(label = "Original Data", columns = vars("gdp_pc")) %>%
fmt_number(columns = ends_with("gdp_pc"), decimals = 3) %>%
cols_label(gdp_pc = "GDP per capita", sqrt_gdp_pc = "Square root transformation", cbrt_gdp_pc = "Cube root transformaiton", log_gdp_pc = "Log transformation") %>%
tab_style(style = cell_fill(color = "maroon", alpha = 1), locations = cells_body(columns = vars(gdp_pc))) %>%
tab_style(style = cell_fill(color = "maroon", alpha = 0.7), locations = cells_body(columns = vars(sqrt_gdp_pc))) %>%
tab_style(style = cell_fill(color = "maroon", alpha = 0.5), locations = cells_body(columns = vars(cbrt_gdp_pc))) %>%
tab_style(style = cell_fill(color = "maroon", alpha = 0.3), locations = cells_body(columns = vars(log_gdp_pc)))
## Warning: columns = vars(...) has been deprecated in gt 0.3.0:
## * please use columns = c(...) instead

## Warning: columns = vars(...) has been deprecated in gt 0.3.0:
## * please use columns = c(...) instead

## Warning: columns = vars(...) has been deprecated in gt 0.3.0:
## * please use columns = c(...) instead

## Warning: columns = vars(...) has been deprecated in gt 0.3.0:
## * please use columns = c(...) instead

## Warning: columns = vars(...) has been deprecated in gt 0.3.0:
## * please use columns = c(...) instead

## Warning: columns = vars(...) has been deprecated in gt 0.3.0:
## * please use columns = c(...) instead
Original DataMethods to Reduce Skewness
GDP per capitaSquare root transformationCube root transformaitonLog transformation
Min.261.24716.1636.3935.565
1st Qu.1,994.90644.66412.5897.598
Median6,564.92081.02418.7248.789
Mean14,536.95199.49520.4398.738
3rd Qu.17,401.722131.91625.9149.764
Max.114,704.594338.68148.58811.650

In the original data, the distance between the 1st quartile and the minimum value is 1733.7 and the distance between the 3rd quartile and the maximum value is 97302.9 prior to any kind of transformations. Nevertheless, it decreases at different levels after applying various forms of transformation. For instance, the difference between the 3rd quartile and the maximum value is 206.8 after the square root transformation, it is 22.7 after the cube root transformation, and it is 1.9 after the log transformation.

Another point of consideration is the distance between the mean and the median. The mean is 7972.03 points away from the median in the original data, whereas it is 18.5 after the square root transformation, 1.71 after the cube root transformation, and -0.05 after the log transformation.

Let’s make a quick visualization to better understand the shape of the distribution after applying each kind of above-mentioned methods.

df <- df %>%
pivot_longer(cols = everything(), names_to = "method", values_to = "values") %>%
mutate(method = factor(case_when(method == "gdp_pc" ~ 1, method == "sqrt_gdp_pc" ~ 2, method == "cbrt_gdp_pc" ~ 3, method == "log_gdp_pc" ~ 4),
labels = c("Original data", "Square root transformation", "Cube root transformation", "Log transformation")))

vline <- summarise(group_by(df, method), mean = mean(values), median = median(values))

df %>%
ggplot(aes(x = values)) +
geom_histogram(aes(y = ..density..), fill = "coral", alpha = 0.8, color = "gray") +
geom_density() + theme_bw() +
facet_wrap(~method, nrow = 2, scales = "free") +
geom_vline(data = vline, aes(xintercept = mean), linetype = "dotted") +
geom_vline(data = vline, aes(xintercept = median), linetype = "dashed") +
labs(x = NULL, caption = "Mean = Dotted line\nMedian = Dashed line") +
theme(strip.text = element_text(size = 15, face = "bold"))

As the above plots depict, the log transformation is pretty awesome. It reduces the skewness and improves the normality of the data more rapidly compared to the cube root or square root transformation.

# Dealing with the negative skewness

The important caveat in applying the above methods to reduce the skewness is the data should be skewed to the right side, otherwise those methods do not transform the distribution of the data to normal.

Let’s use the life expectancy data from the World Bank as an example. You can download the data from here.

Country NameIndicator Namevalue
1ArubaLife expectancy at birth, total (years)76.2
2AfghanistanLife expectancy at birth, total (years)64.5
3AngolaLife expectancy at birth, total (years)60.8
4AlbaniaLife expectancy at birth, total (years)78.5
5Arab WorldLife expectancy at birth, total (years)71.8
6United Arab EmiratesLife expectancy at birth, total (years)77.8
7ArgentinaLife expectancy at birth, total (years)76.5
8..236
237VanuatuLife expectancy at birth, total (years)70.3
238SamoaLife expectancy at birth, total (years)73.2
239KosovoLife expectancy at birth, total (years)72.2
240Yemen, Rep.Life expectancy at birth, total (years)66.1
241South AfricaLife expectancy at birth, total (years)63.9
242ZambiaLife expectancy at birth, total (years)63.5
243ZimbabweLife expectancy at birth, total (years)61.2
Source: The World Bank
Last update: 2020-07-01.

Below is the distribution of the above data which is moderately skewed to the left side with a skewness coefficient of -0.532.

df <- select(life_exp, life_exp = value)

ggplot(df, aes(x = life_exp)) +
geom_histogram(aes(y = ..density..), fill = "coral", color = "gray", alpha = 0.8) +
geom_density() +
theme_bw() + labs(x = NULL, title = 'Life expectancy')

One solution when dealing with the negatively skewed data is to reflect the values and apply the methods that are used to reduce the positive skewness on the reflected values. That is, subtract each value from a constant which is larger than the maximum value prior to transformation. The highest value in the above data is 84.9. Thus, I subtract each value from 86.

The point to notice after reflecting the data is that the extent of the skewness remains the same and the distribution of the data is as skewed as it was, though in the opposite direction.

Let’s apply the above methods to the reflected data.

df <- df %>%
mutate(reflected_life_exp = 86-life_exp,
sqrt_reflected_life_exp = sqrt(reflected_life_exp),
cbrt_reflected_life_exp = (reflected_life_exp)^(1/3),
log_reflected_life_exp = log(reflected_life_exp),
life_exp = NULL) %>%
pivot_longer(cols = everything(), names_to = "method", values_to = "values") %>%
mutate(method = factor(case_when(method == "reflected_life_exp" ~ 1, method == "sqrt_reflected_life_exp" ~ 2, method == "cbrt_reflected_life_exp" ~ 3, method == "log_reflected_life_exp" ~ 4),
labels = c("Reflected data", "Square root transformation", "Cube root transformation", "Log transformation")))

vline <- summarise(group_by(df, method), mean = mean(values), median = median(values))

df %>%
ggplot(aes(x = values)) +
geom_histogram(aes(y = ..density..), fill = "coral", alpha = 0.8, color = "gray") +
geom_density() + theme_bw() +
facet_wrap(~method, nrow = 2, scales = "free") +
geom_vline(data = vline, aes(xintercept = mean), linetype = "dotted") +
geom_vline(data = vline, aes(xintercept = median), linetype = "dashed") +
labs(x = NULL, caption = "Mean = Dotted line\nMedian = Dashed line") +
theme(strip.text = element_text(size = 15, face = "bold"))

Surprisingly, the square root and cube root transformation convert the distribution of the above data closer to the normal distribution compare to log transformation.

The important point to consider is that the values have been reflected. That is, the lower values represent high values and the higher values represent low values.

Other approaches to dealing with the negative skewness can be the inverse of the transformations that are used to reduce the positive skewness. For example, instead of computing square root convert $$x$$ to $$x^2$$ and instead of cube root convert $$x$$ to $$x^{3}$$.

# Conclusion

Different approaches reduce the extent of the skewness and boost the validity of the statistical inferences at different levels. Whatever method you use to transform the data into a normal distribution, the points you have to consider are a) never limit yourself to one standard method and examine several options before deciding which approach to use b) be cautious and always explore the math behind each approach. For example, let’s have a glance at the below graphs that illustrate the domain of log, square root, and cube root functions.

plot <- ggplot(data.frame(x = c(-25,25)), aes(x = x)) +
theme_bw() + geom_vline(xintercept = 0) + geom_hline(yintercept = 0)

sqrt <- plot + stat_function(fun = function(x){sqrt(x)}, color = 'maroon') + labs(title = "Square root")
cbrt <- plot + stat_function(fun = function(x){sign(x) * abs((x))^(1/3)}, color = 'maroon') + labs(title = "Cube root")  + xlim(c(-50, 50))
log <- plot + stat_function(fun = function(x){log10(x)}, color = 'maroon') + labs(title = "Log")

(log | sqrt) / (cbrt)

Since the domain for log is strictly greater than zero, it is reasonable to use log+1 instead of log transformation if the data have meaningful zero values. Furthermore, the square root is not defined for the negative numbers, thus you can consider adding a constant to convert the negative values to positive values prior to transformation or use cube root transformation instead of square root transformation. The kind of data you are working with, therefore, is literally foundational in deciding which method to use.