Measures of Dispersion

Introduction

In the previous post I described the measures of central tendency. But the central tendency is not the only thing you can tell interesting facts about the data and is not the only way by which you can get to know about the concentration of the data. In this post, you will learn about the measures of dispersion as part of the descriptive statistics.

As the name suggests, the measures of dispersion show the extent of variability and the scattering of the data points. The main idea of the measures of dispersion is to get to know how the data are spread and how much the data points vary from the average value.

Two distinct sets of data may have the same central value, but a completely different level of variation. Therefore, an adequate description of the data should include both of these characteristics. In other words, the combination of measures of the central tendency and measures of dispersion help to understand the distribution of the data.

A measure of dispersion is zero if all the data points are the same and increases as the data become more diverse. There are mainly two types of measures of dispersion.

  • Absolute measures of dispersion
  • Relative measures of dispersion

Absolute measures of dispersion

Absolute measures of dispersion express the scattering of the data points in terms of distance such as range or in terms of deviation from the central value such as variance and standard deviation.

Range: Range is defined as the difference between the smallest and the largest value in a set of data. The range is easy to compute; however, it is influenced by extreme values. Therefore, it is not a reliable measure of dispersion.

\[Range = X_{max} - X_{min}\]

Quartile deviation: Quartile deviation is defined as half of the distance between the first and third quartile1. Quartile deviation is not influenced by extreme values. However, its demerit is that it ignores 50% of the data. Therefore, variance and standard deviation are suggested as the most reliable measures of dispersion.

\[Quartile \space deviation = \frac{Q_{3}-Q_{1}}{\ 2}\]

Variance and Standard Deviation2: These measures of dispersion tell you how much spread out the data points are from the mean. To find out the variance, deduct each value from the mean, square it, sum each square, and divide it by the total number of values.

\[Variance = \frac{∑(x-\bar{x})^2}{\ {n-1}}\]

Standard deviation is the square root of the variance. In the asymmetrical distribution, 68.25% of data points fall between mean ± 1s.d; 95.45% of data points fall between mean ± 2s.d; 99.73% of the data points fall between mean ± 3s.d.

Mathematically, the standard deviation can be expressed as below:

\[{Standard \space deviation} = {\sqrt\frac{∑(x-\bar{x})^2}{\ {n-1}}}\]

No panic! In R, you can easily compute the range, quartile deviation, variance, and standard deviation. Suppose you have the weekly expenditures of two projects over 10 weeks.

## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
##    Projects Expenditures
## 1  project1        10000
## 2  project1        15400
## 3  project1        14250
## 4  project1        13000
## 5  project1        11250
## 6  project1        10450
## 7  project1         9035
## 8  project1        12500
## 9  project1        14125
## 10 project1        11240
## 11 project2        10500
## 12 project2        15000
## 13 project2        14300
## 14 project2        12500
## 15 project2        11300
## 16 project2        10500
## 17 project2         8530
## 18 project2        12500
## 19 project2        14120
## 20 project2        11320

The below functions can be used to compute the measures of dispersion.

library(tidyverse)

df %>% 
  group_by(Projects) %>% 
  summarise(Range = max(Expenditures) - min(Expenditures),
            'Quartile Deviation' = IQR(Expenditures)/2,
            Variance = var(Expenditures),
            'Standard Deviation' = sd(Expenditures)) %>% 
  kable()
ProjectsRangeQuartile DeviationVarianceStandard Deviation
project163651598.12542850782070.043
project264701507.50040828012020.594

As the above table shows, based on Range as a measure of dispersion that includes only minimum and maximum values, the data points in the second group (project2) are more scattered while based on the Standard deviation the data points in that group are less scattered3.

Relative measures of dispersion

For comparing data among two or more than two groups that differ significantly in their averages, and for unit free comparison the relative measures of dispersion are used which is known as the coefficient of dispersion (C.D).

Coefficient of dispersion in terms of range: C.D in terms of range is the distance between the minimum value and maximum value divided by sum of the minimum and maximum values.

\[{C.D\space in\space terms\space of\space range} = {\frac{X_{max} - X_{min}}{\ X_{max} + X_{min}}}\]

Coefficient of dispersion in terms of quartile deviation: C.D in terms of quartile deviation is the distance between first quartile and third quartile divided by the sum of the first and third quartiles.

\[{C.D\space in\space terms\space of\space quartile \space deviation} = {\frac{Q_{3} - Q_{1}}{\ Q_{3} + Q_{1}}}\]

Coefficient of dispersion in terms of standard deviation: C.D in terms of standard deviation is defined as the standard deviation divided by the mean.

\[{C.D\space in\space terms\space of\space S.D} = {\frac{S.D}{\bar{X}}}\]

Coefficient of Variation (C.V): 100 times the coefficient of dispersion based on the standard deviation is the coefficient of variation.

\[C.V = 100 * \frac{S.D}{\bar{X}}\]

Let’s find the relative measures of dispersion for the above data.

library(raster)

df %>% 
  group_by(Projects) %>% 
  summarise('C.D in terms of range' = (max(Expenditures)-min(Expenditures)/max(Expenditures)+min(Expenditures)),
            'C.D in terms of standard deviation' = sd(Expenditures)/mean(Expenditures),
            # 'Coefficient of Variation' = 100 * sd(Expenditures)/mean(Expenditures),
            'Coefficient of Variation' = cv(Expenditures)) %>% 
  kable()
ProjectsC.D in terms of rangeC.D in terms of standard deviationCoefficient of Variation
project124434.410.170725217.07252
project223529.430.167586816.75868

  1. Quartiles are values that divide the data into quarters. The first quartile (Q1) is the middle number between the smallest number and the median of the data. The second quartile, (Q2) is the median of the data set. The third quartile (Q3) is the middle number between the median and the largest number.

  2. In R, the var() and sd() functions compute the sample variance and sample standard deviation. Therefore, the n-1 is used in the denominator.

  3. Standard deviation as measure of dispersion to compare variability among two groups should be used only when both groups have the same central value. When the central value of both groups differ widely, the coefficient of dispersion in terms of standard deviation or coefficient of variance should be used.

Related