Before going directly to measures of central tendency, I want you to look at the data below. This is a table of weekly expenditures of two projects over the course of 10 weeks. Now tell me which one is cost-efficient than the other, or both are the same?
project1 | project2 |
---|---|
10000 | 10500 |
15400 | 15000 |
14250 | 14300 |
13000 | 12500 |
11250 | 11300 |
10450 | 10500 |
9035 | 9030 |
12500 | 12500 |
14125 | 14120 |
11240 | 11320 |
What if I show you this?
projects | average |
---|---|
project1 | 12125 |
project2 | 12107 |
Using the average, it is easy to compare the expenditures of both projects. This is one of the primary purposes of the measures of central tendency to summarize the data into a single number that represents the center point of the data.
Measures of central tendency show the point where most values of a the data fall and represent the tendency of the data to cluster around a middle value using different approaches. Selecting the appropriate method, however, depends on the type of data you are dealing with. In this post, you will learn when to use a particular measure of central tendency and how to calculate it using R.
As mean is susceptible to the presence of outliers1, it loses its ability to provide the best central location when the data is skewed. As the above figures show that mean is dragged in the direction of the skew. Thus, using only the mean to approximate the center of the data can often be misleading.
In R, the mean()
function can be used to
calculate the arithmetic mean. It takes a vector of values as
input and returns the average.
# use (na.rm = TRUE) if there is any missing value in the list of values
mean(c(10000, 15400, 14250, 13000, 11250, 10450, 9035, 12500, 14125, 11240), na.rm = FALSE)
## [1] 12125
The median()
function can be used to compute the
median. It takes a vector of values as input and returns the
value that is occurred most frequently.
# use (na.rm = TRUE) if there is any missing value in the list of values
median(c(10000, 15400, 14250, 13000, 11250, 10450, 9035, 12500, 14125, 11240), na.rm = FALSE)
## [1] 11875
To the best of my knowledge, there is not any built-in function in R to find out mode in a vector of values. Therefore, below two steps must be followed:
table()
function.which.max()
function to find the mode.Also, you can use below code to program a new function called Mode to find the mode in a given vector of values.
Mode <- function(x) {
as.character(
pull(
filter(
data.frame(
table(x)
),
Freq == max(Freq)),
x)
)
}
Mode(c(2, 4, 7, 2, 3, 2, 5, 7, 9))
## [1] "2"
The following table shows what the best measure of central tendency is with respect to different types of variables.
Types of the variable | The best measure of central tendency |
---|---|
Nominal | Mode |
Ordinal | Median / Mode |
Continuous (not skewed) | Mean |
Continuous (skewed) | Median |
Outliers are values that are notably different from the rest of the data. Usually, any value larger than 1.5IQR above the third quartile or lower than 1.5IQR below the first quartile is considered as outlier.↩︎