While this course does not focus on statistical analysis, a few crucial concepts are unavoidable. In case you don’t remember much about basic summary statistics, here are some definitions (and how to calculate them in R).
Let’s start with the three measures of central tendency: the mean, the median and the mode.
The mean is the sum of the values divided by the number of values you have. In R we can calculate it with the function mean(), which takes a numerical vector.
mean(c(1, 2, 3, 4, 5))
## [1] 3
(1 + 2 + 3 + 4 + 5) / 5
## [1] 3
# As you can see, these two chunks of code produce the same result:
mean(c(1, 2, 3, 4, 5)) == (1 + 2 + 3 + 4 + 5) / 5
## [1] TRUE
It is important to remember that the mean is strongly affected by outliers, that is, data points that differ significantly from the other observations. A typical example is the mean salary of the people you’ll find tonight in your favourite bar. Now imagine that Cristiano Ronaldo walks into the bar. The mean salary has just increased hugely, but only because Cristiano’s salary is (most likely) an outlier in your sample: the mean no longer tells you much about what a typical person in the bar earns.
The median is the central value of a data set, that is, the value found precisely between the higher half and the lower half of a data set (i.e. the middle value). The median is much less affected by outliers than the mean, so it’s always good to compare both measures.
In R we can calculate it with the function median(), which takes a numerical vector.
median(c(1, 2, 3, 4, 5))
## [1] 3
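To see the effect of an outlier on both measures, here is a minimal sketch of the bar example. The salary figures are entirely made up for illustration. Note also that with an even number of values, median() returns the mean of the two middle values.

```r
# Hypothetical yearly salaries of five people in the bar (invented numbers)
salaries <- c(24000, 28000, 30000, 32000, 36000)
mean(salaries)    # 30000
median(salaries)  # 30000

# The same bar after a (made-up) very high earner walks in
salaries_with_outlier <- c(salaries, 100000000)
mean(salaries_with_outlier)    # now in the millions
median(salaries_with_outlier)  # 31000, the mean of the two middle values
```

The mean jumps by several orders of magnitude, while the median barely moves.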
The mode is the value that appears most often in a data set. It is the only central tendency measure that applies to categorical data as well as to numerical data.
We’ll be calculating the mode using the tidyverse function count(), which counts how many tokens of every type (i.e. distinct value) there are. It applies to table columns, so this is a bit more advanced right now, but we’ll learn this soon enough, don’t worry! With the argument sort = TRUE, the rows are sorted from most to least frequent, so the mode will be the first value we get.
library(tidyverse)  # provides as_tibble(), %>% and count()
as_tibble(c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5)) %>% count(value, sort = TRUE)
## # A tibble: 5 x 2
## value n
## <dbl> <int>
## 1 2 4
## 2 3 3
## 3 1 2
## 4 4 2
## 5 5 1
as_tibble(c("blue", "blue", "blue", "blue", "green", "green", "green", "brown", "brown", "brown", "brown", "brown", "brown", "brown", "brown")) %>% count(value, sort = TRUE)
## # A tibble: 3 x 2
## value n
## <chr> <int>
## 1 brown 8
## 2 blue 4
## 3 green 3
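If you only want the mode itself rather than the whole frequency table, one possible base R sketch is shown below. The name stat_mode() is a hypothetical helper invented for this example (R has no built-in function for the statistical mode), and when several values are tied it simply returns the one that appears first in the data.

```r
# stat_mode() is a hypothetical helper name, not a built-in R function
stat_mode <- function(x) {
  distinct_values <- unique(x)                    # each value once, in order of appearance
  counts <- tabulate(match(x, distinct_values))   # how often each distinct value occurs
  distinct_values[which.max(counts)]              # first value with the highest count
}

numbers <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5)
colours <- c(rep("blue", 4), rep("green", 3), rep("brown", 8))

stat_mode(numbers)  # 2
stat_mode(colours)  # "brown"
```

Because the helper works on the distinct values themselves, it handles numerical and categorical vectors alike.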
In a data set of numerical values we have the lowest value (minimum), the highest one (maximum) and the difference between them (range), which gives us a nice idea of how broad our sample is.
In R, the functions min() and max() give us the minimum and maximum values, and range() gives us both values at once. If we want to calculate the difference between them, we can simply subtract the minimum from the maximum. All these functions take a numerical vector:
min(c(1, 2, 3, 4, 5))
## [1] 1
max(c(1, 2, 3, 4, 5))
## [1] 5
range(c(1, 2, 3, 4, 5))
## [1] 1 5
max(c(1, 2, 3, 4, 5)) - min(c(1, 2, 3, 4, 5))
## [1] 4
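As a small shortcut, the same difference can be obtained in one step by combining range() with the base R function diff(), which subtracts each element of a vector from the next one. This is just one way to write it, not the only one:

```r
x <- c(1, 2, 3, 4, 5)

# range() returns c(minimum, maximum); diff() subtracts the first from the second
spread <- diff(range(x))
spread  # 4
```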
Quartiles are cut points that divide the data set into four equally sized parts. There are three quartiles in a given data set: the 1st quartile, the 3rd quartile and… can you guess? If you’ve said “2nd quartile” to yourself, you’re right, but… the proper name for that is… the median! Think about it ;)
Quantile is the general term for the cut points that divide a distribution into any given number of equally sized groups. There’s always one fewer quantile than groups created, since they are the cut points between the groups.
The function quantile() returns the quartiles (together with the minimum and the maximum) by default. However, using the argument probs we can ask for different cut points. Check the result: the cut points under 0% and 100% are the minimum and maximum, the ones under 25% and 75% are the first and third quartiles, and the one under 50% is the median.
quantile(c(1, 2, 3, 4, 5))
## 0% 25% 50% 75% 100%
## 1 2 3 4 5
quantile(c(1, 2, 3, 4, 5), probs = seq(0, 1, 0.2))
## 0% 20% 40% 60% 80% 100%
## 1.0 1.8 2.6 3.4 4.2 5.0
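You may wonder why the 20% cut point is 1.8, a value that does not appear in the data. By default (type = 7), quantile() interpolates linearly between neighbouring sorted values. The following sketch reproduces that calculation by hand for a single cut point; the variable names are made up for this example, and the simple indexing assumes the cut point lies strictly below 100%.

```r
x <- c(1, 2, 3, 4, 5)  # already sorted
p <- 0.2               # the cut point we want to reproduce

# Position of the cut point along the sorted values (default method, type 7)
h <- 1 + (length(x) - 1) * p
lower <- floor(h)

# Linear interpolation between the two neighbouring values
manual_q <- x[lower] + (h - lower) * (x[lower + 1] - x[lower])
manual_q

# Compare with the built-in function
quantile(x, probs = p)
```

Here the position is 1.8, so the result lies 80% of the way between the first value (1) and the second value (2).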
The standard deviation is a measure of how dispersed our data are around the mean. Roughly speaking, it tells us how far away from the mean a data point typically is. Strictly, it is not the plain average distance to the mean: its formula is the square root of the average of the squared differences between each value and the mean, but you don’t need to memorise that (of course, you’re welcome to read it a few times until it makes sense!). Note that sd() treats the data as a sample, so it divides the sum of squared differences by the number of values minus one rather than by the number of values. In R we can calculate it with the function sd(), which takes a numerical vector:
mean(c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5))
## [1] 2.666667
sd(c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5))
## [1] 1.230915
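If you would like to see the formula at work, here is a minimal sketch that computes the standard deviation step by step and compares it with sd(). The variable names are invented for this example. The sum of squared differences is divided by the number of values minus one, which is what sd() does.

```r
x <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5)

# Squared difference of each value from the mean
squared_diffs <- (x - mean(x))^2

# Sum them, divide by (number of values - 1) and take the square root
manual_sd <- sqrt(sum(squared_diffs) / (length(x) - 1))
manual_sd

# The built-in function gives the same result
sd(x)
```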