When we are dealing with a large data set, we want to have a good idea of how the data is distributed, in order to understand our sample better. Visualizing our data can be very helpful for this. Let’s discuss two types of plots that can be used to visualize a summary of our data when it consists of continuous values, i.e., numerical data that can take infinitely many values between any two given values, such as age, height, or time.
A histogram gives you an overview of the distribution of your data set. It only works with continuous data. Although it looks like a bar plot, it’s not one! In a histogram you can see how your data points are distributed across the range of values. The data is split into intervals (called bins), and the height of each bar shows how many data points fall into that bin.
Imagine we have a sample of 20 people and we have their ages, which I will save in a vector called ages:
ages <- c(22, 25, 36, 36, 38, 38, 45, 46, 46, 48, 52, 55, 55, 55, 58, 61, 67, 68, 72, 91)
If we set 10-year bins, the histogram will show us how many people there are in each bin. As you can see, there are 2 people (y-axis) in their twenties (x-axis), 5 people (y-axis) in their fifties (x-axis), etc.
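The counts behind such a 10-year histogram can be checked directly. The following is a minimal sketch in base R, assuming bins that start at age 20 (which would correspond to `binwidth = 10, boundary = 20` in the `geom_histogram()` calls used in this text); the names `bins` and `counts` are only illustrative.

```r
# Same sample of ages as in the text.
ages <- c(22, 25, 36, 36, 38, 38, 45, 46, 46, 48,
          52, 55, 55, 55, 58, 61, 67, 68, 72, 91)

# Assign each age to a 10-year interval: [20,30), [30,40), ..., [90,100).
bins <- cut(ages, breaks = seq(20, 100, by = 10), right = FALSE)

# Count how many people fall into each interval (empty intervals are kept).
counts <- table(bins)
print(counts)
```

Each entry of `counts` is the height of one bar in the histogram, so this is a quick way to verify what the plot shows.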
You can set the size of the bins, taking into account the size and distribution of your sample, because what you want from a histogram is a reasonable summary of your data. In the following histogram, each bin comprises one single year, but these bins are too small for our sample: there are way too many blank spaces.
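library(tidyverse)  # provides %>%, as_tibble_col() and ggplot2

ages %>%
as_tibble_col("ages") %>%
ggplot(aes(x = ages)) +
# one bin per year, with bin edges aligned to age 20
geom_histogram(fill = "darkgreen", binwidth = 1, boundary = 20) +
labs(title = "Distribution of the sample by age", x = "Age", y = "Number of people") +
theme_bw()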
In the following histogram, each bin comprises thirty years, which seems to be too large for our sample, since we don’t get to see any clear pattern.
ages %>%
as_tibble_col("ages") %>%
ggplot(aes(x = ages)) +
# one bin per thirty years, with bin edges aligned to age 20
geom_histogram(fill = "darkgreen", binwidth = 30, boundary = 20) +
labs(title = "Distribution of the sample by age", x = "Age", y = "Number of people") +
theme_bw()
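To make the effect of the bin width more concrete, the two choices above can be compared numerically. This is only a sketch in base R: `summarise_bins()` is a hypothetical helper written for this illustration (it is not part of any package), and it assumes that the bins start at age 20.

```r
# Same sample of ages as in the text.
ages <- c(22, 25, 36, 36, 38, 38, 45, 46, 46, 48,
          52, 55, 55, 55, 58, 61, 67, 68, 72, 91)

# Hypothetical helper: split a sample into equal-width bins starting at
# `start` and report the number of bins, the number of empty bins and the
# count in each bin.
summarise_bins <- function(x, width, start = 20) {
  breaks <- seq(start, max(x) + width, by = width)
  counts <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts
  list(n_bins = length(counts), n_empty = sum(counts == 0), counts = counts)
}

narrow <- summarise_bins(ages, width = 1)   # one-year bins
wide   <- summarise_bins(ages, width = 30)  # thirty-year bins

print(c(bins = narrow$n_bins, empty = narrow$n_empty))
print(wide$counts)
```

For this sample, most of the one-year bins are empty, while the thirty-year bins squeeze the whole sample into just three bars.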
Box plots are a visualization of many of the summary statistics we’ve already discussed in 3. Summary Statistics. They consist of a box (hence their name) and whiskers, and there may or may not be some dots outside the whiskers. A lot of information is encoded here, so let’s go through it slowly with the example below, which is a box plot of the same sample of ages discussed above.

The line within the box represents the median, which, as you can see, lies at 50 years old in our sample. Remember, this means that half of the sample is younger than the median and the other half is older. The bottom edge of the box marks the first quartile and the top edge marks the third quartile. This means that the box contains the central half of the sample.

The whiskers do not necessarily reach the maximum and the minimum of the sample. They are bounded by two cut points, and any values beyond them are considered rather extreme, that is, they behave very differently from most data points. These extreme values are called outliers and are represented as dots outside the whiskers’ range. The cut points are calculated using the interquartile range (IQR), which is the difference between the third and the first quartile (Q3 - Q1). The upper limit is calculated by adding 1.5 times the IQR to the third quartile (Q3 + 1.5 × IQR), while the lower limit is calculated by subtracting 1.5 times the IQR from the first quartile (Q1 - 1.5 × IQR). Each whisker stops at the most extreme data point that still lies within its limit.

Detecting outliers can be useful because they might be data points that contain errors, for instance. However, you must not remove outliers from the sample unless you’re sure that they are mistakes or you have very good theoretical reasons! Otherwise… you’re cheating yourself (and anyone who reads your work).
ages %>%
as_tibble_col("ages") %>%
ggplot(aes(y = ages)) +
geom_boxplot() +
labs(title = "Distribution of the sample by age") +
theme_bw()
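The numbers encoded in the box plot can also be computed by hand. The sketch below uses base R and assumes R’s default definition of quantiles in `quantile()`; other quantile definitions can give slightly different cut points. The variable names are only illustrative.

```r
# Same sample of ages as in the text.
ages <- c(22, 25, 36, 36, 38, 38, 45, 46, 46, 48,
          52, 55, 55, 55, 58, 61, 67, 68, 72, 91)

# Quartiles (assumption: R's default quantile definition).
q1  <- unname(quantile(ages, 0.25))
med <- median(ages)
q3  <- unname(quantile(ages, 0.75))
iqr <- q3 - q1

# Cut points beyond which a value counts as an outlier.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Values beyond the cut points, and the points where the whiskers end.
outliers     <- ages[ages < lower_limit | ages > upper_limit]
whisker_low  <- min(ages[ages >= lower_limit])
whisker_high <- max(ages[ages <= upper_limit])

print(c(q1 = q1, median = med, q3 = q3, iqr = iqr))
print(c(lower_limit = lower_limit, upper_limit = upper_limit))
print(outliers)
```

Under this definition, the only value beyond the cut points in our sample is 91, so it is the one data point that would be drawn as a dot above the upper whisker.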
Violin plots convey pretty similar information to box plots. The whole violin encompasses the whole range of the data. It is divided into four parts by three segments, which are… the three quartiles! But, unlike in box plots, the shape of each part of the violin also represents the distribution of the data within that part of the sample: this is called the density distribution (which has been smoothed, by the way). The following violin plot visualizes our vector ages.
ages %>%
as_tibble_col("ages") %>%
mutate(sample = "people") %>%
ggplot(aes(y = ages, x = sample)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
labs(title = "Distribution of the sample by age") +
theme_bw()
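The smoothed density that gives the violin its shape can be approximated with a kernel density estimate. This is a minimal sketch using base R’s `density()` with its default Gaussian kernel and bandwidth; whether the violin plot above uses exactly the same settings is an assumption, so the curve should be read as an illustration of the idea rather than a reproduction of the plot.

```r
# Same sample of ages as in the text.
ages <- c(22, 25, 36, 36, 38, 38, 45, 46, 46, 48,
          52, 55, 55, 55, 58, 61, 67, 68, 72, 91)

# Kernel density estimate (assumption: default kernel and bandwidth).
dens <- density(ages)

# The estimate is evaluated on a grid: dens$x are ages, dens$y the density.
peak_age <- dens$x[which.max(dens$y)]

# The area under a density curve should be approximately 1.
area <- sum(dens$y) * (dens$x[2] - dens$x[1])

print(peak_age)
print(area)

# plot(dens) would draw the curve; a violin mirrors this kind of curve
# on both sides of a central axis.
```

The wider the violin at a given age, the higher the estimated density there, that is, the more observations lie around that age.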
Dot plots give you the whole distribution: each dot can be a single observation (although it can also be a bin with several observations). Violin and dot plots give you very similar information: while the former are more synthetic, the latter are more exact. You should choose the one that fits your purpose and audience better! Keep in mind that you don’t get quartile information with dot plots! The following dot plot visualizes our vector ages (each dot represents a single observation).
ages %>%
as_tibble_col("ages") %>%
mutate(sample = "people") %>%
ggplot(aes(y = ages, x = sample)) +
geom_dotplot(binwidth = 1, binaxis = "y", stackdir = "center") +
labs(title = "Distribution of the sample by age") +
theme_bw()