Tidy data

Cleaning and preparing data is a huge part of data analysis, and a crucial part of this process has to do with the structure of the data set. In this course we are going to work with the tidyverse, which is a collection of R packages centered around the notion of tidy data. Tidy data refers to a standard way of organising tabular data. In the words of its creators: “The principles of tidy data provide a standard way to organise data values within a data set”. The main idea of this standard is to link the structure of a data set (its physical layout) with its semantics (its meaning). In the tidyverse, tables are called tibbles. (I guess that’s because they are tidy tables!)

Data structure

When we talk about data structure, we mean columns and rows, since we are dealing with tabular data. Columns are typically labeled; rows may be labeled too, but need not be.

Data semantics

When talking about data semantics, there are three main concepts we need to keep in mind: observations, variables and values. A data set is a collection of values, which might be quantitative (numbers) or categorical (not numbers). Every value belongs to a variable and to an observation. A variable contains all values of the same attribute. For instance, grammatical category might be a variable, with different possible categorical values (noun, verb, adjective…). Or we could have a variable with the number of syllables, with different possible numerical values (1, 2, 3…). An observation contains all values measured on the same unit. Depending on our study, the unit might be a word (imagine we want to investigate what kinds of words appear in a given text), a sentence (if we wanted to analyse differential object marking in a corpus, for instance), a person (if we want to analyse a sample of people according to their sociological characteristics), etc.
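
As a small illustration, here is a sketch of these three concepts in R. The table is a toy example I made up for this purpose (the words and their values do not come from a real corpus): each row is an observation (a word), each column is a variable, and each cell holds one value.

```r
library(tibble)

# Hypothetical toy data: one observation (row) per word.
words <- tibble(
  word      = c("house", "beautiful", "run", "table"),
  category  = c("noun", "adjective", "verb", "noun"),  # categorical variable
  syllables = c(1, 3, 1, 2)                            # quantitative variable
)

words

nrow(words)  # number of observations
ncol(words)  # number of variables
```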

Tidy data: linking structure and semantics

Tidy data follows three easy principles to link data structure and data semantics.

  1. Each variable forms a column (and only one).
  2. Each observation forms a row (and only one).
  3. Each value appears in one cell.

For instance, let’s take a look at the iris data set, which comes preloaded in R. This data set contains information about 150 flowers from three different species of iris. Each flower is an observation, which means that all the information coded for one flower (i.e. each of its values) appears in one and the same row. The information coded for each flower corresponds to 5 different variables: sepal length, sepal width, petal length, petal width and species. As you can see, each variable has its own column. This is a tidy data set!

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica
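
For contrast, here is a small sketch of a table that is not tidy, and one possible way to tidy it. The data are invented for illustration (two hypothetical authors and the number of texts each published in two years). The problem is that the years, which are values of a variable, are used as column names, so each row holds two observations. I'm assuming the tidyr package here, whose pivot_longer() function moves those values into a column of their own.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the values of the variable "year"
# are hidden in the column names.
untidy <- tibble(
  author = c("author_a", "author_b"),
  `2019` = c(3, 1),
  `2020` = c(2, 4)
)

# Tidy version: one row per author and year, one column per variable.
tidy <- untidy %>%
  pivot_longer(
    cols      = c(`2019`, `2020`),
    names_to  = "year",
    values_to = "n_texts"
  )

tidy
```

After tidying, the table has three variables (author, year and n_texts) and one row for each combination of author and year.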

There is a fourth principle in tidy data, which says that each type of observational unit forms a table. For instance, imagine a very common scenario in linguistics: you have a corpus of texts by different authors and want to analyse all transitive sentences in these texts. You want to know the verb they have, the verb tense and the subject’s person. Each transitive sentence is an observation and the three parameters you’re analysing are your three variables. But maybe you suspect that the characteristics of your sentences might differ from text to text, because they have different authors, from different places, and were written in different years. You could add three columns with this information or, better, create a new table, where each text is an observation (=row) and the author, the author’s birthplace and the year of the text are the variables. Of course, if you want this information to be useful, both tables should share a variable, ideally in the first column: an identifier of the text!
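
The two-table layout described above could look roughly like this. This is only a sketch: all column names (sentence_id, text_id, subj_person…) and all values are hypothetical, chosen just to show the shape of the tables.

```r
library(tibble)

# Table 1: one row per transitive sentence (hypothetical values).
sentences <- tibble(
  sentence_id = c("s_1", "s_2", "s_3"),
  text_id     = c("t_1", "t_1", "t_2"),  # the text each sentence comes from
  verb        = c("see", "eat", "write"),
  tense       = c("past", "present", "past"),
  subj_person = c(3, 1, 3)
)

# Table 2: one row per text (hypothetical values).
texts <- tibble(
  text_id    = c("t_1", "t_2"),
  author     = c("Author A", "Author B"),
  birthplace = c("Place A", "Place B"),
  year       = c(1950, 1987)
)

# Every text identifier used in `sentences` should exist in `texts`.
all(sentences$text_id %in% texts$text_id)
```

Notice that the information about each text is stored only once, no matter how many sentences come from it.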

Key columns

And this idea of identifiers brings us to a very useful concept when creating databases: the key column. A key column is used to uniquely identify rows in a data frame, that is, key values (i.e., values in the key column) must be different in each row. Usually, these are some kind of identifier.
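
One way to check whether a column can work as a key is to compare its number of distinct values with the number of rows. Here is a minimal sketch, assuming a small made-up table (flowers and its ID column are hypothetical names) and the n_distinct() function from dplyr.

```r
library(dplyr)
library(tibble)

# Hypothetical table with an ID column that should act as a key.
flowers <- tibble(
  ID      = c("f_1", "f_2", "f_3"),
  species = c("setosa", "setosa", "virginica")
)

# A key column has as many distinct values as the table has rows.
n_distinct(flowers$ID) == nrow(flowers)

# species has repeated values, so it cannot serve as a key.
n_distinct(flowers$species) == nrow(flowers)
```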

I highly recommend that you add a column with unique identifiers for your observations to all your tables. Chances are this will come in handy. For instance, it helps if you need to compare tables (for example, previous versions of the same data frame, which happens often if you’re analysing large quantities of data).
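
To give an idea of how identifiers help when comparing versions, here is a sketch with two made-up versions of the same table (version_1 and version_2 are hypothetical). I'm using anti_join() from dplyr, which returns the rows of the first table that have no match in the second one.

```r
library(dplyr)
library(tibble)

# Two hypothetical versions of the same table.
version_1 <- tibble(
  ID       = c("w_1", "w_2", "w_3"),
  category = c("noun", "verb", "noun")
)
version_2 <- tibble(
  ID       = c("w_1", "w_3", "w_4"),
  category = c("noun", "noun", "adjective")
)

# Rows of version 1 whose ID no longer appears in version 2.
removed <- anti_join(version_1, version_2, by = "ID")

# Rows of version 2 whose ID was not in version 1.
added <- anti_join(version_2, version_1, by = "ID")

removed
added
```

Without the ID column, you would have to compare whole rows, and any corrected value would look like a deleted row plus a new one.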

Moreover, key columns are very useful when combining tables. In our example above, the column with the text identifier in the second table (the one whose observations are the texts) is a key column. If you’ve also added the text identifier to your table with the sentences, you will be able to join your two tables easily whenever you want to bring all the information together, thanks to your key column!
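
Here is a minimal sketch of such a join, assuming two small made-up tables that share a hypothetical text_id column. left_join() from dplyr keeps every sentence and adds the matching information about its text.

```r
library(dplyr)
library(tibble)

# Hypothetical table of sentences; text_id points to the text.
sentences <- tibble(
  sentence_id = c("s_1", "s_2", "s_3"),
  text_id     = c("t_1", "t_1", "t_2"),
  verb        = c("see", "eat", "write")
)

# Hypothetical table of texts; text_id is its key column.
texts <- tibble(
  text_id = c("t_1", "t_2"),
  author  = c("Author A", "Author B"),
  year    = c(1950, 1987)
)

# Add the author and year of the text to each sentence.
sentences_full <- left_join(sentences, texts, by = "text_id")

sentences_full
```

Because text_id is unique in texts, each sentence matches exactly one text and the joined table still has one row per sentence.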

Let’s see what iris looks like with a key column:

library(tidyverse)  # provides %>%, mutate() and str_c()
iris %>%
  mutate(ID = str_c("iris_", 1:150))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species       ID
## 1            5.1         3.5          1.4         0.2     setosa   iris_1
## 2            4.9         3.0          1.4         0.2     setosa   iris_2
## 3            4.7         3.2          1.3         0.2     setosa   iris_3
## 4            4.6         3.1          1.5         0.2     setosa   iris_4
## 5            5.0         3.6          1.4         0.2     setosa   iris_5
## 6            5.4         3.9          1.7         0.4     setosa   iris_6
## 7            4.6         3.4          1.4         0.3     setosa   iris_7
## 8            5.0         3.4          1.5         0.2     setosa   iris_8
## 9            4.4         2.9          1.4         0.2     setosa   iris_9
## 10           4.9         3.1          1.5         0.1     setosa  iris_10
## 11           5.4         3.7          1.5         0.2     setosa  iris_11
## 12           4.8         3.4          1.6         0.2     setosa  iris_12
## 13           4.8         3.0          1.4         0.1     setosa  iris_13
## 14           4.3         3.0          1.1         0.1     setosa  iris_14
## 15           5.8         4.0          1.2         0.2     setosa  iris_15
## 16           5.7         4.4          1.5         0.4     setosa  iris_16
## 17           5.4         3.9          1.3         0.4     setosa  iris_17
## 18           5.1         3.5          1.4         0.3     setosa  iris_18
## 19           5.7         3.8          1.7         0.3     setosa  iris_19
## 20           5.1         3.8          1.5         0.3     setosa  iris_20
## 21           5.4         3.4          1.7         0.2     setosa  iris_21
## 22           5.1         3.7          1.5         0.4     setosa  iris_22
## 23           4.6         3.6          1.0         0.2     setosa  iris_23
## 24           5.1         3.3          1.7         0.5     setosa  iris_24
## 25           4.8         3.4          1.9         0.2     setosa  iris_25
## 26           5.0         3.0          1.6         0.2     setosa  iris_26
## 27           5.0         3.4          1.6         0.4     setosa  iris_27
## 28           5.2         3.5          1.5         0.2     setosa  iris_28
## 29           5.2         3.4          1.4         0.2     setosa  iris_29
## 30           4.7         3.2          1.6         0.2     setosa  iris_30
## 31           4.8         3.1          1.6         0.2     setosa  iris_31
## 32           5.4         3.4          1.5         0.4     setosa  iris_32
## 33           5.2         4.1          1.5         0.1     setosa  iris_33
## 34           5.5         4.2          1.4         0.2     setosa  iris_34
## 35           4.9         3.1          1.5         0.2     setosa  iris_35
## 36           5.0         3.2          1.2         0.2     setosa  iris_36
## 37           5.5         3.5          1.3         0.2     setosa  iris_37
## 38           4.9         3.6          1.4         0.1     setosa  iris_38
## 39           4.4         3.0          1.3         0.2     setosa  iris_39
## 40           5.1         3.4          1.5         0.2     setosa  iris_40
## 41           5.0         3.5          1.3         0.3     setosa  iris_41
## 42           4.5         2.3          1.3         0.3     setosa  iris_42
## 43           4.4         3.2          1.3         0.2     setosa  iris_43
## 44           5.0         3.5          1.6         0.6     setosa  iris_44
## 45           5.1         3.8          1.9         0.4     setosa  iris_45
## 46           4.8         3.0          1.4         0.3     setosa  iris_46
## 47           5.1         3.8          1.6         0.2     setosa  iris_47
## 48           4.6         3.2          1.4         0.2     setosa  iris_48
## 49           5.3         3.7          1.5         0.2     setosa  iris_49
## 50           5.0         3.3          1.4         0.2     setosa  iris_50
## 51           7.0         3.2          4.7         1.4 versicolor  iris_51
## 52           6.4         3.2          4.5         1.5 versicolor  iris_52
## 53           6.9         3.1          4.9         1.5 versicolor  iris_53
## 54           5.5         2.3          4.0         1.3 versicolor  iris_54
## 55           6.5         2.8          4.6         1.5 versicolor  iris_55
## 56           5.7         2.8          4.5         1.3 versicolor  iris_56
## 57           6.3         3.3          4.7         1.6 versicolor  iris_57
## 58           4.9         2.4          3.3         1.0 versicolor  iris_58
## 59           6.6         2.9          4.6         1.3 versicolor  iris_59
## 60           5.2         2.7          3.9         1.4 versicolor  iris_60
## 61           5.0         2.0          3.5         1.0 versicolor  iris_61
## 62           5.9         3.0          4.2         1.5 versicolor  iris_62
## 63           6.0         2.2          4.0         1.0 versicolor  iris_63
## 64           6.1         2.9          4.7         1.4 versicolor  iris_64
## 65           5.6         2.9          3.6         1.3 versicolor  iris_65
## 66           6.7         3.1          4.4         1.4 versicolor  iris_66
## 67           5.6         3.0          4.5         1.5 versicolor  iris_67
## 68           5.8         2.7          4.1         1.0 versicolor  iris_68
## 69           6.2         2.2          4.5         1.5 versicolor  iris_69
## 70           5.6         2.5          3.9         1.1 versicolor  iris_70
## 71           5.9         3.2          4.8         1.8 versicolor  iris_71
## 72           6.1         2.8          4.0         1.3 versicolor  iris_72
## 73           6.3         2.5          4.9         1.5 versicolor  iris_73
## 74           6.1         2.8          4.7         1.2 versicolor  iris_74
## 75           6.4         2.9          4.3         1.3 versicolor  iris_75
## 76           6.6         3.0          4.4         1.4 versicolor  iris_76
## 77           6.8         2.8          4.8         1.4 versicolor  iris_77
## 78           6.7         3.0          5.0         1.7 versicolor  iris_78
## 79           6.0         2.9          4.5         1.5 versicolor  iris_79
## 80           5.7         2.6          3.5         1.0 versicolor  iris_80
## 81           5.5         2.4          3.8         1.1 versicolor  iris_81
## 82           5.5         2.4          3.7         1.0 versicolor  iris_82
## 83           5.8         2.7          3.9         1.2 versicolor  iris_83
## 84           6.0         2.7          5.1         1.6 versicolor  iris_84
## 85           5.4         3.0          4.5         1.5 versicolor  iris_85
## 86           6.0         3.4          4.5         1.6 versicolor  iris_86
## 87           6.7         3.1          4.7         1.5 versicolor  iris_87
## 88           6.3         2.3          4.4         1.3 versicolor  iris_88
## 89           5.6         3.0          4.1         1.3 versicolor  iris_89
## 90           5.5         2.5          4.0         1.3 versicolor  iris_90
## 91           5.5         2.6          4.4         1.2 versicolor  iris_91
## 92           6.1         3.0          4.6         1.4 versicolor  iris_92
## 93           5.8         2.6          4.0         1.2 versicolor  iris_93
## 94           5.0         2.3          3.3         1.0 versicolor  iris_94
## 95           5.6         2.7          4.2         1.3 versicolor  iris_95
## 96           5.7         3.0          4.2         1.2 versicolor  iris_96
## 97           5.7         2.9          4.2         1.3 versicolor  iris_97
## 98           6.2         2.9          4.3         1.3 versicolor  iris_98
## 99           5.1         2.5          3.0         1.1 versicolor  iris_99
## 100          5.7         2.8          4.1         1.3 versicolor iris_100
## 101          6.3         3.3          6.0         2.5  virginica iris_101
## 102          5.8         2.7          5.1         1.9  virginica iris_102
## 103          7.1         3.0          5.9         2.1  virginica iris_103
## 104          6.3         2.9          5.6         1.8  virginica iris_104
## 105          6.5         3.0          5.8         2.2  virginica iris_105
## 106          7.6         3.0          6.6         2.1  virginica iris_106
## 107          4.9         2.5          4.5         1.7  virginica iris_107
## 108          7.3         2.9          6.3         1.8  virginica iris_108
## 109          6.7         2.5          5.8         1.8  virginica iris_109
## 110          7.2         3.6          6.1         2.5  virginica iris_110
## 111          6.5         3.2          5.1         2.0  virginica iris_111
## 112          6.4         2.7          5.3         1.9  virginica iris_112
## 113          6.8         3.0          5.5         2.1  virginica iris_113
## 114          5.7         2.5          5.0         2.0  virginica iris_114
## 115          5.8         2.8          5.1         2.4  virginica iris_115
## 116          6.4         3.2          5.3         2.3  virginica iris_116
## 117          6.5         3.0          5.5         1.8  virginica iris_117
## 118          7.7         3.8          6.7         2.2  virginica iris_118
## 119          7.7         2.6          6.9         2.3  virginica iris_119
## 120          6.0         2.2          5.0         1.5  virginica iris_120
## 121          6.9         3.2          5.7         2.3  virginica iris_121
## 122          5.6         2.8          4.9         2.0  virginica iris_122
## 123          7.7         2.8          6.7         2.0  virginica iris_123
## 124          6.3         2.7          4.9         1.8  virginica iris_124
## 125          6.7         3.3          5.7         2.1  virginica iris_125
## 126          7.2         3.2          6.0         1.8  virginica iris_126
## 127          6.2         2.8          4.8         1.8  virginica iris_127
## 128          6.1         3.0          4.9         1.8  virginica iris_128
## 129          6.4         2.8          5.6         2.1  virginica iris_129
## 130          7.2         3.0          5.8         1.6  virginica iris_130
## 131          7.4         2.8          6.1         1.9  virginica iris_131
## 132          7.9         3.8          6.4         2.0  virginica iris_132
## 133          6.4         2.8          5.6         2.2  virginica iris_133
## 134          6.3         2.8          5.1         1.5  virginica iris_134
## 135          6.1         2.6          5.6         1.4  virginica iris_135
## 136          7.7         3.0          6.1         2.3  virginica iris_136
## 137          6.3         3.4          5.6         2.4  virginica iris_137
## 138          6.4         3.1          5.5         1.8  virginica iris_138
## 139          6.0         3.0          4.8         1.8  virginica iris_139
## 140          6.9         3.1          5.4         2.1  virginica iris_140
## 141          6.7         3.1          5.6         2.4  virginica iris_141
## 142          6.9         3.1          5.1         2.3  virginica iris_142
## 143          5.8         2.7          5.1         1.9  virginica iris_143
## 144          6.8         3.2          5.9         2.3  virginica iris_144
## 145          6.7         3.3          5.7         2.5  virginica iris_145
## 146          6.7         3.0          5.2         2.3  virginica iris_146
## 147          6.3         2.5          5.0         1.9  virginica iris_147
## 148          6.5         3.0          5.2         2.0  virginica iris_148
## 149          6.2         3.4          5.4         2.3  virginica iris_149
## 150          5.9         3.0          5.1         1.8  virginica iris_150

I simply added a column with identifiers, all of which start with iris_, followed by a different number. And now you can keep track of every single flower with its own… name, in a way. Nice, isn’t it?
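
As a possible variant, the sketch below builds the same identifiers without writing the number of rows by hand, and puts the new column first. It assumes dplyr 1.0.0 or later (for the .before argument of mutate()), and iris_with_id is just a name I chose for the result.

```r
library(dplyr)
library(stringr)

# row_number() numbers the rows 1, 2, 3… whatever the size of the table.
iris_with_id <- iris %>%
  mutate(ID = str_c("iris_", row_number()), .before = 1)

head(iris_with_id, 3)

# No identifier is repeated, so ID can act as a key column.
anyDuplicated(iris_with_id$ID) == 0
```

This way the code keeps working if rows are added to or removed from the table.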

So, remember…

When preparing your data set, think carefully:

  1. What are your observations?

  2. What are your variables?

  3. How many types of observational unit do you have?

  4. And remember to add key columns!