Cleaning and preparing data is a huge part of data analysis, and a crucial part of this process has to do with the structure of the data set. In this course we are going to work with the tidyverse, a collection of R packages centered around the notion of tidy data. Tidy data refers to a standard way of organising tabular data. In the words of its creators: “The principles of tidy data provide a standard way to organise data values within a data set”. The main idea of this standard is to link the structure of a data set (its physical layout) with its semantics (its meaning). In the tidyverse, tables are called tibbles. (I guess that’s because they are tidy tables!)
When we talk about data structure, we mean columns and rows, since we are dealing with tabular data. Columns are typically labeled; rows may be labeled too, but they need not be.
When talking about data semantics, there are three main concepts we need to keep in mind: observations, variables and values. A data set is a collection of values, which might be quantitative (numbers) or categorical (not numbers). Every value belongs to a variable and to an observation. A variable contains all values of the same attribute. For instance, grammatical category might be a variable, with different possible categorical values (noun, verb, adjective…). Or we could have a variable with the number of syllables, with different possible numerical values (1, 2, 3…). An observation contains all values measured on the same unit. Depending on our study, the unit might be a word (imagine we want to investigate what kind of words appear in a given text), a sentence (if we wanted to analyse differential object marking in a corpus, for instance), a person (if we want to analyse a sample of people according to their sociological characteristics), etc.
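To make these three concepts a bit more concrete, here is a minimal sketch of a small tibble. Keep in mind that the words and their values are made up for illustration (they do not come from any real corpus), and that the names words, word, category and syllables are simply my own choice.

```r
library(tibble)

# Hypothetical toy data: each row is one observation (a word),
# each column is one variable, and each cell holds one value.
words <- tibble(
  word      = c("house", "run", "beautiful"),
  category  = c("noun", "verb", "adjective"),  # categorical variable
  syllables = c(1, 1, 3)                       # quantitative variable
)

words
nrow(words)  # number of observations
ncol(words)  # number of variables
```

Here the unit of observation is the word, so adding a new word means adding a new row, while adding a new attribute (say, word frequency) would mean adding a new column.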
Tidy data follows three simple principles to link data structure and data semantics:
Each variable forms a column.
Each observation forms a row.
Each value has its own cell.
For instance, let’s take a look at the iris data set, which is preloaded in R. This data set contains information on 150 flowers of three different species of iris. Each flower is an observation, which means that all the information coded for one flower (i.e. each of its values) appears in one and the same row. The information coded for each flower refers to 5 different variables: sepal length, sepal width, petal length, petal width and species. As you can see, each variable has its own column. This is a tidy data set!
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
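If you don’t feel like scrolling through all 150 rows, you can also ask R about the structure of the table directly. This is just a small sketch using base R functions, so no extra package is needed:

```r
# iris ships with R, so it is available without loading anything.
dim(iris)             # number of rows (observations) and columns (variables)
names(iris)           # one name per variable
levels(iris$Species)  # the species coded in the data
head(iris, 5)         # only the first five flowers
```

The first call should confirm what we said above: one row per flower and one column per variable.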
There is a fourth principle in tidy data, which says that each type of observational unit forms a table. For instance, imagine a very common scenario in linguistics: you have a corpus of texts by different authors and want to analyse all transitive sentences in these texts. You want to know the verb they contain, the verb tense and the person of the subject. Each transitive sentence is an observation, and the three parameters you’re analysing are your three variables. But maybe you suspect that the characteristics of your sentences might differ from text to text, because the texts have different authors, come from different places and were written in different years. You could add three columns with this information or, better, create a new table, where each text is an observation (= row) and the author, the author’s birthplace and the year of the text are the variables. Of course, if you want this information to be useful, both tables should have one variable in common, ideally in the first column: an identifier of the text!
And this idea of identifiers brings us to a very useful concept when creating databases: the key column. A key column is used to uniquely identify rows in a data frame, that is, key values (i.e., values in the key column) must be different in each row. Usually, these are some kind of identifier.
I highly recommend that you add a column with unique identifiers for your observations to all your tables. Chances are this will come in handy. For instance, it helps when you need to compare tables, such as previous versions of the same data frame, which happens often if you’re analysing large quantities of data.
Moreover, key columns are very useful when combining tables. In our example above, the column with the text identifier in the second table (the one whose observations are the texts) is a key column. If you’ve also added the text identifier in your table with the sentences, you will be able to join your two tables easily in case you want to correlate all the information, thanks to your key column!
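Here is a rough sketch of what the two tables from the corpus example could look like and how the shared identifier lets you join them. All the values below are invented placeholders, and the object and column names (texts, sentences, text_id, sentence_id and so on) are my own assumptions, not a fixed convention.

```r
library(tibble)
library(dplyr)

# Hypothetical data, invented for illustration only.
# Table 1: one row per text; text_id is the key column.
texts <- tibble(
  text_id    = c("text_1", "text_2"),
  author     = c("Author A", "Author B"),
  birthplace = c("Place A", "Place B"),
  year       = c(1900, 1950)
)

# Table 2: one row per transitive sentence; sentence_id is its key,
# and text_id points to the text the sentence comes from.
sentences <- tibble(
  sentence_id    = c("sent_1", "sent_2", "sent_3"),
  text_id        = c("text_1", "text_1", "text_2"),
  verb           = c("see", "eat", "read"),
  tense          = c("past", "present", "past"),
  subject_person = c(3, 1, 2)
)

# Add the text-level information to every sentence.
joined <- sentences %>%
  left_join(texts, by = "text_id")

joined
```

I used left_join() here because it keeps every sentence and simply looks up the matching text. Note that the author information is stored only once per text and is repeated only in the joined result, when you actually need it.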
Let’s see what iris looks like with a key column:
library(tidyverse)  # provides %>%, mutate() and str_c()
iris %>%
  mutate(ID = str_c("iris_", 1:150))  # one unique identifier per row
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID
## 1 5.1 3.5 1.4 0.2 setosa iris_1
## 2 4.9 3.0 1.4 0.2 setosa iris_2
## 3 4.7 3.2 1.3 0.2 setosa iris_3
## 4 4.6 3.1 1.5 0.2 setosa iris_4
## 5 5.0 3.6 1.4 0.2 setosa iris_5
## 6 5.4 3.9 1.7 0.4 setosa iris_6
## 7 4.6 3.4 1.4 0.3 setosa iris_7
## 8 5.0 3.4 1.5 0.2 setosa iris_8
## 9 4.4 2.9 1.4 0.2 setosa iris_9
## 10 4.9 3.1 1.5 0.1 setosa iris_10
## 11 5.4 3.7 1.5 0.2 setosa iris_11
## 12 4.8 3.4 1.6 0.2 setosa iris_12
## 13 4.8 3.0 1.4 0.1 setosa iris_13
## 14 4.3 3.0 1.1 0.1 setosa iris_14
## 15 5.8 4.0 1.2 0.2 setosa iris_15
## 16 5.7 4.4 1.5 0.4 setosa iris_16
## 17 5.4 3.9 1.3 0.4 setosa iris_17
## 18 5.1 3.5 1.4 0.3 setosa iris_18
## 19 5.7 3.8 1.7 0.3 setosa iris_19
## 20 5.1 3.8 1.5 0.3 setosa iris_20
## 21 5.4 3.4 1.7 0.2 setosa iris_21
## 22 5.1 3.7 1.5 0.4 setosa iris_22
## 23 4.6 3.6 1.0 0.2 setosa iris_23
## 24 5.1 3.3 1.7 0.5 setosa iris_24
## 25 4.8 3.4 1.9 0.2 setosa iris_25
## 26 5.0 3.0 1.6 0.2 setosa iris_26
## 27 5.0 3.4 1.6 0.4 setosa iris_27
## 28 5.2 3.5 1.5 0.2 setosa iris_28
## 29 5.2 3.4 1.4 0.2 setosa iris_29
## 30 4.7 3.2 1.6 0.2 setosa iris_30
## 31 4.8 3.1 1.6 0.2 setosa iris_31
## 32 5.4 3.4 1.5 0.4 setosa iris_32
## 33 5.2 4.1 1.5 0.1 setosa iris_33
## 34 5.5 4.2 1.4 0.2 setosa iris_34
## 35 4.9 3.1 1.5 0.2 setosa iris_35
## 36 5.0 3.2 1.2 0.2 setosa iris_36
## 37 5.5 3.5 1.3 0.2 setosa iris_37
## 38 4.9 3.6 1.4 0.1 setosa iris_38
## 39 4.4 3.0 1.3 0.2 setosa iris_39
## 40 5.1 3.4 1.5 0.2 setosa iris_40
## 41 5.0 3.5 1.3 0.3 setosa iris_41
## 42 4.5 2.3 1.3 0.3 setosa iris_42
## 43 4.4 3.2 1.3 0.2 setosa iris_43
## 44 5.0 3.5 1.6 0.6 setosa iris_44
## 45 5.1 3.8 1.9 0.4 setosa iris_45
## 46 4.8 3.0 1.4 0.3 setosa iris_46
## 47 5.1 3.8 1.6 0.2 setosa iris_47
## 48 4.6 3.2 1.4 0.2 setosa iris_48
## 49 5.3 3.7 1.5 0.2 setosa iris_49
## 50 5.0 3.3 1.4 0.2 setosa iris_50
## 51 7.0 3.2 4.7 1.4 versicolor iris_51
## 52 6.4 3.2 4.5 1.5 versicolor iris_52
## 53 6.9 3.1 4.9 1.5 versicolor iris_53
## 54 5.5 2.3 4.0 1.3 versicolor iris_54
## 55 6.5 2.8 4.6 1.5 versicolor iris_55
## 56 5.7 2.8 4.5 1.3 versicolor iris_56
## 57 6.3 3.3 4.7 1.6 versicolor iris_57
## 58 4.9 2.4 3.3 1.0 versicolor iris_58
## 59 6.6 2.9 4.6 1.3 versicolor iris_59
## 60 5.2 2.7 3.9 1.4 versicolor iris_60
## 61 5.0 2.0 3.5 1.0 versicolor iris_61
## 62 5.9 3.0 4.2 1.5 versicolor iris_62
## 63 6.0 2.2 4.0 1.0 versicolor iris_63
## 64 6.1 2.9 4.7 1.4 versicolor iris_64
## 65 5.6 2.9 3.6 1.3 versicolor iris_65
## 66 6.7 3.1 4.4 1.4 versicolor iris_66
## 67 5.6 3.0 4.5 1.5 versicolor iris_67
## 68 5.8 2.7 4.1 1.0 versicolor iris_68
## 69 6.2 2.2 4.5 1.5 versicolor iris_69
## 70 5.6 2.5 3.9 1.1 versicolor iris_70
## 71 5.9 3.2 4.8 1.8 versicolor iris_71
## 72 6.1 2.8 4.0 1.3 versicolor iris_72
## 73 6.3 2.5 4.9 1.5 versicolor iris_73
## 74 6.1 2.8 4.7 1.2 versicolor iris_74
## 75 6.4 2.9 4.3 1.3 versicolor iris_75
## 76 6.6 3.0 4.4 1.4 versicolor iris_76
## 77 6.8 2.8 4.8 1.4 versicolor iris_77
## 78 6.7 3.0 5.0 1.7 versicolor iris_78
## 79 6.0 2.9 4.5 1.5 versicolor iris_79
## 80 5.7 2.6 3.5 1.0 versicolor iris_80
## 81 5.5 2.4 3.8 1.1 versicolor iris_81
## 82 5.5 2.4 3.7 1.0 versicolor iris_82
## 83 5.8 2.7 3.9 1.2 versicolor iris_83
## 84 6.0 2.7 5.1 1.6 versicolor iris_84
## 85 5.4 3.0 4.5 1.5 versicolor iris_85
## 86 6.0 3.4 4.5 1.6 versicolor iris_86
## 87 6.7 3.1 4.7 1.5 versicolor iris_87
## 88 6.3 2.3 4.4 1.3 versicolor iris_88
## 89 5.6 3.0 4.1 1.3 versicolor iris_89
## 90 5.5 2.5 4.0 1.3 versicolor iris_90
## 91 5.5 2.6 4.4 1.2 versicolor iris_91
## 92 6.1 3.0 4.6 1.4 versicolor iris_92
## 93 5.8 2.6 4.0 1.2 versicolor iris_93
## 94 5.0 2.3 3.3 1.0 versicolor iris_94
## 95 5.6 2.7 4.2 1.3 versicolor iris_95
## 96 5.7 3.0 4.2 1.2 versicolor iris_96
## 97 5.7 2.9 4.2 1.3 versicolor iris_97
## 98 6.2 2.9 4.3 1.3 versicolor iris_98
## 99 5.1 2.5 3.0 1.1 versicolor iris_99
## 100 5.7 2.8 4.1 1.3 versicolor iris_100
## 101 6.3 3.3 6.0 2.5 virginica iris_101
## 102 5.8 2.7 5.1 1.9 virginica iris_102
## 103 7.1 3.0 5.9 2.1 virginica iris_103
## 104 6.3 2.9 5.6 1.8 virginica iris_104
## 105 6.5 3.0 5.8 2.2 virginica iris_105
## 106 7.6 3.0 6.6 2.1 virginica iris_106
## 107 4.9 2.5 4.5 1.7 virginica iris_107
## 108 7.3 2.9 6.3 1.8 virginica iris_108
## 109 6.7 2.5 5.8 1.8 virginica iris_109
## 110 7.2 3.6 6.1 2.5 virginica iris_110
## 111 6.5 3.2 5.1 2.0 virginica iris_111
## 112 6.4 2.7 5.3 1.9 virginica iris_112
## 113 6.8 3.0 5.5 2.1 virginica iris_113
## 114 5.7 2.5 5.0 2.0 virginica iris_114
## 115 5.8 2.8 5.1 2.4 virginica iris_115
## 116 6.4 3.2 5.3 2.3 virginica iris_116
## 117 6.5 3.0 5.5 1.8 virginica iris_117
## 118 7.7 3.8 6.7 2.2 virginica iris_118
## 119 7.7 2.6 6.9 2.3 virginica iris_119
## 120 6.0 2.2 5.0 1.5 virginica iris_120
## 121 6.9 3.2 5.7 2.3 virginica iris_121
## 122 5.6 2.8 4.9 2.0 virginica iris_122
## 123 7.7 2.8 6.7 2.0 virginica iris_123
## 124 6.3 2.7 4.9 1.8 virginica iris_124
## 125 6.7 3.3 5.7 2.1 virginica iris_125
## 126 7.2 3.2 6.0 1.8 virginica iris_126
## 127 6.2 2.8 4.8 1.8 virginica iris_127
## 128 6.1 3.0 4.9 1.8 virginica iris_128
## 129 6.4 2.8 5.6 2.1 virginica iris_129
## 130 7.2 3.0 5.8 1.6 virginica iris_130
## 131 7.4 2.8 6.1 1.9 virginica iris_131
## 132 7.9 3.8 6.4 2.0 virginica iris_132
## 133 6.4 2.8 5.6 2.2 virginica iris_133
## 134 6.3 2.8 5.1 1.5 virginica iris_134
## 135 6.1 2.6 5.6 1.4 virginica iris_135
## 136 7.7 3.0 6.1 2.3 virginica iris_136
## 137 6.3 3.4 5.6 2.4 virginica iris_137
## 138 6.4 3.1 5.5 1.8 virginica iris_138
## 139 6.0 3.0 4.8 1.8 virginica iris_139
## 140 6.9 3.1 5.4 2.1 virginica iris_140
## 141 6.7 3.1 5.6 2.4 virginica iris_141
## 142 6.9 3.1 5.1 2.3 virginica iris_142
## 143 5.8 2.7 5.1 1.9 virginica iris_143
## 144 6.8 3.2 5.9 2.3 virginica iris_144
## 145 6.7 3.3 5.7 2.5 virginica iris_145
## 146 6.7 3.0 5.2 2.3 virginica iris_146
## 147 6.3 2.5 5.0 1.9 virginica iris_147
## 148 6.5 3.0 5.2 2.0 virginica iris_148
## 149 6.2 3.4 5.4 2.3 virginica iris_149
## 150 5.9 3.0 5.1 1.8 virginica iris_150
I simply added a column with identifiers, all of which start with iris_, followed by a different number. Now you can keep track of every single flower by its own… name, in a way. Nice, isn’t it?
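If you want to make sure that a column really works as a key, you can check that none of its values is repeated. This is a small sketch, assuming the ID column is built the same way as above; the object name iris_keyed is just my choice.

```r
library(dplyr)
library(stringr)

iris_keyed <- iris %>%
  mutate(ID = str_c("iris_", 1:150))

# A key column has as many distinct values as the table has rows.
n_distinct(iris_keyed$ID) == nrow(iris_keyed)

# anyDuplicated() returns 0 when no value is repeated.
anyDuplicated(iris_keyed$ID)

# Species, by contrast, is not a key: many flowers share the same value.
n_distinct(iris_keyed$Species) == nrow(iris_keyed)

# With a key you can pick out exactly one observation.
iris_keyed %>%
  filter(ID == "iris_42")
```

Writing row_number() instead of 1:150 inside mutate() would do the same job without hard-coding the number of rows, which is handy if your table grows later.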
When preparing your data set, think carefully:
What are your observations?
What are your variables?
How many observational units do you have?
And remember to add key columns!