We’re mostly used to dealing with formatted or rich text (binary text is a term also used): that’s what we have in Word, PowerPoint, etc. Plain text, however, contains only text, that is, only readable characters. It has no format, no fonts, no colour… Just characters. That’s what we most often need when programming. Plain text uses the .txt extension. We can read plain text with just about any word processor, like Microsoft Word, Pages, OpenOffice… However, because word processors tend to do things to text that you don’t see and that may make your life harder, using a dedicated text editor is recommended. Source-code editors also work and, since you’re learning how to program, they can be a good idea. I’m quite happy with Visual Studio Code, in case you want to give it a try (it’s not hard to use, but I recommend watching a tutorial).
We can also use plain text to represent tables (i.e. delimited text). Delimited text files use characters to indicate a structure of columns and rows. Rows are typically delimited by new lines; R (and any other program used to read a delimited text file) assumes that this is the case, so you don’t need to specify anything. Columns, however, might be delimited by a number of different characters. The most common ones are:
- commas (,)
- tabs (keep in mind that their escape code, that is, the sequence of characters signaling the presence of a tab, is \t)
- semicolons (;)
Normally, you need to tell R (or any other program) how your columns are delimited so that it can properly read your table. The extensions .csv (comma-separated values) and .tsv (tab-separated values) are often not good indicators of the actual delimiter, because you can have a .csv file delimited by tabs or semicolons and vice versa (confusing, I know). (When you save a .csv file with Excel on a Mac, for instance, it typically uses semicolons.) The good news is that this problem is easily solved: if you read delimited text and the result does not look like a table, check the result to see which delimiter is actually being used and read/open the file again setting the proper delimiter! In this course, we will be using tabs as delimiters most of the time.
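To make this concrete, here is a minimal sketch in base R of reading a table with three different column delimiters. The file names are just placeholders, not files from the course.

```r
# Tab-separated values: sep = "\t" (this is also what read.delim() assumes)
corpus_tab <- read.delim("my_table.txt", sep = "\t", header = TRUE)

# Comma-separated values: read.csv() sets sep = "," for you
corpus_comma <- read.csv("my_table.csv")

# Semicolon-separated values (e.g. a .csv saved by Excel on some systems):
# read.csv2() assumes sep = ";" (and a decimal comma)
corpus_semi <- read.csv2("my_table.csv")

# If the result looks wrong (e.g. everything crammed into one column),
# inspect it and re-read the file with the right delimiter
head(corpus_tab)
```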
Text is composed of characters (letters, punctuation…). And it’s important that you know that several character sets with different encodings exist. Your computer needs to know the text encoding in order to read it properly. Some common encodings are:
1. ASCII: the first one to be created, it has only the simplest Latin characters: no accents, no tildes…
2. ISO-8859-1 (Latin-1): this one does have those complex characters most European languages have and is the encoding that you might find in texts that were created a while ago.
3. UTF-8 (the Unicode standard): this is the one that most people are using right now. It has the most comprehensive character set and is the universal standard. It’s the encoding that you will most likely find in modern texts. ASCII is a subset of UTF-8, so you can also read any text encoded in ASCII using a UTF-8 encoding.
4. Windows-1252 / CP-1252: at some point, someone thought that creating very platform-specific encoding systems was a good idea. It was not, and that’s why we have a universal standard right now, but you might still find some texts in encodings that are specific to Windows, Mac…
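As a rough illustration (the file name is hypothetical), this is one way to read an older Latin-1 text into R and convert it to UTF-8:

```r
# Declare the encoding of the input so R interprets the characters correctly
latin1_lines <- readLines("old_text.txt", encoding = "latin1")

# Convert the strings to UTF-8 so they behave like the rest of your (modern) data
utf8_lines <- iconv(latin1_lines, from = "latin1", to = "UTF-8")

# Check which encoding R has marked for the strings
Encoding(utf8_lines)
```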
Talking about different operating systems… It’s good to know that new lines (that is, line endings, the end of the line (EOL), also called line feed or line break… whatever happens when you press Enter) are treated differently in Windows and in Linux / Mac. In Windows, new lines are encoded as a two-character sequence: \r\n (which stands for carriage return + line feed). In Linux and Mac they are simply encoded as \n. Good to know so that you can be careful when you convert files between systems!
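If you are curious, here is a tiny illustration in R, built on made-up strings rather than real files, of the two conventions and how you could normalise them:

```r
windows_text <- "first line\r\nsecond line"   # Windows: \r\n
unix_text    <- "first line\nsecond line"     # Linux / Mac: \n

# Normalise Windows line endings to the Unix convention
normalised <- gsub("\r\n", "\n", windows_text, fixed = TRUE)
identical(normalised, unix_text)   # TRUE
```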
As you probably know, there is a lot of debate about what a word is in linguistics. But there is agreement that a graphic definition (that is, a word is whatever we separate with spaces in writing) is not appropriate. This is because we have contractions (en. cannot and can’t for can not, fr. des for de les, sp. al for a el); compound words with variable spelling (de. Kaffee-Ersatz or Kaffeeersatz, en. pigeon-hole or pigeonhole…), etc. Of course, computational tools use… a graphic definition of word! That is, they consider that words are sequences of characters (numbers, apostrophes and underscores included) delimited by spaces or punctuation marks.
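As an illustration of this graphic definition, here is a rough sketch in base R; the splitting pattern is just one possible choice, not necessarily the one any particular tool uses:

```r
# Split a string on anything that is not a letter, digit, apostrophe or underscore
text <- "Rosemary can't open the pigeon-hole; she looked at the guy."
words <- unlist(strsplit(text, "[^[:alnum:]'_]+"))
words <- words[words != ""]   # drop any empty strings left by the split
words
# "can't" stays together (apostrophes are kept), while "pigeon-hole"
# is split into "pigeon" and "hole" (the hyphen counts as punctuation here)
```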
As you know, linguistic concepts such as sentences or phrases are based on structural reasons and not on contiguity. We know that linguistic structure is not linear! However, text is. In computational text analysis, the basic concept of sentence is 1) linear and 2) again based on graphic criteria. Thus, sentences are sequences of characters separated by a set of punctuation marks, such as the full stop, the question mark and the exclamation mark. In Rosemary came in, looked at the guy that was sitting in the corner and realised that that was the perfect moment to bring it up. R will identify one single sentence.
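A minimal sketch of this graphic sentence splitting in base R, using the example above:

```r
# Break on full stops, question marks and exclamation marks
text <- "Rosemary came in, looked at the guy that was sitting in the corner and realised that that was the perfect moment to bring it up."
sentences <- unlist(strsplit(text, "[.?!]+\\s*"))
sentences   # one single element, i.e. one sentence
```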
Another category that is very often used in text analysis is n-grams, which is again based on linearity. N-grams are “contiguous sequences of n items from a given sample of text”. Grams (= items) can be words, letters, syllables… It just depends on what you’re interested in. We will be mostly using words as grams. You can have 1-grams (unigrams), 2-grams (bigrams), 3-grams (trigrams)… In the sentence mentioned above, our first few 2-grams would be: Rosemary came, came in, in looked, looked at, at the… Note that words overlap from 2-gram to 2-gram, because we take every single word to be the first element of a 2-gram!
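If you want to see how simple this is computationally, here is a hand-rolled sketch of bigrams in base R (real analyses usually rely on a package, but the idea is the same):

```r
words <- c("Rosemary", "came", "in", "looked", "at", "the", "guy")

# Pair each word (except the last) with the word that follows it
bigrams <- paste(words[-length(words)], words[-1])
bigrams
# "Rosemary came" "came in" "in looked" "looked at" "at the" "the guy"
```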
That is, the concepts of word and sentence differ between linguistics and computational text analysis, and n-grams do not correspond to a linguistic concept at all. However, the computational concepts are useful for linguists too, because even if they are not perfect, they can be good approximations. You might have heard about “noise” in the data. Noisy data are not perfect, but, if they are not too far from perfection, the results of analyzing them can be good enough. When using large amounts of data, i.e. data that we could not possibly analyze by hand, we need to accept a trade-off between a reasonable amount of noise (the con) and having lots of data (the pro). What does reasonable mean in this context? Well, it is the researcher who should decide, by checking a sample of the data and assessing how noisy it is. That is why, as linguists, it is crucial that we are aware of these simplifications, so that we can deal with them when they are relevant for our studies.
The distinction between types and tokens is quite a relevant one, both in linguistics and in computational analysis. Types refer to classes, while tokens refer to individuals. For instance, the number of distinct words within a text refers to type frequency, while the total word count within a text refers to token frequency. Again, remember that here “word” refers to the computational concept! In linguistics it normally makes sense to consider all inflected forms of a word together, for instance. To do that computationally, we will need a lemmatized text… We’ll talk about that in further sessions.
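A quick sketch of the difference on a toy word vector in R:

```r
words <- c("the", "cat", "sat", "on", "the", "mat")

length(words)          # token frequency: 6 (all words, repetitions included)
length(unique(words))  # type frequency: 5 ("the" occurs twice)
table(words)           # how many tokens each type has
```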
Again, a non-linguistic concept. Stopwords are words that are filtered out when processing text. While you can decide which words you want to filter out depending on the purpose of your analysis, stopwords are typically the most frequent words of a language (i.e. function or grammar words). This makes sense if you want to understand the content of a text, for instance: grammar words do not give much information about that. However, if you’re interested in, let’s say, how prepositions are distributed in a given text… Well, then you do not want to filter prepositions out! That is, whether to filter stopwords out depends on the purpose of your analysis.
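A minimal sketch of stopword removal in R, with a tiny hand-made stopword list (real stopword lists, such as the ones that come with text-analysis packages, are much longer):

```r
words <- c("the", "cat", "sat", "on", "the", "mat")
stopwords <- c("the", "on", "a", "an", "of")   # hypothetical mini-list

# Keep only the words that are NOT in the stopword list
content_words <- words[!words %in% stopwords]
content_words   # "cat" "sat" "mat"
```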
We’ll also talk about other textual structures (such as collocations) in a few sessions.