quanteda is an R package for a variety of natural language processing tasks: corpus management, tokenization, analysis, and visualization.
→ quanteda is to text analysis what dplyr and tidyr are to data wrangling
In quanteda, text is processed as:
corpus → text data is first converted to corpus format (with the corpus() function) so that quanteda can work on it; a corpus holds documents separately from each other.
tokens → usually each word in a text, but also single characters or sentences if we want
document-feature matrix (“dfm”) → the analytical unit on which analysis is performed; text documents are organized in a matrix, with the original texts as rows and features as columns. “Features” are defined more generally than “terms”, as they may be raw terms, stemmed terms, terms without stopwords, etc.
Source: MZES Social Science Data Lab, 2021
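As a rough illustration of these three steps, the sketch below runs a toy two-document example through corpus(), tokens() and dfm(). The texts and the object names (toy_text, toy_corpus, toy_token, toy_dfm) are made up for this example and do not come from the original material.

```r
library(quanteda)

# two made-up documents (illustrative only)
toy_text <- c(doc1 = "Text analysis is fun.",
              doc2 = "Text is data.")

toy_corpus <- corpus(toy_text)                         # step 1: corpus
toy_token  <- tokens(toy_corpus, remove_punct = TRUE)  # step 2: tokens
toy_dfm    <- dfm(toy_token)                           # step 3: dfm (lowercases features by default)

ndoc(toy_dfm)       # number of documents (rows)
featnames(toy_dfm)  # features (columns)
```

Each object keeps the documents separate, so the dfm ends up with one row per original text.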
According to its developers, quanteda is built to be faster and more efficient than any other R or Python package for processing large textual data. Its infrastructure rests on three main pillars:
stringi package for text processing
Matrix package for sparse matrix objects
computationally intensive processing (e.g. for tokens) handled in parallelized C++
Intuitive, powerful, and flexible
Now let's look at the text preprocessing workflow and some functions!
library(readtext) # companion package to quanteda for reading text (.txt) or comma-separated-value (.csv) files
library(quanteda) # for making a corpus and the rest of our text processing

austen_texts = readtext("../data/Austen_texts/*.txt",
                        docvarsfrom = "filenames", dvsep = "_",
                        docvarnames = c("Author", "Book"))
austen_texts

book_corpus = corpus(austen_texts)
summary(book_corpus)

text <- c("I <3 little pumpkins! OMG they're so cute.")
tokens(text, what = "sentence")
## Tokens consisting of 1 document.
## text1 :
## [1] "I <3 little pumpkins!" "OMG they're so cute."
tokens(text, what = "character")
## Tokens consisting of 1 document.
## text1 :
## [1] "I" "<" "3" "l" "i" "t" "t" "l" "e" "p" "u" "m"
## [ ... and 23 more ]
text_token = tokens(text, what = "word") # default
text_token
## Tokens consisting of 1 document.
## text1 :
## [1] "I" "<" "3" "little" "pumpkins" "!"
## [7] "OMG" "they're" "so" "cute" "."
tokens(text,
       remove_punct = TRUE,
       remove_numbers = FALSE,
       remove_symbols = TRUE)
## Tokens consisting of 1 document.
## text1 :
## [1] "I" "3" "little" "pumpkins" "OMG" "they're" "so"
## [8] "cute"
# char_tolower(text) before tokenizing
tokens_tolower(text_token) # after tokenizing
## Tokens consisting of 1 document.
## text1 :
## [1] "i" "<" "3" "little" "pumpkins" "!"
## [7] "omg" "they're" "so" "cute" "."
# char_tolower(text, keep_acronyms = TRUE) before tokenizing
tokens_tolower(text_token, keep_acronyms = TRUE) # after tokenizing
## Tokens consisting of 1 document.
## text1 :
## [1] "i" "<" "3" "little" "pumpkins" "!"
## [7] "OMG" "they're" "so" "cute" "."
tokens_remove(text_token, stopwords("en"))
## Tokens consisting of 1 document.
## text1 :
## [1] "<" "3" "little" "pumpkins" "!" "OMG" "cute"
## [8] "."
tokens_remove(text_token, c("pumpkins", "cute"))
## Tokens consisting of 1 document.
## text1 :
## [1] "I" "<" "3" "little" "!" "OMG" "they're"
## [8] "so" "."
dip_text = c("When I dip, he dips, we are dipping.")
dip_token = tokens(dip_text)
tokens_wordstem(dip_token)
## Tokens consisting of 1 document.
## text1 :
## [1] "When" "I" "dip" "," "he" "dip" "," "we" "are" "dip"
## [11] "."
lr_text = c("Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,
In the Land of Mordor where the Shadows lie.")
lr_token = tokens(lr_text)
kwic(lr_token, "Ring", window = 2)

sample_token = tokens(data_corpus_inaugural)
sample_dfm = dfm(sample_token)
sample_dfm
## Document-feature matrix of: 59 documents, 9,439 features (91.84% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and house representatives :
## 1789-Washington 1 71 116 1 48 2 2 1
## 1793-Washington 0 11 13 0 2 0 0 1
## 1797-Adams 3 140 163 1 130 0 2 0
## 1801-Jefferson 2 104 130 0 81 0 0 1
## 1805-Jefferson 0 101 143 0 93 0 0 0
## 1809-Madison 1 69 104 0 43 0 0 0
## features
## docs among vicissitudes
## 1789-Washington 1 1
## 1793-Washington 0 0
## 1797-Adams 4 0
## 1801-Jefferson 1 0
## 1805-Jefferson 7 0
## 1809-Madison 0 0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]
sample_dfm[1:5, 1:7]
## Document-feature matrix of: 5 documents, 7 features (34.29% sparse) and 4 docvars.
## features
## docs fellow-citizens of the senate and house representatives
## 1789-Washington 1 71 116 1 48 2 2
## 1793-Washington 0 11 13 0 2 0 0
## 1797-Adams 3 140 163 1 130 0 2
## 1801-Jefferson 2 104 130 0 81 0 0
## 1805-Jefferson 0 101 143 0 93 0 0
dfm_trim(sample_dfm, min_termfreq = 30, max_termfreq = 100)
## Document-feature matrix of: 59 documents, 360 features (56.04% sparse) and 4 docvars.
## features
## docs fellow-citizens could greater order day present hand whose
## 1789-Washington 1 3 1 2 2 5 3 2
## 1793-Washington 0 0 0 0 0 1 0 0
## 1797-Adams 3 1 0 4 1 2 0 0
## 1801-Jefferson 2 0 1 1 1 0 0 2
## 1805-Jefferson 0 2 0 3 1 3 0 3
## 1809-Madison 1 1 0 0 0 1 0 2
## features
## docs love hopes
## 1789-Washington 2 1
## 1793-Washington 0 0
## 1797-Adams 5 0
## 1801-Jefferson 2 1
## 1805-Jefferson 1 0
## 1809-Madison 0 1
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 350 more features ]
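dfm_trim() can also filter on document frequency (in how many documents a feature appears) rather than on total term frequency. A minimal sketch on a made-up three-document example; the texts and the names toy_dfm and trimmed_dfm are assumptions for illustration only:

```r
library(quanteda)

# "apple" occurs in 3 documents, "banana" in 2, "cherry" in 1
toy_dfm <- dfm(tokens(c(d1 = "apple banana cherry",
                        d2 = "apple banana",
                        d3 = "apple")))

# keep only features that occur in at least 2 documents
trimmed_dfm <- dfm_trim(toy_dfm, min_docfreq = 2)
featnames(trimmed_dfm)
```

Term-frequency and document-frequency thresholds can be combined in the same dfm_trim() call.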
topfeatures(sample_dfm, 5)
##   the    of     ,   and     .
## 10183  7180  7173  5406  5155
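The top features of the raw dfm are dominated by function words and punctuation. One possible way to get more informative top features, sketched here by reusing the preprocessing steps shown earlier, is to drop punctuation and stopwords before building the dfm. The object names clean_token and clean_dfm are introduced only for this example.

```r
library(quanteda)

# remove punctuation while tokenizing, then drop English stopwords
clean_token <- tokens(data_corpus_inaugural, remove_punct = TRUE)
clean_token <- tokens_remove(clean_token, stopwords("en"))

clean_dfm <- dfm(clean_token)
topfeatures(clean_dfm, 5)
```

Whether stopwords should be removed depends on the analysis; for some tasks function words carry useful signal.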
quanteda takes its stop words from the stopwords package, which defaults to the “snowball” collection covering 15 languages.
library(stopwords)
stopwords_getsources() # to see other sources with different languages
## [1] "snowball" "stopwords-iso" "misc" "smart"
## [5] "marimo" "ancient" "nltk" "perseus"
stopwords_getlanguages("snowball") # to see the language options of a source
## [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
it_poem = ("E chiese al vecchio dammi il pane
Ho poco tempo e troppa fame
E chiese al vecchio dammi il vino
Ho sete e sono un assassino
Gli occhi dischiuse il vecchio al giorno
Non si guardò neppure intorno
Ma versò il vino e spezzò il pane
Per chi diceva ho sete e ho fame")
tokens(it_poem) %>%
  tokens_remove(stopwords("it"))
## Tokens consisting of 1 document.
## text1 :
## [1] "chiese" "vecchio" "dammi" "pane" "poco" "tempo" "troppa"
## [8] "fame" "chiese" "vecchio" "dammi" "vino"
## [ ... and 16 more ]
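The same stopwords() call accepts a source argument, so a list from one of the other sources printed by stopwords_getsources() can be swapped in. A small sketch comparing two Italian lists; the object names it_snowball and it_iso are made up for this example, and it assumes the "stopwords-iso" source provides an Italian list:

```r
library(stopwords)

it_snowball <- stopwords("it", source = "snowball")       # default source
it_iso      <- stopwords("it", source = "stopwords-iso")  # alternative source

length(it_snowball)
length(it_iso)

# a few words found only in the stopwords-iso list
head(setdiff(it_iso, it_snowball))
```

Either vector can be passed to tokens_remove() in place of stopwords("it").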