I have a large dfmSparse object (2.1 GB) that was tokenized into n-grams (unigrams, bigrams, trigrams and four-grams), and I want to convert it to a data frame or data.table object with two columns: Content and Frequency.
I tried to unlist it, but that didn't work. I'm new to NLP and don't know which method to use. I'm out of ideas and couldn't find a solution here or on Google.
Some info about the data:
>str(tokfreq)
Formal class 'dfmSparse' [package "quanteda"] with 11 slots
..@ settings :List of 1
.. ..$ : NULL
..@ weighting : chr "frequency"
..@ smooth : num 0
..@ ngrams : int [1:4] 1 2 3 4
..@ concatenator: chr "_"
..@ Dim : int [1:2] 167500 19765478
..@ Dimnames :List of 2
.. ..$ docs : chr [1:167500] "character(0).content" "character(0).content" "character(0).content" "character(0).content" ...
.. ..$ features: chr [1:19765478] "add" "lime" "juice" "tequila" ...
..@ i : int [1:54488417] 0 75 91 178 247 258 272 327 371 391 ...
..@ p : int [1:19765479] 0 3218 3453 4015 4146 4427 4637 140665 140736 142771 ...
..@ x : num [1:54488417] 1 1 1 1 5 1 1 1 1 1 ...
..@ factors : list()
>summary(tokfreq)
Length Class Mode
3310717565000 dfmSparse S4
Thanks!
EDITED: This is how I created the dataset from a corpus:
# tokenize
tokenized <- tokenize(x = teste, ngrams = 1:4)
# Creating the dfm
tokfreq <- dfm(x = tokenized)
This should do it, if I understood what you mean by "Content" and "Frequency". Note that with this approach the data.frame is no larger than the sparse matrix, because you are recording only the total count of each feature, not its distribution across documents.
myDfm <- dfm(data_corpus_inaugural, ngrams = 1:4, verbose = FALSE)
head(myDfm)
## Document-feature matrix of: 57 documents, 314,224 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens of the senate and house
## 1789-Washington 1 71 116 1 48 2
## 1793-Washington 0 11 13 0 2 0
## 1797-Adams 3 140 163 1 130 0
## 1801-Jefferson 2 104 130 0 81 0
## 1805-Jefferson 0 101 143 0 93 0
## 1809-Madison 1 69 104 0 43 0
# convert to a data.frame
df <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm),
row.names = NULL, stringsAsFactors = FALSE)
head(df)
## Content Frequency
## 1 fellow-citizens 39
## 2 of 7055
## 3 the 10011
## 4 senate 15
## 5 and 5233
## 6 house 11
tail(df)
## Content Frequency
## 314219 and_may_he_forever 1
## 314220 may_he_forever_bless 1
## 314221 he_forever_bless_these 1
## 314222 forever_bless_these_united 1
## 314223 bless_these_united_states 1
## 314224 these_united_states_of 1
object.size(df)
## 25748240 bytes
object.size(myDfm)
## 29463592 bytes
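To illustrate why summing columns keeps the result small, here is a minimal sketch that uses only the Matrix package and a made-up toy matrix in place of a real dfm. The matrix, its document names and its feature names are hypothetical. The assumption is that a dfm behaves like a column-compressed sparse matrix, as the i, p and x slots in the question suggest.

```r
library(Matrix)

# Hypothetical stand-in for a dfm: 3 documents x 4 features, stored sparsely.
m <- sparseMatrix(
  i = c(1, 2, 3, 1, 3, 2),   # row (document) index of each non-zero count
  j = c(1, 1, 1, 2, 3, 4),   # column (feature) index of each non-zero count
  x = c(2, 1, 4, 1, 3, 5),   # the counts themselves
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("add", "lime", "juice", "lime_juice"))
)

# Column sums are computed on the sparse representation, so the matrix
# is never expanded to its full dense size.
freq <- data.frame(Content = colnames(m),
                   Frequency = Matrix::colSums(m),
                   row.names = NULL, stringsAsFactors = FALSE)

# Put the most frequent features first.
freq <- freq[order(-freq$Frequency), ]
print(freq)
```

The result has one row per feature regardless of how many documents there are, which is why it can be smaller than the matrix it was built from.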
Added 2018-02-25
In quanteda >= 1.0.0 there is a function, textstat_frequency(), that will produce the data.frame that you want, e.g.
textstat_frequency(data_dfm_lbgexample) %>% head()
# feature frequency rank docfreq group
# 1 P 356 1 5 all
# 2 O 347 2 4 all
# 3 Q 344 3 5 all
# 4 N 317 4 4 all
# 5 R 316 5 4 all
# 6 S 280 6 4 all