
Convert dfmSparse from Quanteda package to Data Frame or Data Table in R

I have a large dfmSparse object (2.1 GB), tokenized with n-grams (unigrams, bigrams, trigrams and four-grams), and I want to convert it to a data frame or a data table with two columns: Content and Frequency.

I tried unlist(), but it didn't work. I'm new to NLP and don't know which method to use. I'm out of ideas and haven't found a solution here or on Google.

Some info about the data:

>str(tokfreq)
Formal class 'dfmSparse' [package "quanteda"] with 11 slots
  ..@ settings    :List of 1
  .. ..$ : NULL
  ..@ weighting   : chr "frequency"
  ..@ smooth      : num 0
  ..@ ngrams      : int [1:4] 1 2 3 4
  ..@ concatenator: chr "_"
  ..@ Dim         : int [1:2] 167500 19765478
  ..@ Dimnames    :List of 2
  .. ..$ docs    : chr [1:167500] "character(0).content" "character(0).content" "character(0).content" "character(0).content" ...
  .. ..$ features: chr [1:19765478] "add" "lime" "juice" "tequila" ...
  ..@ i           : int [1:54488417] 0 75 91 178 247 258 272 327 371 391 ...
  ..@ p           : int [1:19765479] 0 3218 3453 4015 4146 4427 4637 140665 140736 142771 ...
  ..@ x           : num [1:54488417] 1 1 1 1 5 1 1 1 1 1 ...
  ..@ factors     : list()

>summary(tokfreq)
       Length         Class          Mode 
3310717565000     dfmSparse            S4

Thanks!

EDITED: This is how I created the dataset from a corpus:

# tokenize
tokenized <- tokenize(x = teste, ngrams = 1:4)
# Creating the dfm
tokfreq <- dfm(x = tokenized)

Diego Gaona asked Mar 12 '23 19:03



1 Answer

This should do it, if I understood correctly what you mean by "Content" and "Frequency". Note that with this approach the data.frame is no larger than the sparse matrix, since it records only the total count per feature and does not store the per-document distribution.

# document-feature matrix of 1- to 4-grams from the built-in inaugural address corpus
myDfm <- dfm(data_corpus_inaugural, ngrams = 1:4, verbose = FALSE)
head(myDfm)
## Document-feature matrix of: 57 documents, 314,224 features.
## (showing first 6 documents and first 6 features)
##                  features
## docs              fellow-citizens  of the senate and house
##   1789-Washington               1  71 116      1  48     2
##   1793-Washington               0  11  13      0   2     0
##   1797-Adams                    3 140 163      1 130     0
##   1801-Jefferson                2 104 130      0  81     0
##   1805-Jefferson                0 101 143      0  93     0
##   1809-Madison                  1  69 104      0  43     0

# convert to a data.frame: one row per feature, with its total count over all documents
df <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm), 
                 row.names = NULL, stringsAsFactors = FALSE)
head(df)
##           Content Frequency
## 1 fellow-citizens        39
## 2              of      7055
## 3             the     10011
## 4          senate        15
## 5             and      5233
## 6           house        11
tail(df)
##                           Content Frequency
## 314219         and_may_he_forever         1
## 314220       may_he_forever_bless         1
## 314221     he_forever_bless_these         1
## 314222 forever_bless_these_united         1
## 314223  bless_these_united_states         1
## 314224     these_united_states_of         1    

object.size(df)
## 25748240 bytes
object.size(myDfm)
## 29463592 bytes
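
For anyone who wants to try the idea without quanteda, here is a minimal sketch on a toy sparse matrix. It assumes only the Matrix package; the str() output in the question shows the i, p and x slots of a column-compressed sparse matrix, so a small sparseMatrix() object is used here as a stand-in for the dfm. The documents, features and counts are made up for illustration.

```r
library(Matrix)

# Hypothetical toy counts: 3 documents x 4 features (a stand-in for a dfm)
m <- sparseMatrix(
  i = c(1, 2, 3, 1, 3, 2),
  j = c(1, 1, 1, 2, 3, 4),
  x = c(1, 2, 3, 5, 1, 4),
  dimnames = list(c("d1", "d2", "d3"),
                  c("add", "lime", "juice", "lime_juice"))
)

# Column sums are taken on the sparse object, so the result is one number
# per feature and the full documents x features table is never built
freq <- colSums(m)

df <- data.frame(Content = colnames(m), Frequency = unname(freq),
                 stringsAsFactors = FALSE)

# Most frequent features first
df <- df[order(-df$Frequency), ]
rownames(df) <- NULL
df
```

The point of the design is that only one number per feature is kept, so the result has as many rows as there are features no matter how many documents there are.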

Added 2018-02-25

In quanteda >= 1.0.0 there is a function textstat_frequency() that will produce the data.frame that you want, e.g.

head(textstat_frequency(data_dfm_lbgexample))
#   feature frequency rank docfreq group
# 1       P       356    1       5   all
# 2       O       347    2       4   all
# 3       Q       344    3       5   all
# 4       N       317    4       4   all
# 5       R       316    5       4   all
# 6       S       280    6       4   all
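
As far as I know, in quanteda 3.x this function is provided by the separate quanteda.textstats package. If that is not available, the feature, frequency, rank and docfreq columns can be approximated by hand. The sketch below is not the quanteda implementation; it assumes only the Matrix package, uses a made-up toy matrix as a stand-in for a dfm, and picks "min" for tied ranks as an assumption.

```r
library(Matrix)

# Hypothetical toy counts: 4 documents x 3 features (a stand-in for a dfm)
m <- sparseMatrix(
  i = c(1, 4, 2, 3, 4, 1),
  j = c(1, 1, 2, 2, 2, 3),
  x = c(2, 4, 3, 3, 1, 5),
  dimnames = list(paste0("doc", 1:4), c("add", "lime", "juice"))
)

frequency <- unname(colSums(m))      # total count per feature
docfreq   <- unname(colSums(m > 0))  # number of documents containing the feature

stats <- data.frame(
  feature   = colnames(m),
  frequency = frequency,
  rank      = rank(-frequency, ties.method = "min"),  # 1 = most frequent
  docfreq   = docfreq,
  stringsAsFactors = FALSE
)

# Order by rank, most frequent feature first
stats <- stats[order(stats$rank), ]
rownames(stats) <- NULL
stats
```

Both colSums() calls operate on the sparse representation, so this should scale in the same way as the data.frame approach above.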

Ken Benoit answered Mar 15 '23 13:03
