I am doing some text analysis work in Python. Unfortunately, I need to switch to R in order to use a particular package that cannot easily be replicated in Python.
Currently the text is parsed into bigram counts, reduced to a vocabulary of about 11,000 bigrams, and then stored as a dictionary:
{id1: {'bigrams': [(bigram1, count), (bigram2, count), ...]},
 id2: {'bigrams': ...},
 ...}
I need to get this into a dgCMatrix in R, where the rows are id1, id2, ... and the columns are the different bigrams such that a cell represents the 'count' for that id-bigram.
Any suggestions? I thought about expanding it just to a massive CSV, but that seems super inefficient plus probably infeasible due to memory constraints.
Could you write out the matrix in MatrixMarket format using scipy's mmwrite and then read it into R using readMM from the Matrix package?
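As a rough sketch of the Python side, assuming the dictionary has exactly the shape shown in the question: the helper names `dict_to_sparse` and `write_for_r` and the output file names are made up for illustration. MatrixMarket files do not carry row or column names, so the sketch writes the ids and bigrams to two separate text files.

```python
from scipy.io import mmwrite
from scipy.sparse import coo_matrix


def dict_to_sparse(data):
    """Build a sparse id-by-bigram count matrix from the nested dictionary.

    Assumes data looks like {id: {'bigrams': [(bigram, count), ...]}, ...}.
    Rows follow the dictionary's insertion order; columns follow the order
    in which each bigram is first seen.
    """
    row_ids = list(data)
    col_index = {}  # bigram -> column number
    rows, cols, vals = [], [], []
    for i, doc_id in enumerate(row_ids):
        for bigram, count in data[doc_id]['bigrams']:
            j = col_index.setdefault(bigram, len(col_index))
            rows.append(i)
            cols.append(j)
            vals.append(count)
    shape = (len(row_ids), len(col_index))
    # Only the nonzero triplets are stored; tocsr() sums any duplicate entries.
    mat = coo_matrix((vals, (rows, cols)), shape=shape, dtype='int64').tocsr()
    return mat, row_ids, list(col_index)


def write_for_r(data, prefix):
    """Write <prefix>.mtx plus plain-text files of row and column names."""
    mat, row_ids, vocab = dict_to_sparse(data)
    mmwrite(prefix + '.mtx', mat)
    with open(prefix + '_rows.txt', 'w') as f:
        f.write('\n'.join(map(str, row_ids)))
    with open(prefix + '_cols.txt', 'w') as f:
        f.write('\n'.join(map(str, vocab)))
```

On the R side, readMM should give back a sparse matrix in triplet form, so a conversion such as as(m, "CsparseMatrix") is probably still needed to end up with a dgCMatrix, and the row and column names can be reattached from the two text files (for example with readLines). This keeps memory proportional to the number of nonzero counts rather than to ids times 11,000.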