I'm working with large, sparse matrices (document-feature matrices generated from text) in python. It's taking quite a bit of processing time and memory to chew through these, and I imagine that sparse matrices could offer some improvements. But I'm worried that using a sparse matrix library is going to make it harder to plug into other python (and R, through rpy2) modules.
Can people who've crossed this bridge already offer some advice? What are the pros and cons of using sparse matrices in python/R, in terms of performance, scalability, and compatibility?
Implementing sparse matrices yourself in pure Python might not be a great idea. Have you checked out scipy's sparse matrices (the `scipy.sparse` module)? Note that they live in scipy rather than numpy proper.
NumPy and SciPy bring the immense benefit of doing the heavy number work in C, which is where the performance gains in Python come from.
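As a starting point, here is a minimal sketch of building a document-feature matrix with `scipy.sparse`; the triplet data is made up for illustration and would really come from your tokenizer/vectorizer:

```python
import numpy as np
from scipy import sparse

# Illustrative (doc, feature, count) triplets for a tiny corpus.
rows = np.array([0, 0, 1, 2, 2, 2])   # document indices
cols = np.array([1, 4, 2, 0, 1, 4])   # feature (term) indices
vals = np.array([3, 1, 2, 1, 1, 5])   # term counts

# COO is convenient for construction; CSR is efficient for
# row slicing and matrix-vector products.
X = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 5)).tocsr()

print(X.shape, X.nnz)   # (3, 5) and 6 stored (nonzero) entries
print(X[2].toarray())   # dense view of one document's row
```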
From my limited experience doing text processing in R, its performance made it pretty much unusable for anything beyond exploratory data analysis.
Regardless, you shouldn't be using vanilla Python lists for sparse matrices; it will (understandably) take a while to chew through them.
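To make the difference concrete, here is a rough back-of-the-envelope comparison (the shape and density are invented, but typical of document-feature matrices):

```python
from scipy import sparse

# A mostly-zero matrix: 10,000 documents x 50,000 features,
# with roughly 0.1% of the entries nonzero.
X = sparse.random(10_000, 50_000, density=0.001, format="csr")

dense_bytes = X.shape[0] * X.shape[1] * 8   # float64 dense storage
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.0f} MB")    # ~4000 MB
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")   # ~6 MB
```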
There are several ways to represent sparse matrices (the documentation for the R SparseM package reports 20 different ways to store sparse matrix data), so complete compatibility with all solutions is probably out of the question. The number of options also suggests that there is no best-in-all-situations solution.
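That said, scipy covers the handful of formats you are most likely to need and converts between them cheaply, so in practice you pick a layout per operation rather than one for all time. A quick sketch:

```python
from scipy import sparse

# One logical matrix, several physical layouts.
X = sparse.random(1_000, 1_000, density=0.01)

coo = X.tocoo()   # triplet form: easiest to build and export
csr = X.tocsr()   # fast row slicing, fast matrix-vector products
csc = X.tocsc()   # fast column slicing, preferred by some solvers
lil = X.tolil()   # cheap incremental edits; convert before math
```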
Pick either scipy's sparse matrices or R's SparseM (through rpy2), according to where your heavy number-crunching routines on those matrices live (scipy or R).
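If the crunching ends up on the R side, the conversion glue is the fiddly part. The sketch below hands a scipy COO matrix to R through rpy2 using the Matrix package's `sparseMatrix` constructor rather than SparseM (both accept triplet input; swap in SparseM's constructors if you prefer it). Treat it as a rough outline, not tested glue code:

```python
from scipy import sparse
from rpy2.robjects import IntVector, FloatVector
from rpy2.robjects.packages import importr

Matrix = importr("Matrix")   # R's Matrix package; SparseM is similar

coo = sparse.random(100, 200, density=0.05, format="coo")

# R is 1-indexed, so shift the row/column indices before crossing over.
r_mat = Matrix.sparseMatrix(
    i=IntVector((coo.row + 1).tolist()),
    j=IntVector((coo.col + 1).tolist()),
    x=FloatVector(coo.data.tolist()),
    dims=IntVector(list(coo.shape)),
)
```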