TFIDF for Large Dataset

I have a corpus of around 8 million news articles, and I need their TF-IDF representation as a sparse matrix. I have been able to do this with scikit-learn for a relatively small number of samples, but I believe it can't be used for such a huge dataset, because it loads the input matrix into memory first, and that is an expensive process.

Does anyone know the best way to extract TF-IDF vectors for large datasets?

asked Aug 05 '14 18:08

apurva.nandan

2 Answers

Gensim has an efficient tf-idf model and does not need to have everything in memory at once.

Your corpus only needs to be an iterable, so the whole corpus never has to be in memory at once.

According to the comments, the make_wiki script runs over Wikipedia in about 50 minutes on a laptop.

answered Sep 21 '22 17:09

Jonathan Villemaire-Krajden

I believe you can use a HashingVectorizer to get a smallish csr_matrix out of your text data and then use a TfidfTransformer on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all; it could be that your system works reasonably well without it.

In practice you may have to subsample your data set. Sometimes a system will do just as well by learning from only 10% of the available data. This is an empirical question: there is no way to tell in advance which strategy would be best for your task. I wouldn't worry about scaling to 8M documents until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upwards trend).
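
A rough sketch of that pipeline, assuming the articles arrive in batches of strings (the `batches` argument, the helper name `hashed_tfidf`, and the default `n_features` are illustrative choices, not something from the question):

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer


def hashed_tfidf(batches, n_features=2**18):
    """Hash each batch of raw texts to term counts, then apply TF-IDF weighting."""
    # HashingVectorizer is stateless: no vocabulary is stored, so each batch
    # can be transformed independently. norm=None keeps raw counts and
    # alternate_sign=False keeps them non-negative for the TF-IDF step.
    hasher = HashingVectorizer(n_features=n_features, alternate_sign=False, norm=None)
    counts = vstack([hasher.transform(batch) for batch in batches], format="csr")
    # The IDF weights are learned from the stacked sparse count matrix.
    return TfidfTransformer().fit_transform(counts)
```

Only the sparse count matrix is held in memory, never the raw text of the whole corpus. The trade-off of hashing is that columns can no longer be mapped back to words, and distinct words may collide in the same column when `n_features` is small.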

Below is something I was working on this morning as an example. You can see the performance of the system tends to improve as I add more documents, but it is already at a stage where it seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.
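
A learning curve of the kind described can be computed along these lines. This is only a sketch under stated assumptions: it assumes a supervised classification task with one label per document (the question does not say what the vectors are for), and the classifier, the fractions, and the helper name `subsample_curve` are placeholders rather than the setup used in this answer.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline


def subsample_curve(texts, labels, fractions=(0.1, 0.25, 0.5, 1.0), cv=3):
    """Return training-set sizes and mean cross-validated score at each size."""
    # Hypothetical pipeline: hashed counts -> TF-IDF -> linear classifier.
    pipeline = make_pipeline(
        HashingVectorizer(n_features=2**16, alternate_sign=False, norm=None),
        TfidfTransformer(),
        LogisticRegression(max_iter=1000),
    )
    sizes, _, test_scores = learning_curve(
        pipeline, texts, labels,
        train_sizes=list(fractions), cv=cv, shuffle=True, random_state=0,
    )
    # Average the held-out scores over the cross-validation folds.
    return sizes, test_scores.mean(axis=1)
```

If the mean score has flattened out by the larger fractions, adding more of the 8M documents is unlikely to pay for the extra training time.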

answered Sep 19 '22 17:09

mbatchkarov

Tags: python, lucene, nlp, scikit-learn, tf-idf

apurva.nandan

People also ask

2 Answers

Jonathan Villemaire-Krajden

mbatchkarov

Recent Activity

Donate For Us