
TFIDF for Large Dataset

I have a corpus of around 8 million news articles, and I need their TF-IDF representation as a sparse matrix. I have been able to do that with scikit-learn for a relatively small number of samples, but I believe it can't be used for such a huge dataset, because it loads the input matrix into memory first and that is an expensive process.

Does anyone know what the best way would be to extract the TF-IDF vectors for large datasets?

asked Aug 05 '14 by apurva.nandan


People also ask

How do I make my TF-IDF faster?

Create two dictionaries. The first maps each document id to the list of all words found in that document (including repeats); the second maps each document id to the set of the document's unique words.
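As a rough sketch of that two-dictionary approach (the toy corpus and variable names here are made up for illustration):

```python
from collections import defaultdict

# Made-up toy corpus: document id -> raw text
docs = {
    0: "the cat sat on the mat",
    1: "the dog barked at the cat",
}

# First dictionary: doc id -> list of all tokens, repeats included
all_words = {doc_id: text.split() for doc_id, text in docs.items()}

# Second dictionary: doc id -> set of the document's unique tokens
unique_words = {doc_id: set(tokens) for doc_id, tokens in all_words.items()}

# Term frequency comes from counting over the first dictionary...
tf = {doc_id: defaultdict(int) for doc_id in docs}
for doc_id, tokens in all_words.items():
    for token in tokens:
        tf[doc_id][token] += 1

# ...and document frequency from membership in the second.
df = defaultdict(int)
for vocab in unique_words.values():
    for token in vocab:
        df[token] += 1
```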

Why TF-IDF is better than count Vectorizer?

TF-IDF is better than a plain count vectorizer because it not only reflects how frequently words occur in the corpus but also weights how informative each word is. We can then remove the words that are less important for the analysis, which makes model building less complex by reducing the input dimensionality.
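A rough illustration of the difference on a made-up three-document corpus (nothing here comes from the original answer):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "the" appears in two documents, "galaxy" in only one
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a galaxy far far away",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(corpus)  # raw term frequencies
tfidf = tfidf_vec.fit_transform(corpus)   # frequencies reweighted by idf

# A common word like "the" keeps a high raw count but gets a low tf-idf
# weight, which is what lets you drop uninformative features later on.
print(count_vec.get_feature_names_out())
print(counts.toarray())
print(tfidf.toarray().round(2))
```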

Is TF-IDF sparse or dense?

TfidfVectorizer usually creates sparse data. If the data is sparse enough, the matrices usually stay sparse all the way through the pipeline until the predictor is trained.
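A quick check of that behaviour, on a made-up two-document corpus:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["a small toy document", "another small toy document"]
X = TfidfVectorizer().fit_transform(corpus)

print(sparse.issparse(X))  # True: the result is a scipy CSR matrix
print(X.shape, X.nnz)      # only the non-zero entries are stored
# It only becomes dense if you explicitly call X.toarray() or X.todense().
```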


2 Answers

Gensim has an efficient tf-idf model and does not need to have everything in memory at once.

Your corpus just needs to be an iterable, so the whole corpus never has to be in memory at once.

According to the comments, the make_wiki script runs over Wikipedia in about 50 minutes on a laptop.
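A rough sketch of what that streaming setup might look like with gensim; the file name, tokenization, and filter thresholds below are assumptions for illustration, not anything from the answer:

```python
from gensim import corpora, models

# Hypothetical input: articles.txt with one news article per line.
class NewsCorpus:
    """Streams documents one at a time, so the 8M articles never sit in RAM together."""

    def __init__(self, path, dictionary=None):
        self.path = path
        self.dictionary = dictionary

    def iter_tokens(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

    def __iter__(self):
        for tokens in self.iter_tokens():
            yield self.dictionary.doc2bow(tokens)

# One streaming pass to build the vocabulary...
dictionary = corpora.Dictionary(NewsCorpus("articles.txt").iter_tokens())
dictionary.filter_extremes(no_below=5, no_above=0.5)

# ...and a second streaming pass to fit the tf-idf model.
corpus = NewsCorpus("articles.txt", dictionary)
tfidf = models.TfidfModel(corpus)
tfidf_corpus = tfidf[corpus]  # lazily yields sparse tf-idf vectors, one document at a time
```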

answered Sep 21 '22 by Jonathan Villemaire-Krajden


I believe you can use a HashingVectorizer to get a smallish csr_matrix out of your text data and then use a TfidfTransformer on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all: it could be the case that your system works reasonably well without it.
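A rough sketch of that route; the file name, chunk size, and n_features below are illustrative assumptions:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical helper: yields raw article strings one at a time from disk.
def iter_documents(path="articles.txt"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

def chunks(iterable, size=10000):
    batch = []
    for doc in iterable:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# HashingVectorizer is stateless, so it never needs the full corpus in memory;
# alternate_sign=False keeps the counts non-negative for the tf-idf step.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

# Vectorize in chunks and stack the resulting sparse blocks.
counts = sp.vstack([hasher.transform(batch) for batch in chunks(iter_documents())])

# TfidfTransformer then only needs the document frequencies, not the raw text.
tfidf = TfidfTransformer().fit_transform(counts)
```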

In practice you may have to subsample your data set: sometimes a system will do just as well by learning from 10% of all available data. This is an empirical question; there is no way to tell in advance what strategy would be best for your task. I wouldn't worry about scaling to 8M documents until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upward trend).
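For what it's worth, a sketch of how such a subsampling/learning-curve check might look; docs, labels, and the choice of classifier here are hypothetical placeholders, not part of the original answer:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def learning_curve_points(docs, labels, fractions=(0.01, 0.05, 0.1, 0.25), seed=0):
    """Train on growing random subsets and report held-out accuracy,
    to see whether more data is still buying anything."""
    train_docs, test_docs, y_train, y_test = train_test_split(
        docs, labels, test_size=0.2, random_state=seed)
    vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X_test = vec.transform(test_docs)
    rng = np.random.RandomState(seed)
    scores = []
    for frac in fractions:
        idx = rng.choice(len(train_docs), size=int(len(train_docs) * frac), replace=False)
        X_sub = vec.transform([train_docs[i] for i in idx])
        y_sub = [y_train[i] for i in idx]
        clf = SGDClassifier(random_state=seed).fit(X_sub, y_sub)
        scores.append((frac, accuracy_score(y_test, clf.predict(X_test))))
    return scores
```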

Below is something I was working on this morning as an example. You can see the performance of the system tends to improve as I add more documents, but it is already at a stage where it seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.

answered Sep 19 '22 by mbatchkarov