Using sklearn and Python for a large application classification/scraping exercise

I am working on a relatively large text-based web classification problem, and I am planning to use the multinomial Naive Bayes classifier in sklearn (Python) and the Scrapy framework for the crawling. However, I am a little concerned that sklearn/Python might be too slow for a problem that could involve classifying millions of websites. I have already trained the classifier on several thousand websites from DMOZ. The research framework is as follows:

1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). The number of tokenized words here varies from a few thousand up to 150K for a sample run of the crawler.

2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result (a rough sketch of this step follows below).
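For concreteness, here is a minimal sketch of step 2; the DMOZ training texts, labels, and scraped page text below are only placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data standing in for the several thousand DMOZ pages.
dmoz_texts = ["python tutorials and code examples", "latest football scores and news"]
dmoz_labels = ["Computers", "Sports"]

# Cap the vocabulary at roughly 50,000 features, as described above.
vectorizer = CountVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(dmoz_texts)
clf = MultinomialNB().fit(X_train, dmoz_labels)

# Concatenate the text scraped from the ~20 links of one domain and classify it.
page_texts = ["python web scraping with scrapy", "sklearn classifier examples"]
domain_text = " ".join(page_texts)
print(clf.predict(vectorizer.transform([domain_text]))[0])
```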

My question is whether a Python-based classifier would be up to the task for such a large-scale application, or should I try rewriting the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If yes, what might that environment be? Or is Python enough if accompanied by some parallelization of the code? Thanks

asked Apr 13 '13 by gpanterov

2 Answers

Use the HashingVectorizer and one of the linear classification modules that support the partial_fit API, for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier, to incrementally learn the model without having to vectorize and load all the data in memory upfront. You should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
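For example, a rough sketch of that out-of-core loop; the stream_batches generator below is only a stand-in for however you feed crawled text in from disk or a queue:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# All class labels must be known upfront for partial_fit.
all_classes = ["Computers", "Sports", "News"]

def stream_batches():
    # Placeholder: yield (texts, labels) chunks read from disk, a database or the crawler.
    yield (["python web scraping", "football results"], ["Computers", "Sports"])
    yield (["breaking headlines today", "numpy array tips"], ["News", "Computers"])

# HashingVectorizer is stateless: no fit step and no vocabulary held in memory.
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier()  # or Perceptron() / PassiveAggressiveClassifier()

for texts, labels in stream_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)
```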

You should, however, load a small subsample that fits in memory (e.g. 100k documents) and search for good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class from the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV or a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
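As a rough illustration (the import paths match current scikit-learn releases and the parameter ranges are placeholders, not recommendations), tuning a Pipeline with RandomizedSearchCV on an in-memory subsample might look like:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small in-memory subsample standing in for e.g. 100k real documents.
texts = ["python web scraping", "football results", "breaking headlines", "numpy tips"] * 50
labels = ["Computers", "Sports", "News", "Computers"] * 50

pipe = Pipeline([
    ("vec", HashingVectorizer()),
    ("clf", SGDClassifier()),
])

# Illustrative search space for both the vectorizer and the regularization strength.
param_distributions = {
    "vec__n_features": [2 ** 16, 2 ** 18, 2 ** 20],
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```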

Also, linear models can be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn the linear models independently and then average them to get the final model.
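A minimal sketch of that averaging trick, assuming both models were trained with the same hashed feature space and class list:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2 ** 16)
classes = ["Computers", "Sports"]

# Two disjoint partitions of the (placeholder) training data.
part1 = (["python code tutorial", "football scores today"], ["Computers", "Sports"])
part2 = (["numpy array tricks", "tennis match results"], ["Computers", "Sports"])

clf_a = SGDClassifier().partial_fit(vec.transform(part1[0]), part1[1], classes=classes)
clf_b = SGDClassifier().partial_fit(vec.transform(part2[0]), part2[1], classes=classes)

# Build a third model with the same shape, then overwrite its parameters
# with the average of the two independently trained models.
avg = SGDClassifier()
avg.partial_fit(vec.transform(part1[0][:1]), part1[1][:1], classes=classes)
avg.coef_ = (clf_a.coef_ + clf_b.coef_) / 2.0
avg.intercept_ = (clf_a.intercept_ + clf_b.intercept_) / 2.0
print(avg.predict(vec.transform(["live football scores"])))
```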

answered Nov 14 '22 by ogrisel

Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck as most critical portions of those libraries are implemented as C-extensions.

But since you're scraping millions of sites, you're going to be limited by a single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.

An example would be to funnel your scraping through Cloud Queues [2].

[1] http://www.picloud.com

[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/
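PiCloud's own queue API is not shown here, but as a rough local stand-in for the same partitioning idea, Python's standard multiprocessing module can fan domains out to worker processes; classify_domain below is a hypothetical scrape-and-classify function:

```python
from multiprocessing import Pool

def classify_domain(domain):
    # Placeholder: scrape ~20 links from the domain, vectorize the text, classify it.
    return domain, "Computers"

domains = ["example.org", "example.net", "example.com"]

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        for domain, label in pool.imap_unordered(classify_domain, domains):
            print(domain, label)
```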

answered Nov 14 '22 by jriley