I am working on a relatively large text-based web classification problem and I am planning on using the multinomial Naive Bayes classifier in sklearn in Python and the Scrapy framework for the crawling. However, I am a little concerned that sklearn/Python might be too slow for a problem that could involve classification of millions of websites. I have already trained the classifier on several thousand websites from DMOZ. The research framework is as follows:
1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). (The number of tokenized words here varies from a few thousand up to 150K for a sample run of the crawler.)

2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.
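For reference, a rough sketch of what step 2 looks like on my side (toy data here; the real run uses the DMOZ training set and roughly 50,000 features, and the vectorizer choice is just illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the DMOZ training data (the real run uses several thousand sites)
train_texts = ["python machine learning tutorial", "buy cheap shoes online store"]
train_labels = ["Computers", "Shopping"]

vectorizer = CountVectorizer(max_features=50000)  # cap the vocabulary at ~50K features
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# Step 2: classify the concatenated text scraped from ~20 links of a domain
scraped_text = "scikit-learn classifier examples and numpy tips"
predicted_category = clf.predict(vectorizer.transform([scraped_text]))[0]
print(predicted_category)
```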
My question is whether a Python-based classifier would be up to the task for such a large-scale application, or should I try rewriting the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If yes, what might that environment be? Or perhaps Python is enough if accompanied by some parallelization of the code? Thanks.
Use the HashingVectorizer and one of the linear classification modules that supports the partial_fit API, for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier, to incrementally learn the model without having to vectorize and load all the data in memory upfront. You should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
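A minimal sketch of the out-of-core loop, assuming the crawler delivers (texts, labels) mini-batches (the generator and class names below are placeholders, not your actual pipeline):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless hashing vectorizer: 2**18 (~260K) hashed features, no vocabulary kept in memory
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()  # Perceptron or PassiveAggressiveClassifier work the same way

all_classes = ["Arts", "Business", "Computers"]  # partial_fit needs the full label set upfront

def iter_minibatches():
    """Placeholder generator; in practice, stream (texts, labels) batches from the crawler."""
    yield ["python tutorial", "stock market news"], ["Computers", "Business"]
    yield ["oil painting gallery", "c++ compiler docs"], ["Arts", "Computers"]

for texts, labels in iter_minibatches():
    X = vectorizer.transform(texts)  # no fit step: hashing is stateless
    clf.partial_fit(X, labels, classes=all_classes)
```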
You should however load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class of the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) with the same RandomizedSearchCV, either on that subsample or on a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
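Something along these lines (the candidate parameter values are illustrative, and this assumes a scikit-learn version where RandomizedSearchCV lives in sklearn.model_selection):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy in-memory subsample (use e.g. 100k real documents in practice)
texts = ["python code", "stock prices", "oil painting", "linux kernel"] * 10
labels = ["Computers", "Business", "Arts", "Computers"] * 10

pipeline = Pipeline([
    ("vec", HashingVectorizer()),
    ("clf", SGDClassifier()),
])

# Candidate values to sample from; tune both the vectorizer and the regularization strength
param_distributions = {
    "vec__n_features": [2**16, 2**18, 2**20],
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

search = RandomizedSearchCV(pipeline, param_distributions, n_iter=10, cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```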
Also, linear models can be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn a linear model on each partition independently, and then average the models to get the final model.
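A sketch of that averaging step, assuming every partition is vectorized with the same hashing space and sees the same set of classes (the data here is just a toy stand-in):

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # same hashing space for every partition

# Two partitions trained independently (e.g. on different machines), same classes in both
texts_a, labels_a = ["oil painting exhibit", "python code tips"], ["Arts", "Computers"]
texts_b, labels_b = ["sculpture museum visit", "linux kernel patch"], ["Arts", "Computers"]

clf_a = SGDClassifier().fit(vectorizer.transform(texts_a), labels_a)
clf_b = SGDClassifier().fit(vectorizer.transform(texts_b), labels_b)

# Build the final model by averaging the weights and intercepts of the partial models
clf_avg = clone(clf_a)
clf_avg.coef_ = (clf_a.coef_ + clf_b.coef_) / 2.0
clf_avg.intercept_ = (clf_a.intercept_ + clf_b.intercept_) / 2.0
clf_avg.classes_ = clf_a.classes_

print(clf_avg.predict(vectorizer.transform(["modern art gallery"])))
```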
Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck as most critical portions of those libraries are implemented as C-extensions.
But since you're scraping millions of sites, you're going to be limited by a single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.
An example would be to funnel your scraping through Cloud Queues [2].
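Cloud Queues have their own API; as a purely local stand-in, the same fan-out pattern looks roughly like this with Python's multiprocessing (process_domain and its internals are hypothetical placeholders, not the PiCloud interface):

```python
from multiprocessing import Pool

def process_domain(domain):
    # Hypothetical worker: in the real pipeline this would call the Scrapy-based
    # scraper and the trained classifier; here it just returns a placeholder label.
    return domain, "Computers"

if __name__ == "__main__":
    domains = ["example.com", "example.org", "example.net"]
    # A hosted queue service replaces Pool once the workload spans many machines
    with Pool(processes=4) as pool:
        for domain, category in pool.imap_unordered(process_domain, domains):
            print(domain, category)
```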
[1] http://www.picloud.com
[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/