
Classifier and Technique to use for large number of categories

I am designing a scikit-learn classifier with 5000+ categories. The training data contains at least 80 million samples and may grow by an additional 100 million each year. I have already tried training on all the categories at once, but the resulting classifier is a binary file on the order of 100s of GBs. So I think having one classifier per category would help, and would also let me fine-tune features for each category and improve accuracy, but that means 5k+ classifiers. How should I handle these large data requirements, and which incremental classifiers should I use, given that I will keep receiving additional training data and may also discover new categories?

Update :

There are about 45 features, mostly text-based and categorical with large cardinality, i.e. many features can take a huge number of possible values. Available RAM is 32 GB on an 8-core CPU. I have tried Multinomial NB and linear SGD with extremely sparse matrices, using scikit-learn's DictVectorizer to vectorize the feature dictionaries. Also, would pandas DataFrames help to optimize the overall configuration?
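For context, a minimal sketch of the DictVectorizer step described above (the feature names here are made up for illustration): each distinct (feature, value) pair for a categorical feature becomes one column in a sparse matrix, which is why high-cardinality features blow up the column count.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records standing in for the real 45-feature dictionaries.
records = [
    {"category": "books", "seller": "A"},
    {"category": "music", "seller": "B"},
]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(records)
# Each string-valued categorical feature is one-hot encoded:
# here 2 values of "category" + 2 values of "seller" -> 4 columns.
print(X.shape)
```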

asked Nov 26 '25 by stackit

1 Answer

To sum up our discussion:

Incremental classifiers

"Incremental" classifiers are good candidates when you need to do out-of-core learning (i.e all your data does not fit in memory).
For classification in scikit-learn you mentionned MultinomialNB and SGDClassifier, which are the two main classifiers that implement the partial_fit api.
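The partial_fit loop looks roughly like this sketch (the minibatch generator is a stand-in for reading chunks of your real data from disk; note that with partial_fit you must declare every class up front via the classes argument on the first call):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical generator standing in for streaming chunks from disk.
def iter_minibatches(n_batches=5, batch_size=100, n_features=45):
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        X = rng.rand(batch_size, n_features)
        y = rng.randint(0, 3, batch_size)
        yield X, y

all_classes = np.arange(3)  # every class must be known before the first call
clf = SGDClassifier()
for X, y in iter_minibatches():
    clf.partial_fit(X, y, classes=all_classes)
```

The same loop works for MultinomialNB; only the estimator changes.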

For your purposes it seems like an online learning algorithm would be perfect. You can look into VowpalWabbit if you want to go that way. I had a chance to use it for a similar problem (6k+ classes) and the models were way lighter than 100GBs. I don't recall the exact size but I was able to store a bunch of them on my personal computer ;).

Note that documentation for VW is a bit scarce (nothing like scikit-learn) and you'll probably have to read some papers if you have a sophisticated use case. Here's a good tutorial to get started.
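To give a feel for what this looks like in practice, here is a sketch of the VW input format (1-based integer class labels, features after the `|`, with optional `name:value` weights; the feature names are invented for illustration), plus a typical invocation, assuming vw is installed:

```shell
# Write a tiny file in Vowpal Wabbit's input format.
cat > train.vw <<'EOF'
1 | color=red weight:2.5 title_has_sale
3 | color=blue weight:0.7
2 | color=red weight:1.1 title_has_new
EOF
cat train.vw

# Typical training command (not run here): --oaa N trains one-against-all
# over N classes, -f saves the model to disk.
#   vw --oaa 5000 train.vw -f model.vw
```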

Size of the pickled model

Your entire pickled pipeline is on the order of 100 GBs, which looks huge to me. I'd advise pickling each step separately as a way to profile the issue.
Sometimes you can drop some attributes before you pickle the estimators. An example is stop_words_ for a TfidfVectorizer (see the docs).
If the steps are storing large numpy arrays, joblib.dump (from sklearn.externals import joblib, or the standalone joblib package in recent scikit-learn versions) can be a more efficient alternative to pickle.
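The "drop attributes before pickling" idea can be illustrated with a toy stand-in (this class is invented for illustration; the real-world analogue is stop_words_ on a fitted TfidfVectorizer, which exists only for introspection and is safe to clear before persisting):

```python
import pickle

class ToyVectorizer:
    """Hypothetical fitted estimator with a large, droppable cache."""
    def __init__(self):
        self.vocabulary_ = {"spam": 0, "ham": 1}   # needed at predict time
        self.stop_words_ = {f"word{i}" for i in range(100_000)}  # large cache

vec = ToyVectorizer()
full = len(pickle.dumps(vec))

vec.stop_words_ = set()  # drop the cache before persisting
slim = len(pickle.dumps(vec))
print(full, slim)  # the slimmed pickle is dramatically smaller
```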

Training many binary classifiers

You probably don't want the overhead of maintaining 5k+ classifiers yourself. What you are describing is a One-Versus-All strategy for multiclass classification.
Note that when using LogisticRegression or SGDClassifier this is already how the multiclass problem is solved internally.
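You can see this in the shape of the fitted coefficients: for a multiclass problem, SGDClassifier fits one binary separator per class internally (the toy data below is random, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = rng.randint(0, 3, 60)  # 3 classes

clf = SGDClassifier(max_iter=20).fit(X, y)
# One weight vector per class: coef_ has shape (n_classes, n_features),
# i.e. the one-versus-all scheme is handled for you.
print(clf.coef_.shape)
```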

Conclusion

I'd say VowpalWabbit looks like a perfect fit, but there might be other tools out there for your use case.

For your last point: pandas won't help make the models lighter; it's a great library for manipulating and transforming the data, though.

answered Nov 28 '25 by ldirer


