I am writing an ML module (Python) to predict tags for a Stack Overflow question (title + body). My corpus is around 5 million questions, with title, body, and tags for each. I'm splitting this 3:2 into training and testing sets. I'm plagued by the curse of dimensionality.
ngram association: for each unigram and bigram in the title and body of each question, I maintain a list of the associated tags, stored in Redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. Ex.
"continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
Note: There are 2 problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an ngram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be haphazardly removed. The number of tags associated with a given ngram easily runs into the thousands, most of them unrelated and irrelevant.
tfidf: to aid in selecting which ngrams to keep, I calculated the tfidf score over the entire corpus for each unigram and bigram and stored the corresponding idf value alongside the associated tags. Ex.
"continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
The tfidf scores are stored in a document × feature sparse.csr_matrix (generated by fit_transform()), and I'm not sure how I can leverage that at the moment.
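One way it could be leveraged is supervised feature selection, e.g. scoring each n-gram against the tags with a chi-squared test and keeping only the best-scoring columns. A sketch under those assumptions (`X` is the csr_matrix from fit_transform(); `question_tags` is a hypothetical list of tag lists aligned with the rows of X; the 100k cut-off is arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_selection import chi2

# Binarize the tags into a documents x tags 0/1 matrix.
Y = MultiLabelBinarizer().fit_transform(question_tags)

# Chi-squared score of every n-gram against each tag; keep a feature if it is
# strongly associated with at least one tag.
scores = np.zeros(X.shape[1])
for j in range(Y.shape[1]):
    chi, _ = chi2(X, Y[:, j])
    scores = np.maximum(scores, np.nan_to_num(chi))

top = np.argsort(scores)[-100_000:]   # column indices of the 100k best-scoring n-grams
X_reduced = X[:, top]
```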
Once the feature set is reduced, the way I have planned to use it is as follows:
Are there any suggestions on how to improve upon this? Can a classifier come in handy?
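If a classifier is an option, one candidate would be a one-vs-rest linear model over the (reduced) tf-idf matrix, predicting the binarized tags. A minimal sketch, assuming scikit-learn and the hypothetical `X_reduced` and `Y` from the sketch above (train_size=0.6 mirrors the 3:2 split):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# 3:2 split of documents into training and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(X_reduced, Y, train_size=0.6)

# One sparse linear model per tag; SGD keeps memory and training time manageable
# at this scale, and n_jobs=-1 fits the per-tag models in parallel.
clf = OneVsRestClassifier(SGDClassifier(), n_jobs=-1)
clf.fit(X_train, Y_train)

predicted = clf.predict(X_test)   # documents x tags 0/1 indicator matrix
```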
I have a 16-core machine with 16 GB of RAM. The Redis server (which I'll move to a different machine) keeps its dataset in RAM and is ~10 GB. All the tasks mentioned above (apart from tfidf) are done in parallel using IPython clusters.
Use the public API of Dandelion; they provide a demo.
It extracts concepts from a text, so, in order to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.
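A rough sketch of calling it with `requests`; the endpoint URL, parameter names, and response fields below are from memory of Dandelion's datatxt/nex entity-extraction API and should be verified against the current documentation:

```python
import requests

# Assumed entity-extraction endpoint; check Dandelion's docs for the current URL.
DANDELION_NEX_URL = "https://api.dandelion.eu/datatxt/nex/v1"

def extract_concepts(text, token):
    """Return the concept spots Dandelion annotates in `text`."""
    resp = requests.get(DANDELION_NEX_URL, params={"text": text, "token": token})
    resp.raise_for_status()
    # Field names ("annotations", "spot") may differ; check the response schema.
    return [ann.get("spot") for ann in resp.json().get("annotations", [])]

# extract_concepts("How do I set up continuous integration with Jenkins?", "YOUR_API_TOKEN")
```

The extracted concepts could then replace (or be appended to) the n-gram features, giving a far smaller, more semantic feature set.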