
Feature space reduction for tag prediction

I am writing an ML module (Python) to predict tags for a Stack Overflow question (title + body). My corpus is around 5 million questions, each with a title, body and tags. I'm splitting this 3:2 for training and testing, and I'm plagued by the curse of dimensionality.


Work Done

  1. Pre-processing: markup removal, stopword removal, special-character removal and a few other clean-up steps, with the results stored in MySQL. This almost halves the size of the data.
  2. n-gram association: for each unigram and bigram in the title and body of each question, I maintain a list of the associated tags, stored in Redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies (a sketch of this step follows the note below). Ex.

    "continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
    

Note: There are two problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an n-gram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be removed haphazardly. The number of tags associated with a given n-gram easily runs into the thousands, most of them unrelated and irrelevant.
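
For concreteness, a simplified sketch of what the association step (2.) looks like; the "ngram:<term>" key layout and the tokenising regex here are illustrative choices rather than the exact ones used.

    import re

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def ngrams(tokens, n):
        """Yield n-grams as space-joined strings."""
        return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def index_question(title, body, tags):
        """For every unigram/bigram in the question, bump the count of each of its tags."""
        tokens = re.findall(r"[a-z0-9+#.\-]+", (title + " " + body).lower())
        for n in (1, 2):
            for gram in set(ngrams(tokens, n)):   # set(): count each gram once per question
                for tag in tags:
                    # one Redis hash per n-gram: field = tag, value = co-occurrence count
                    r.hincrby("ngram:" + gram, tag, 1)

    # Usage:
    # index_question("Continuous integration on Windows",
    #                "How do I set up Jenkins ...",
    #                ["ci", "jenkins", "windows"])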

  3. tfidf: to aid in selecting which n-grams to keep, I calculated the tfidf score over the entire corpus for each unigram and bigram and stored the corresponding idf values alongside the associated tags. Ex.

    "continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
    

    The tfidf scores are stored in a document × feature scipy.sparse csr_matrix (generated by fit_transform()), and I'm not sure how to leverage it at the moment.
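
For reference, a simplified sketch of how the idf side of this can be produced with scikit-learn; `documents` is assumed to be an iterable of the cleaned title + body strings, and the Redis line at the end mirrors the "__idf__" field shown above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # documents: iterable of cleaned "title + body" strings (assumed to exist)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2),   # unigrams and bigrams
                                 min_df=5,             # drop very rare n-grams up front
                                 sublinear_tf=True)
    X = vectorizer.fit_transform(documents)            # document x feature csr_matrix

    # per-feature idf values, aligned with the vectorizer's vocabulary
    # (use get_feature_names() on older scikit-learn versions)
    idf_by_ngram = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # e.g. stored next to the tag counts in Redis:
    # r.hset("ngram:continuous integration", "__idf__",
    #        idf_by_ngram["continuous integration"])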


Questions

  1. How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA, but the examples always talk about a set of documents and a vocabulary, and I'm not sure where the tags in my dataset come in. Also, given the way my data is stored (Redis + sparse matrix), it is difficult to use an already-implemented module (sklearn, nltk, etc.) for this task. (A sketch of the SVD option appears after this list.)
  2. Once the feature set is reduced, the way I have planned to use it is as follows:

    • Preprocess the test data.
    • Find the unigrams and bigrams.
    • For the n-grams stored in Redis, find the corresponding best-k tags.
    • Apply some kind of weighting to the title and body text (a toy version of these two steps is sketched after this question).
    • Apart from this, I might also search for exact known tag matches in the document. For example, if "ruby-on-rails" occurs in the title/body, there is a high probability that it is also a relevant tag.
    • Also, for tags predicted with high probability, I might leverage a tag graph (an undirected graph with weighted edges between tags that frequently occur together) to predict more tags.

    Are there any suggestions on how to improve upon this? Can a classifier come in handy?
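
On question 1, to make it concrete: scikit-learn's TruncatedSVD works directly on a sparse csr_matrix (PCA would require densifying it), so the matrix already produced by fit_transform() could be reduced as-is; with plain SVD/LSA the tags do not enter the decomposition at all, they would only come in later as the labels of whatever classifier is trained on the reduced vectors. The kind of usage I have in mind is roughly (X is the document × feature matrix, and the component count is an arbitrary guess):

    from sklearn.decomposition import TruncatedSVD

    # X: the document x feature csr_matrix produced by fit_transform()
    svd = TruncatedSVD(n_components=300, random_state=42)  # 300 is a guess to tune
    X_reduced = svd.fit_transform(X)                        # dense (n_documents, 300) array

    # how much of the variance the kept components explain
    print(svd.explained_variance_ratio_.sum())

And a toy version of the best-k lookup with a title/body weight, as I currently picture it (the title weight and the idf multiplication are placeholders to tune, and the Redis layout is the one sketched earlier):

    from collections import Counter

    import redis

    r = redis.StrictRedis(decode_responses=True)   # str keys/values instead of bytes

    TITLE_WEIGHT = 2.0   # placeholder: title matches count more than body matches

    def candidate_tags(title_grams, body_grams, k=5):
        """Score tags by summing idf-weighted votes from every known n-gram."""
        scores = Counter()
        for grams, weight in ((title_grams, TITLE_WEIGHT), (body_grams, 1.0)):
            for gram in grams:
                entry = r.hgetall("ngram:" + gram)     # {tag: count, ..., "__idf__": idf}
                if not entry:
                    continue
                idf = float(entry.pop("__idf__", 1.0))
                for tag, count in entry.items():
                    scores[tag] += weight * idf * int(count)
        return [tag for tag, _ in scores.most_common(k)]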


Footnote

I have a 16-core machine with 16GB of RAM. The Redis store (which I'll move to a different machine) lives in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using IPython clusters.

asked Jan 31 '15 by vinayakshukl


1 Answer

Use the public API of Dandelion; this is a demo.
It extracts concepts from text, so, to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.
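
Roughly, calling it from Python looks like this; the endpoint, parameter names and response fields below are my recollection of Dandelion's public docs and should be double-checked against the current API reference, and the token is a placeholder.

    import requests

    DANDELION_TOKEN = "YOUR_API_TOKEN"   # placeholder: register on dandelion.eu to get one

    def extract_concepts(text):
        """Return the concepts (Wikipedia-style entities) Dandelion finds in `text`."""
        resp = requests.get(
            "https://api.dandelion.eu/datatxt/nex/v1/",
            params={"text": text, "token": DANDELION_TOKEN, "min_confidence": 0.6},
        )
        resp.raise_for_status()
        # each annotation carries the matched span ("spot") and the concept ("title")
        return [ann.get("title") for ann in resp.json().get("annotations", [])]

    # Concepts such as "Continuous integration" or "Jenkins (software)" can then be
    # used as a much smaller feature vocabulary than raw unigrams/bigrams.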

answered Oct 05 '22 by Antonio Ercole De Luca