
Feature space reduction for tag prediction

I am writing an ML module (Python) to predict tags for a Stack Overflow question (title + body). My corpus is around 5 million questions, each with a title, body and tags. I'm splitting this 3:2 for training and testing, and I'm plagued by the curse of dimensionality.


Work Done

  1. Pre-processing: markup removal, stopword removal, special-character removal and a few other clean-up steps, with the results stored in MySQL. This almost halves the size of the data.
  2. n-gram association: for each unigram and bigram in the title and body of each question, I maintain a list of the associated tags, stored in Redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies (a sketch of this step follows the note below). Ex.

    "continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
    

Note: There are two problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an n-gram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be removed haphazardly. The number of tags associated with a given n-gram easily runs into the thousands, most of them unrelated and irrelevant.
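
For concreteness, a simplified sketch of what the association step (2.) looks like; the "ngram:<term>" key layout and the tokenising regex here are illustrative choices rather than the exact ones used.

    import re

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def ngrams(tokens, n):
        """Yield n-grams as space-joined strings."""
        return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def index_question(title, body, tags):
        """For every unigram/bigram in the question, bump the count of each of its tags."""
        tokens = re.findall(r"[a-z0-9+#.\-]+", (title + " " + body).lower())
        for n in (1, 2):
            for gram in set(ngrams(tokens, n)):   # set(): count each gram once per question
                for tag in tags:
                    # one Redis hash per n-gram: field = tag, value = co-occurrence count
                    r.hincrby("ngram:" + gram, tag, 1)

    # Usage:
    # index_question("Continuous integration on Windows",
    #                "How do I set up Jenkins ...",
    #                ["ci", "jenkins", "windows"])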

  3. tfidf: to aid in selecting which n-grams to keep, I calculated the tfidf score over the entire corpus for each unigram and bigram and stored the corresponding idf values alongside the associated tags. Ex.

    "continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
    

    The tfidf scores are stored in a document × feature scipy.sparse csr_matrix (generated by fit_transform()), and I'm not sure how to leverage it at the moment.
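
For reference, a simplified sketch of how the idf side of this can be produced with scikit-learn; `documents` is assumed to be an iterable of the cleaned title + body strings, and the Redis line at the end mirrors the "__idf__" field shown above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # documents: iterable of cleaned "title + body" strings (assumed to exist)
    vectorizer = TfidfVectorizer(ngram_range=(1, 2),   # unigrams and bigrams
                                 min_df=5,             # drop very rare n-grams up front
                                 sublinear_tf=True)
    X = vectorizer.fit_transform(documents)            # document x feature csr_matrix

    # per-feature idf values, aligned with the vectorizer's vocabulary
    # (use get_feature_names() on older scikit-learn versions)
    idf_by_ngram = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # e.g. stored next to the tag counts in Redis:
    # r.hset("ngram:continuous integration", "__idf__",
    #        idf_by_ngram["continuous integration"])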


Questions

  1. How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA, but the examples always talk about a set of documents and a vocabulary, and I'm not sure where the tags in my dataset come in. Also, given the way my data is stored (Redis + sparse matrix), it is difficult to use an already-implemented module (sklearn, nltk, etc.) for this task. (A sketch of the SVD option appears after this list.)
  2. Once the feature set is reduced, the way I have planned to use it is as follows:

    • Preprocess the test data.
    • Find the unigrams and bigrams.
    • For the n-grams stored in Redis, find the corresponding best-k tags.
    • Apply some kind of weighting to the title and body text (a toy version of these two steps is sketched after this question).
    • Apart from this, I might also search for exact known tag matches in the document. For example, if "ruby-on-rails" occurs in the title/body, there is a high probability that it is also a relevant tag.
    • Also, for tags predicted with high probability, I might leverage a tag graph (an undirected graph with weighted edges between tags that frequently occur together) to predict more tags.

    Are there any suggestions on how to improve upon this? Can a classifier come in handy?
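
On question 1, to make it concrete: scikit-learn's TruncatedSVD works directly on a sparse csr_matrix (PCA would require densifying it), so the matrix already produced by fit_transform() could be reduced as-is; with plain SVD/LSA the tags do not enter the decomposition at all, they would only come in later as the labels of whatever classifier is trained on the reduced vectors. The kind of usage I have in mind is roughly (X is the document × feature matrix, and the component count is an arbitrary guess):

    from sklearn.decomposition import TruncatedSVD

    # X: the document x feature csr_matrix produced by fit_transform()
    svd = TruncatedSVD(n_components=300, random_state=42)  # 300 is a guess to tune
    X_reduced = svd.fit_transform(X)                        # dense (n_documents, 300) array

    # how much of the variance the kept components explain
    print(svd.explained_variance_ratio_.sum())

And a toy version of the best-k lookup with a title/body weight, as I currently picture it (the title weight and the idf multiplication are placeholders to tune, and the Redis layout is the one sketched earlier):

    from collections import Counter

    import redis

    r = redis.StrictRedis(decode_responses=True)   # str keys/values instead of bytes

    TITLE_WEIGHT = 2.0   # placeholder: title matches count more than body matches

    def candidate_tags(title_grams, body_grams, k=5):
        """Score tags by summing idf-weighted votes from every known n-gram."""
        scores = Counter()
        for grams, weight in ((title_grams, TITLE_WEIGHT), (body_grams, 1.0)):
            for gram in grams:
                entry = r.hgetall("ngram:" + gram)     # {tag: count, ..., "__idf__": idf}
                if not entry:
                    continue
                idf = float(entry.pop("__idf__", 1.0))
                for tag, count in entry.items():
                    scores[tag] += weight * idf * int(count)
        return [tag for tag, _ in scores.most_common(k)]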


Footnote

I have a 16-core machine with 16GB of RAM. The Redis store (which I'll move to a different machine) lives in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using IPython clusters.

asked Jan 31 '15 by vinayakshukl


1 Answer

Use the public API of Dandelion; this is a demo.
It extracts concepts from text, so, to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.
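
Roughly, calling it from Python looks like this; the endpoint, parameter names and response fields below are my recollection of Dandelion's public docs and should be double-checked against the current API reference, and the token is a placeholder.

    import requests

    DANDELION_TOKEN = "YOUR_API_TOKEN"   # placeholder: register on dandelion.eu to get one

    def extract_concepts(text):
        """Return the concepts (Wikipedia-style entities) Dandelion finds in `text`."""
        resp = requests.get(
            "https://api.dandelion.eu/datatxt/nex/v1/",
            params={"text": text, "token": DANDELION_TOKEN, "min_confidence": 0.6},
        )
        resp.raise_for_status()
        # each annotation carries the matched span ("spot") and the concept ("title")
        return [ann.get("title") for ann in resp.json().get("annotations", [])]

    # Concepts such as "Continuous integration" or "Jenkins (software)" can then be
    # used as a much smaller feature vocabulary than raw unigrams/bigrams.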

answered Oct 05 '22 by Antonio Ercole De Luca