
Using scikit-learn vectorizers and vocabularies with gensim

I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first of all, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third, even though topic modelling with gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.

Similar questions have been asked before, especially here and here, and the bridging solution is gensim's Sparse2Corpus class (in gensim.matutils), which converts a SciPy sparse matrix into a gensim corpus object.

However, this conversion does not make use of the vocabulary_ attribute of sklearn vectorizers, which maps words to feature ids. A mapping in the opposite direction is needed in order to print the most discriminative words for each topic (id2word in gensim topic models, described as "a mapping from word ids (integers) to words (strings)").

I am aware of the fact that gensim's Dictionary objects are much more complex (and slower to compute) than scikit's vect.vocabulary_ (a simple Python dict)...

Any ideas on how to use vect.vocabulary_ as id2word in gensim models?

Some example code:

# our data
documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']

from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# corpus_vect is a scipy sparse matrix with one row per document
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}

import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# I instead would like something like this line below
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']
emiguevara asked Feb 04 '14 12:02


3 Answers

Gensim doesn't require Dictionary objects. You can use your plain dict as input to id2word directly, as long as it maps ids (integers) to words (strings).

In fact anything dict-like will do (including dict, Dictionary, SqliteDict...).

(Btw, gensim's Dictionary is a simple Python dict underneath. I'm not sure where your remarks on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Maybe you're confusing it with text preprocessing (not part of gensim), which can indeed be slow.)
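
To make the "ids to words" requirement concrete, here is a minimal sketch in plain Python. The helper name looks_like_id2word and the three-word vocabulary are made up for illustration; neither is part of gensim or scikit-learn.

```python
from numbers import Integral


def looks_like_id2word(mapping):
    """Return True if every key is an integer id and every value is a string.

    Hypothetical helper for illustration only; gensim does not ship it.
    """
    return all(
        isinstance(key, Integral) and isinstance(value, str)
        for key, value in mapping.items()
    )


# A scikit-learn style vocabulary maps words to feature ids (toy data).
vocabulary = {"graph": 0, "minors": 1, "survey": 2}

# id2word expects the opposite direction: feature ids to words.
id2word = {idx: word for word, idx in vocabulary.items()}

print(looks_like_id2word(vocabulary))  # False
print(looks_like_id2word(id2word))     # True
```

So vect.vocabulary_ itself has the wrong direction for id2word, but its inverse is a plain dict of the expected shape.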

Radim answered Oct 14 '22 17:10


Just to provide a final example: the output of scikit-learn's vectorizers can be transformed into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping keys and values:

# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# invert the scikit-learn vocabulary (word -> id) into an id -> word dict for gensim's id2word
vocabulary_gensim = {val: key for key, val in vect.vocabulary_.items()}
emiguevara answered Oct 14 '22 17:10


I am also running some code experiments using these two. Apparently there's now a way to construct the dictionary from a corpus:

from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
                                    id2word={idx: word for word, idx in vect.vocabulary_.items()})

Then you can use this dictionary for tfidf, LSI or LDA models.

Jeffrey04 answered Oct 14 '22 18:10