
Gensim Dictionary Implementation

I was just curious about the gensim dictionary implementation. I have the following code:

    from gensim import corpora

    def build_dictionary(documents):
        dictionary = corpora.Dictionary(documents)
        dictionary.save('/tmp/deerwester.dict')  # store the dictionary
        return dictionary

and I looked inside the file deerwester.dict and it looks like this:

8002 6367 656e 7369 6d2e 636f 7270 6f72
612e 6469 6374 696f 6e61 7279 0a44 6963
7469 6f6e 6172 790a 7101 2981 7102 7d71
0328 5508 6e75 6d5f 646f 6373 7104 4b09
5508 ...

The following code, however,

    my_dict = corpora.Dictionary.load('/tmp/deerwester.dict')
    print(my_dict.token2id)  # view dictionary

yields this:

{'minors': 30, 'generation': 22, 'testing': 16, 'iv': 29, 'engineering': 15, 'computer': 2, 'relation': 20, 'human': 3, 'measurement': 18, 'unordered': 25, 'binary': 21, 'abc': 0, 'ordering': 31, 'graph': 26, 'system': 10, 'machine': 6, 'quasi': 32, 'random': 23, 'paths': 28, 'error': 17, 'trees': 24, 'lab': 5, 'applications': 1, 'management': 14, 'user': 12, 'interface': 4, 'intersection': 27, 'response': 8, 'perceived': 19, 'widths': 34, 'well': 33, 'eps': 13, 'survey': 9, 'time': 11, 'opinion': 7}

So my question is, since I don't see the actual words inside the .dict file, what are all of the hexadecimal values stored there? Is this some kind of super compressed format? I'm curious because I feel like if it is, I should consider using it from now on.

asked Aug 12 '13 09:08 by dmil



1 Answer

Given the example:

>>> from gensim import corpora
>>> docs = ["this is a foo bar", "you are a foo"]
>>> texts = [doc.lower().split() for doc in docs]
>>> print(texts)
[['this', 'is', 'a', 'foo', 'bar'], ['you', 'are', 'a', 'foo']]

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('foobar.txtdic')

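If you had used gensim.corpora.Dictionary.save_as_text() instead of save() (see https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/dictionary.py), you would have got a plain-text file like the one below, with one line per token giving its id, the token itself, and its document frequency: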

0   a     2
5   are   1
1   bar   1
2   foo   2
3   is    1
4   this  1
6   you   1
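
The listing above can be read back with a few lines of plain Python. This is only a sketch: it assumes the three columns are separated by tabs (the page shows them as spaces), the variable names are made up for illustration, and it is not gensim's own loader. Some gensim versions may also write an extra first line holding the number of documents, which this sketch does not handle.

```python
# Contents of the text file shown above, with the columns assumed to be
# tab-separated: token id, token, document frequency.
listing = """0\ta\t2
5\tare\t1
1\tbar\t1
2\tfoo\t2
3\tis\t1
4\tthis\t1
6\tyou\t1"""

token2id = {}  # token -> integer id
doc_freq = {}  # token -> number of documents containing the token

for line in listing.splitlines():
    token_id, token, freq = line.split("\t")
    token2id[token] = int(token_id)
    doc_freq[token] = int(freq)

print(token2id["foo"])  # 2
print(doc_freq["a"])    # 2
```

In practice the matching gensim call for reading such a file is Dictionary.load_from_text(); the loop here is only meant to show what the three columns mean.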

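If you use the default gensim.corpora.Dictionary.save(), the dictionary is written as a pickled binary file (see class SaveLoad(object) in https://github.com/piskvorky/gensim/blob/develop/gensim/utils.py). That is what you are looking at: the hex values are the raw bytes of a Python pickle, not a compressed format. Decoded as ASCII, the first bytes of your dump read gensim.corpora.dictionary and Dictionary, so text is stored in the file as-is.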

For information on pickle, see http://docs.python.org/2/library/pickle.html#pickle-example
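
To check that the hex dump in the question is an ordinary pickle, its bytes can be decoded with the standard library alone. A minimal sketch follows; it copies the first four lines of the dump from the question, and the variable names are illustrative only.

```python
# First four lines of the hex dump quoted in the question.
hex_dump = (
    "8002 6367 656e 7369 6d2e 636f 7270 6f72 "
    "612e 6469 6374 696f 6e61 7279 0a44 6963 "
    "7469 6f6e 6172 790a 7101 2981 7102 7d71 "
    "0328 5508 6e75 6d5f 646f 6373 7104 4b09 "
)

# Strip the spaces and turn the hex digits back into raw bytes.
raw = bytes.fromhex(hex_dump.replace(" ", ""))

# A pickle written with protocol 2 or later starts with the PROTO opcode
# (0x80) followed by the protocol number.
assert raw[0] == 0x80
protocol = raw[1]

# Next comes the GLOBAL opcode (b"c") and two newline-terminated names:
# the module and the class of the pickled object.
assert raw[2:3] == b"c"
module, cls = raw[3:].split(b"\n")[:2]

print(protocol)                                     # 2
print(module.decode("ascii"), cls.decode("ascii"))  # gensim.corpora.dictionary Dictionary
print(b"num_docs" in raw)                           # True
```

The attribute name num_docs appears in the bytes verbatim, which shows that the file stores strings uncompressed. For a full opcode-by-opcode listing of a complete file, pickletools.dis() from the standard library can be used.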

answered Sep 26 '22 08:09 by alvas