Retrieve string version of document by ID in Gensim

Tags: python, gensim

I am using Gensim for some topic modelling and have reached the point where I am running similarity queries with the LSI and tf-idf models. I get back a set of IDs and similarities, e.g. (299501, 0.64505910873413086).

How do I get the text document that is related to the ID, in this case 299501?

I have looked at the docs for corpus, dictionary, index, and the model and cannot seem to find it.

jisaw asked Feb 12 '15 22:02


1 Answer

Sadly, as far as I can tell, you have to know from the very start of the analysis that you'll want to retrieve documents by ID. That means creating your own mapping between IDs and the original documents, and making sure the IDs gensim uses are preserved throughout the process. As it stands, I don't think gensim keeps such a mapping handy.

I could definitely be wrong, and in fact I'd love it if someone told me there is an easier way, but I spent many hours trying, to no avail, to avoid re-running a gigantic LSI model on a Wikipedia corpus. Eventually I had to carry along a list of IDs and the associated documents so I could use gensim's output.
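For illustration, here is a minimal sketch of the "carry along a list" approach in plain Python. It assumes that the IDs in the query results are simply the positions of the documents in the corpus that was indexed, so a list kept in the same order can serve as the mapping. The names `raw_documents`, `id_to_text`, and `resolve_hits` are hypothetical, and the `hits` list is made-up stand-in data rather than real gensim output.

```python
# Sketch only: keep the original texts in the same order as the corpus
# that gets indexed, so a result ID can be used as a list position.

raw_documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors: a survey",
]

# Position i in this list is treated as document ID i (assumption).
id_to_text = list(raw_documents)

# Tokenise in the same pass so the order cannot drift. In a gensim
# workflow these token lists would feed the dictionary and bag-of-words
# corpus; that step is omitted here.
tokenised = [doc.lower().split() for doc in raw_documents]


def resolve_hits(hits, id_to_text):
    """Attach the original text to each (doc_id, similarity) pair."""
    return [(doc_id, score, id_to_text[doc_id]) for doc_id, score in hits]


# Stand-in for similarity query output: (doc_id, similarity) tuples.
hits = [(2, 0.645), (0, 0.12)]

for doc_id, score, text in resolve_hits(hits, id_to_text):
    print(doc_id, score, text)
```

For a corpus too large to hold in memory, the same idea should work with the texts stored on disk and looked up by position, as long as the order matches the order in which documents were fed to the model.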

Jason answered Oct 20 '22 16:10