I am using Gensim for some topic modelling and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back a set of IDs and similarities, e.g. (299501, 0.64505910873413086).
How do I get the text document that is related to the ID, in this case 299501?
I have looked at the docs for corpus, dictionary, index, and the model and cannot seem to find it.
Sadly, as far as I can tell, you have to start from the very beginning of the analysis knowing that you'll want to retrieve documents by their ids. This means you need to create your own mapping between ids and the original documents, and make sure the ids gensim uses are preserved throughout the process. As is, I don't think gensim keeps such a mapping handy.
I could definitely be wrong, and in fact I'd love it if someone tells me there is an easier way, but I spent many hours trying to avoid re-running a gigantic LSI model on a Wikipedia corpus, to no avail. Eventually I had to carry along a list of ids and the associated documents so I could use gensim's output.
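For what it's worth, here is a rough sketch of the kind of bookkeeping I mean. It assumes the id in each (id, similarity) pair is the zero-based position of the document in the corpus that was indexed, so it only works if you keep the documents in exactly that order. The sample documents, the `attach_documents` helper, and the hard-coded `sims` list are all made up for illustration; in practice `sims` would come from your own similarity query.

```python
# Hypothetical example data: the original texts, kept in the exact order
# in which they were fed into the corpus that was indexed.
documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors: a survey",
]

# Assumption: each id in the query output is the zero-based position of
# the document in the indexed corpus, so enumerate() reproduces the ids.
id_to_document = dict(enumerate(documents))


def attach_documents(sims, id_to_document):
    """Pair each (doc_id, similarity) result with its original text."""
    return [(doc_id, score, id_to_document[doc_id]) for doc_id, score in sims]


# Stand-in for the (id, similarity) pairs returned by a similarity query.
sims = [(2, 0.645), (0, 0.121)]

results = attach_documents(sims, id_to_document)
for doc_id, score, text in results:
    print(doc_id, score, text)
```

For a corpus too large to hold in memory, the same idea should work with the dict swapped for something on disk (a database table or a file of offsets), as long as the ids stay aligned with the order of the indexed corpus.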