I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.
So, I want to get maybe the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results first) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.
As for the proposed solutions, speed is not important, since this would be a static analysis. I would put the emphasis on simplicity of implementation, because I am not very skilled with Lucene and can't wrap my mind around some of its concepts.
I cannot find any code samples for anything similar, so any concrete advice (code, pseudocode, links to code samples...) is appreciated!
Thank you!
A very simple way would be to use Luke. On the 'Overview' tab, there is a 'Show top terms' button that can be used for what you need.
Have a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html
The class on that page has a computeTopTermQuery method, which you should be able to retrofit easily for going over multiple indexes.
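Whichever way you get the per-index term counts out of Lucene, the merging step the question asks about (top N terms overall, plus their per-index counts) can be done in plain Java. Below is a minimal sketch under some assumptions: the class name TopTermsMerger, its methods, and the sample data are hypothetical, and each inner map stands in for the term counts you have already read from one index. It does not call any Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges per-index term counts and picks the top N terms.
public class TopTermsMerger {

    // Sums each term's count over all indexes.
    // perIndex maps an index name to (term -> count in that index).
    public static Map<String, Integer> totals(Map<String, Map<String, Integer>> perIndex) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> counts : perIndex.values()) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    // Returns up to n terms, highest total count first; ties are broken alphabetically.
    public static List<String> topTerms(Map<String, Map<String, Integer>> perIndex, int n) {
        final Map<String, Integer> totals = totals(perIndex);
        List<String> terms = new ArrayList<>(totals.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(totals.get(b), totals.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        // Made-up sample counts standing in for two real indexes.
        Map<String, Map<String, Integer>> perIndex = new LinkedHashMap<>();
        Map<String, Integer> indexA = new HashMap<>();
        indexA.put("lucene", 5);
        indexA.put("index", 3);
        Map<String, Integer> indexB = new HashMap<>();
        indexB.put("index", 4);
        indexB.put("term", 1);
        perIndex.put("indexA", indexA);
        perIndex.put("indexB", indexB);

        // Print each top term followed by its count in every index (0 if absent).
        for (String term : topTerms(perIndex, 30)) {
            StringBuilder line = new StringBuilder(term);
            for (Map.Entry<String, Map<String, Integer>> e : perIndex.entrySet()) {
                line.append(' ').append(e.getKey()).append('=')
                    .append(e.getValue().getOrDefault(term, 0));
            }
            System.out.println(line);
        }
    }
}
```

Ranking by the summed count is only one possible choice; if the indexes differ a lot in size, you may prefer to take the top terms per index first and compare those lists instead.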