
Get highest frequency terms from Lucene index

I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.

So, I want to get maybe the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results first) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.

Speed is not important for the proposed solutions, since I would be doing static analysis. I would put the emphasis on simplicity of implementation, because I'm not very skilled with Lucene and can't wrap my mind around some of its concepts.

I cannot find any code samples for something similar, so I would appreciate any concrete advice (code, pseudocode, links to code samples...).

Thank you!

Julia asked May 12 '10 19:05



2 Answers

A very simple way would be to use Luke. On the 'Overview' tab, there is a 'Show top terms' button that can be used for what you need.

Pascal Dimassimo answered Sep 22 '22 21:09



Have a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html

The class on that page has a computeTopTermQuery method, which you should be able to adapt fairly easily to go over multiple indexes.
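For the "multiple indexes" part, here is a minimal sketch of the merging step only, in plain Java with no Lucene calls. It assumes you have already read a term-to-frequency map out of each index (for example by walking the index's term enumeration, whose API differs between Lucene versions). The names `TopTerms`, `TermCount` and `topTerms` are made up for this illustration and are not taken from the linked article or from Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: merges per-index term counts
// and keeps the n terms with the highest total frequency.
class TopTerms {

    // One result row: the term, its summed count, and its count in each index.
    static final class TermCount {
        final String term;
        final long total;
        final Map<String, Long> perIndex;

        TermCount(String term, long total, Map<String, Long> perIndex) {
            this.term = term;
            this.total = total;
            this.perIndex = perIndex;
        }
    }

    // countsByIndex maps an index name to (term -> frequency) for that index.
    // How those maps are filled from Lucene is assumed to happen elsewhere.
    static List<TermCount> topTerms(Map<String, Map<String, Long>> countsByIndex, int n) {
        // Regroup the input as term -> (index name -> frequency).
        Map<String, Map<String, Long>> perTerm = new HashMap<>();
        for (Map.Entry<String, Map<String, Long>> index : countsByIndex.entrySet()) {
            for (Map.Entry<String, Long> tc : index.getValue().entrySet()) {
                perTerm.computeIfAbsent(tc.getKey(), k -> new LinkedHashMap<>())
                       .put(index.getKey(), tc.getValue());
            }
        }

        // Sum the per-index counts to get each term's total.
        List<TermCount> all = new ArrayList<>();
        for (Map.Entry<String, Map<String, Long>> e : perTerm.entrySet()) {
            long total = 0;
            for (long c : e.getValue().values()) {
                total += c;
            }
            all.add(new TermCount(e.getKey(), total, e.getValue()));
        }

        // Highest total first; ties are broken alphabetically for stable output.
        all.sort((a, b) -> {
            int byTotal = Long.compare(b.total, a.total);
            return byTotal != 0 ? byTotal : a.term.compareTo(b.term);
        });

        int limit = Math.min(Math.max(n, 0), all.size());
        return new ArrayList<>(all.subList(0, limit));
    }
}
```

Calling `TopTerms.topTerms(countsByIndex, 30)` would give the top 30 terms, each still carrying its per-index counts. Sorting everything is the simplest option and should be fine here, since speed is not a concern.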

mindas answered Sep 25 '22 21:09
