
Get term frequencies in Lucene

Is there a fast and easy way of getting term frequencies from a Lucene index, without doing it through the TermVectorFrequencies class, since that takes an awful lot of time for large collections?

What I mean is, is there something like TermEnum which has not just the document frequency but term frequency as well?

UPDATE: Using TermDocs is way too slow.

Ilija asked Mar 20 '09 18:03



3 Answers

Use TermDocs to get the term frequency for a given document. As with the document frequency, you obtain it from an IndexReader: pass the term of interest to termDocs(Term) and read the frequency for each document it returns.


You won't find a faster method than TermDocs without losing some generality. TermDocs reads directly from the ".frq" file in an index segment, where each term frequency is listed in document order.

If that's "too slow", make sure that you've optimized your index to merge multiple segments into a single segment. Iterate over the documents in order (skips are alright, but you can't jump back and forth in the document list efficiently).

Your next step might be additional processing to create an even more specialized file structure that leaves out the SkipData. Personally, I would look for a better algorithm to achieve my objective, or provide better hardware: lots of memory, either to hold a RAMDirectory or to leave to the OS for its own file cache.
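
To make the "iterate in document order" advice concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: PostingsCursor is a hypothetical stand-in that only mimics the next()/doc()/freq()/skipTo() shape of TermDocs over in-memory arrays, and freqsFor is an illustrative helper I made up. The point is that the cursor only ever moves forward, so the document ids you ask about should be sorted first.

```java
/** Hypothetical stand-in for one term's postings; not a Lucene class. */
final class PostingsCursor {
    private final int[] docs;   // document ids in ascending order
    private final int[] freqs;  // freqs[i] = occurrences of the term in docs[i]
    private int pos = -1;       // current posting; -1 before the first next()

    PostingsCursor(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have equal length");
        }
        this.docs = docs.clone();
        this.freqs = freqs.clone();
    }

    /** Moves to the next posting; returns false once the list is exhausted. */
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    /** Moves forward to the first posting with doc id >= target; never goes back. */
    boolean skipTo(int target) {
        do {
            if (!next()) {
                return false;
            }
        } while (docs[pos] < target);
        return true;
    }

    int doc() { return docs[pos]; }

    int freq() { return freqs[pos]; }

    /** Looks up the term frequency for each requested doc id (must be ascending). */
    static int[] freqsFor(PostingsCursor cursor, int[] sortedDocIds) {
        int[] result = new int[sortedDocIds.length];
        boolean positioned = false; // true while the cursor sits on a valid posting
        boolean exhausted = false;  // true once the postings have run out
        for (int i = 0; i < sortedDocIds.length; i++) {
            int wanted = sortedDocIds[i];
            if (!exhausted && (!positioned || cursor.doc() < wanted)) {
                positioned = cursor.skipTo(wanted);
                exhausted = !positioned;
            }
            if (positioned && cursor.doc() == wanted) {
                result[i] = cursor.freq(); // term occurs in this document
            }
            // otherwise the term is absent from the document and the entry stays 0
        }
        return result;
    }
}
```

Each requested document costs at most one forward skip, and the whole pass reads the postings once. Asking for the same ids in random order would mean reopening the cursor for every backward jump, which is the inefficiency the answer warns about.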

erickson answered Nov 19 '22 00:11



The trunk version of Lucene (to be 4.0, eventually) now exposes the totalTermFreq() for each term from the TermsEnum. This is the total number of times this term appeared in all content (but, like docFreq, does not take into account deletions).
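
The caveat about deletions can be illustrated with a small sketch in plain Java. This is not Lucene code: TermStatsSketch and liveTermFreq are hypothetical names, and the assumption is simply that per-term statistics are computed once when the postings are written, while deletions are tracked separately as a set of doc ids.

```java
import java.util.BitSet;

/** Hypothetical model of stored per-term statistics; not a Lucene class. */
final class TermStatsSketch {
    private final int[] docs;         // doc ids containing the term
    private final int[] freqs;        // freqs[i] = occurrences in docs[i]
    private final int docFreq;        // fixed when the postings are written
    private final long totalTermFreq; // fixed when the postings are written

    TermStatsSketch(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have equal length");
        }
        this.docs = docs.clone();
        this.freqs = freqs.clone();
        long sum = 0;
        for (int f : freqs) {
            sum += f;
        }
        this.docFreq = docs.length;
        this.totalTermFreq = sum;
    }

    /** Stored statistic: ignores any later deletions. */
    int docFreq() { return docFreq; }

    /** Stored statistic: ignores any later deletions. */
    long totalTermFreq() { return totalTermFreq; }

    /** Recomputes the total by walking the postings and skipping deleted docs. */
    long liveTermFreq(BitSet deletedDocs) {
        long sum = 0;
        for (int i = 0; i < docs.length; i++) {
            if (!deletedDocs.get(docs[i])) {
                sum += freqs[i];
            }
        }
        return sum;
    }
}
```

Reading the stored number is a constant-time lookup, which is why it is fast. If you need a figure that excludes deleted documents, you are back to walking the postings.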

Michael McCandless answered Nov 18 '22 22:11



TermDocs gives the TF of a given term in each document that contains the term. You can get the DF by iterating through the <document, frequency> pairs and counting them, although reading docFreq() from a TermEnum should be faster. IndexReader has a termDocs(Term) method that returns a TermDocs for the given Term in that index.
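
As a rough illustration of how the two numbers relate, the sketch below builds a toy inverted index in plain Java and derives both from the <document, frequency> pairs: counting the pairs gives the DF, summing the frequencies gives the collection-wide term frequency the question asks about. Everything here (TermFrequencySketch, the whitespace tokenizer, the method names) is made up for the example and is not part of Lucene.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index used only to illustrate DF versus total TF. */
final class TermFrequencySketch {

    /** Builds term -> (docId -> frequency), with terms and doc ids kept sorted. */
    static TreeMap<String, TreeMap<Integer, Integer>> index(List<String> documents) {
        TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            // Naive tokenizer: lower-case and split on whitespace.
            for (String token : documents.get(docId).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(token, t -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }
        return postings;
    }

    /** DF: the number of <document, frequency> pairs for the term. */
    static int documentFrequency(Map<Integer, Integer> termPostings) {
        return termPostings.size();
    }

    /** Total TF: the sum of the per-document frequencies for the term. */
    static long collectionFrequency(Map<Integer, Integer> termPostings) {
        long total = 0;
        for (int freq : termPostings.values()) {
            total += freq;
        }
        return total;
    }
}
```

Iterating over the outer map's entries visits every term in sorted order, so one pass yields both statistics for the whole vocabulary. The cost is proportional to the total number of pairs, which is consistent with the complaint that this gets slow on large collections.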

Kai Chan answered Nov 18 '22 23:11
