Lucene 4.10.2: calculate tf-idf for all terms in index

I would like to calculate the term frequency and the inverse document frequency (tf-idf) for all terms in the index.

I couldn't find any example of how to do it with the latest Lucene (4.x.x).

Could you help me?

asked Dec 14 '25 13:12 by tommy


1 Answer

To iterate through the terms in the index, you'll want to use Fields and Terms. The TermsEnum you get from Terms exposes docFreq() for your idf calculation, and IndexReader itself exposes numDocs(). You can use DefaultSimilarity.idf to perform the calculation for you, rather than rolling your own.

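import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// reader is assumed to be an already-open IndexReader
DefaultSimilarity similarity = new DefaultSimilarity();
int numDocs = reader.numDocs();
Fields fields = MultiFields.getFields(reader); // may be null if the index has no fields
for (String field : fields) {
    Terms terms = fields.terms(field);
    if (terms == null) continue; // field has no indexed terms
    TermsEnum termsEnum = terms.iterator(null);
    while (termsEnum.next() != null) {
        // docFreq() = number of documents containing the current term
        float idf = similarity.idf(termsEnum.docFreq(), numDocs);
        System.out.println(field + ":" + termsEnum.term().utf8ToString() + " idf=" + idf);
    }
}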
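
If you do want to roll your own, or just check the numbers, the formula documented for DefaultSimilarity.idf in Lucene 4.x is 1 + ln(numDocs / (docFreq + 1)). The following is a minimal plain-Java sketch of that formula; the class name IdfSketch is hypothetical and not part of Lucene.

```java
// Hypothetical helper class, not part of Lucene.
final class IdfSketch {

    // Same shape as the documented DefaultSimilarity.idf formula:
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 10; // made-up index size for illustration
        for (long docFreq : new long[] {1, 4, 9}) {
            System.out.println("docFreq=" + docFreq + " idf=" + idf(docFreq, numDocs));
        }
    }
}
```

Rarer terms (lower docFreq) get a higher idf, and the +1 in the denominator keeps the division safe when docFreq is 0.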

tf is only defined for a term with respect to a specific document, so I'm not quite sure what you are looking for there.
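
To make the per-document nature of tf concrete, here is a small plain-Java sketch that counts term frequencies inside a single document and combines them with an idf value. It does not touch the Lucene API: TfSketch, termFreqs and tfIdf are hypothetical names, the token array stands in for one analyzed document, and the idf value is a made-up placeholder. The square root mirrors what DefaultSimilarity.tf does in Lucene 4.x; the product tf * idf is the textbook weight, not Lucene's full scoring formula.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Lucene.
final class TfSketch {

    // Raw count of each token within ONE document.
    static Map<String, Integer> termFreqs(String[] tokens) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (String token : tokens) {
            Integer old = freqs.get(token);
            freqs.put(token, old == null ? 1 : old + 1);
        }
        return freqs;
    }

    // Dampen the raw count with a square root, as DefaultSimilarity.tf does.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Textbook tf-idf weight for one (term, document) pair.
    static float tfIdf(int freq, float idf) {
        return tf(freq) * idf;
    }

    public static void main(String[] args) {
        // Stand-in for the analyzed tokens of a single document.
        String[] doc = {"lucene", "index", "lucene", "term", "lucene", "lucene"};
        float idf = 2.0f; // placeholder; in practice computed per term from the index
        for (Map.Entry<String, Integer> e : termFreqs(doc).entrySet()) {
            System.out.println(e.getKey() + " freq=" + e.getValue()
                    + " tf-idf=" + tfIdf(e.getValue(), idf));
        }
    }
}
```

The same term gets a different tf in every document it appears in, which is why a single tf-idf value per term does not exist for the index as a whole.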

answered Dec 17 '25 13:12 by femtoRgon


