I would like to calculate the term frequency and the inverse document frequency (tf-idf) for all terms in the index.
I couldn't find any example of how to do it with the latest Lucene (4.x.x).
Could you help me?
To iterate through the terms in the index, you'll want to use Fields and Terms. TermsEnum exposes docFreq() for your idf calculation, and IndexReader itself exposes numDocs(). You can use DefaultSimilarity.idf to perform the calculation for you, rather than rolling your own.
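import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// reader is an already-open IndexReader
DefaultSimilarity similarity = new DefaultSimilarity();
int docnum = reader.numDocs();
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    Terms terms = fields.terms(field);
    if (terms == null) continue; // field has no indexed terms
    TermsEnum termsEnum = terms.iterator(null);
    while (termsEnum.next() != null) {
        // docFreq() is the number of documents containing the current term
        double idf = similarity.idf(termsEnum.docFreq(), docnum);
        System.out.println(field + ":" + termsEnum.term().utf8ToString() + " idf=" + idf);
    }
}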
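If you do want to see what the idf call works out to, here is a minimal plain-Java sketch. It assumes the Lucene 4.x DefaultSimilarity formula, 1 + ln(numDocs / (docFreq + 1)); the class name IdfSketch is made up for illustration and is not part of Lucene.

```java
// Plain-Java sketch; IdfSketch is a hypothetical name, not a Lucene class.
public class IdfSketch {
    // Assumed to mirror DefaultSimilarity.idf in Lucene 4.x:
    // 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // Rarer terms (smaller docFreq) get a larger idf.
        for (long docFreq : new long[] {1, 10, 100, 999}) {
            System.out.println("docFreq=" + docFreq + " idf=" + idf(docFreq, numDocs));
        }
    }
}
```

A term that appears in nearly every document ends up with an idf close to 1, while a rare term scores noticeably higher.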
tf is only meaningful for a term with respect to a specific document, so I'm not quite sure what you are looking for there.
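To illustrate why tf needs a document, here is a small plain-Java sketch that counts term occurrences per document in a toy in-memory corpus. The class TfSketch, its methods, and the sample tokens are all hypothetical, and the square-root weighting is an assumption based on what DefaultSimilarity.tf does in Lucene 4.x.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration; TfSketch and its methods are hypothetical names, not Lucene API.
public class TfSketch {
    // Count how often each token occurs within ONE document.
    static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Assumed to mirror DefaultSimilarity.tf in Lucene 4.x: sqrt of the raw count.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Two toy "documents", already tokenized.
        String[][] docs = {
            {"lucene", "index", "lucene", "term"},
            {"term", "frequency"}
        };
        for (int docId = 0; docId < docs.length; docId++) {
            for (Map.Entry<String, Integer> e : termFrequencies(docs[docId]).entrySet()) {
                System.out.println("doc=" + docId + " term=" + e.getKey()
                        + " freq=" + e.getValue() + " tf=" + tf(e.getValue()));
            }
        }
    }
}
```

The same term gets a different tf in each document, which is why there is no single index-wide value. Against a real Lucene 4.x index, the per-document count should be available from DocsEnum.freq() on the DocsEnum returned by TermsEnum.docs(...), and multiplying the resulting tf by the term's idf gives a tf-idf weight for that term in that document.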