This is my program to calculate TF-IDF value for a document in a collection of documents. This is working fine, but takes lot of time when calculating the "IDF" values (finding the no of documents which contains particular term).
Is there a more efficient way of finding the no of documents which contains a particular term?
freq = termsFreq.getTermFrequencies();
terms = termsFreq.getTerms();
int noOfTerms = terms.length;
score = new float[noOfTerms];
DefaultSimilarity simi = new DefaultSimilarity();
for (i = 0; i < noOfTerms; i++) {
int noofDocsContainTerm = noOfDocsContainTerm(terms[i]);
float tf = simi.tf(freq[i]);
float idf = simi.idf(noofDocsContainTerm, noOfDocs);
score[i] = tf * idf ;
}
////
public int noOfDocsContainTerm(String querystr) throws CorruptIndexException, IOException, ParseException{
QueryParser qp=new QueryParser(Version.LUCENE_35, "docuemnt", new StandardAnalyzer(Version.LUCENE_35));
Query q=qp.parse(querystr);
int hitsPerPage = docNames.length; //minumum number or search results
IndexSearcher searcher = new IndexSearcher(ramMemDir, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
return hits.length;
}
If you have a term and want its document frequency, i.e. the number of documents that contain this term: Call IndexReader.termEnum(Term) method. It gives you a TermEnum object. Then, call TermEnum.docFreq() method. It gives you the term's document frequency in the index.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With