How to count term frequency for a set of documents?

Tags:

lucene

I have a Lucene index with the following documents:

doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }

So these 5 documents use 14 different terms:

[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]

the frequency of each term:

[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]

for easy reading:

[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]

What I want to know now is how to obtain the term frequency vector for a set of documents.

for example:

Set<Document> docs := [ doc2, doc3 ]

termFrequencies = magicFunction(docs);

System.out.println( termFrequencies );

would result in the output:

[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]

With all zeros removed:

[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]

Notice that the result vector contains only the term frequencies of the given set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.

A naive implementation would be to just iterate over all documents in the docs set, create a map and count each term. But I need a solution that also works with a document set size of 100,000 or 500,000.
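For reference, the naive approach described above could be sketched roughly as follows. This is plain Java, not Lucene code: each document is modeled as a list of its terms, and the class and method names (`NaiveTermCount`, `termFrequencies`) are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NaiveTermCount {

    // Counts every occurrence of each term across the given documents.
    // Each document is modeled as a plain list of terms (an assumption;
    // a real Lucene document would have to be read from the index first).
    static Map<String, Integer> termFrequencies(List<List<String>> docs) {
        // LinkedHashMap keeps terms in first-seen order for readable output.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        // prints {gallente=1, dodixie=1, armor=2, planet=2, amarr=1, laser=1}
        System.out.println(termFrequencies(Arrays.asList(doc2, doc3)));
    }
}
```

The work grows with the total number of terms in the set, which is why this becomes a concern at several hundred thousand documents per query.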

Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that could be built at index time to obtain such a term vector easily and quickly?

I'm not much of a Lucene expert, so I'm sorry if the solution is obvious or trivial.

Maybe worth mentioning: the solution should be fast enough for a web application, where it is applied to client search queries.


asked May 27 '10 19:05

ManBugra

1 Answer

Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:

org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);

You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (that has deletes :-)).

I believe there is a similar method for Lucene 2.x.
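To get from per-document vectors to a count for the whole set, the vectors still have to be summed. In Lucene 3.x a `TermFreqVector` exposes parallel arrays through `getTerms()` and `getTermFrequencies()`, and as far as I know term vectors are only returned for fields that were indexed with term vectors stored. The sketch below deliberately does not call Lucene, so that it runs on its own: `DocTermVector` is a hypothetical stand-in with the same parallel-array shape, and `merge` is an assumed helper name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorMerge {

    // Hypothetical stand-in for one document's term vector: parallel arrays
    // of terms and their frequencies in that document. Not a Lucene class.
    static final class DocTermVector {
        final String[] terms;
        final int[] freqs;

        DocTermVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    // Sums the per-document frequencies into one map for the whole set.
    static Map<String, Integer> merge(List<DocTermVector> vectors) {
        // TreeMap keeps the terms sorted alphabetically.
        Map<String, Integer> total = new TreeMap<>();
        for (DocTermVector v : vectors) {
            for (int i = 0; i < v.terms.length; i++) {
                total.merge(v.terms[i], v.freqs[i], Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        DocTermVector doc2 = new DocTermVector(
                new String[] {"armor", "dodixie", "gallente", "planet"},
                new int[] {1, 1, 1, 1});
        DocTermVector doc3 = new DocTermVector(
                new String[] {"amarr", "armor", "laser", "planet"},
                new int[] {1, 1, 1, 1});
        // prints {amarr=1, armor=2, dodixie=1, gallente=1, laser=1, planet=2}
        System.out.println(merge(Arrays.asList(doc2, doc3)));
    }
}
```

With real Lucene code the loop would run over the document ids of the set and read each document's vectors from the `IndexReader`, so the cost still scales with the size of the set.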


answered Sep 29 '22 19:09

Mihai Toader
