
How to count term frequency for a set of documents?

Tags: java, lucene

I have a Lucene index with the following documents:

doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }

So these 5 documents use 14 different terms:

[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]

The frequency of each term across the whole index:

[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]

For easier reading:

[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]

What I want to know is how to obtain the term frequency vector for a set of documents.

For example:

Set<Documents> docs := [ doc2, doc3 ]

termFrequencies = magicFunction(docs); 

System.out.println( termFrequencies );

would result in the output:

[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]

or, with all zeros removed:

[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]

Notice that the result vector contains only the term frequencies within the given set of documents, NOT the overall frequencies across the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.

A naive implementation would be to iterate over all documents in the docs set, create a map, and count each term. But I need a solution that also works with a document set of 100,000 or 500,000 documents.
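For reference, the naive approach described above might look like the following sketch in plain Java. It assumes each document is already available as a list of its terms; the class and method names (NaiveTermCounter, countTerms) are made up for illustration and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveTermCounter {

    // Counts how often each term occurs within the given documents only.
    // Terms that occur in none of these documents never enter the map,
    // so no zero entries have to be removed afterwards.
    static Map<String, Integer> countTerms(List<List<String>> docs) {
        // LinkedHashMap keeps terms in first-seen order for readable output.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        // prints {gallente=1, dodixie=1, armor=2, planet=2, amarr=1, laser=1}
        System.out.println(countTerms(Arrays.asList(doc2, doc3)));
    }
}
```

The work grows with the total number of terms in the selected documents, which is the scaling concern for sets of several hundred thousand documents.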

Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that could be built at index time to obtain such a term vector easily and quickly?

I'm not much of a Lucene expert, so I'm sorry if the solution is obvious or trivial.

It may be worth mentioning that the solution should be fast enough for a web application, where it is applied to client search queries.

ManBugra asked May 27 '10 19:05



1 Answer

See the Lucene 3.0.1 API documentation at http://lucene.apache.org/java/3_0_1/api/core/index.html and look at this method:

org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);

You will have to know the document ID. This is an internal Lucene ID, and it usually changes on every index update (one that includes deletes :-)).

I believe there is a similar method in Lucene 2.x.
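To show how per-document term vectors could be combined into the vector for a document set, here is a rough sketch. In Lucene 3.0, getTermFreqVectors returns a TermFreqVector[] (or null if the document has no fields indexed with term vectors), and each TermFreqVector exposes parallel arrays through getTerms() and getTermFrequencies(). So that the sketch runs without Lucene on the classpath, DocVector and VectorSource below are hypothetical stand-ins for that data and for the reader lookup; they are assumptions for illustration, not Lucene classes.

```java
import java.util.HashMap;
import java.util.Map;

class TermVectorMerger {

    // Hypothetical stand-in for one document's stored term vector:
    // terms[i] occurs freqs[i] times in that document.
    static final class DocVector {
        final String[] terms;
        final int[] freqs;

        DocVector(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    // Hypothetical lookup by internal document ID. With Lucene 3.0 an
    // implementation would wrap IndexReader.getTermFreqVectors(docId).
    interface VectorSource {
        DocVector vectorFor(int docId);
    }

    // Sums the per-document frequencies over the given document IDs only.
    static Map<String, Integer> merge(VectorSource source, int[] docIds) {
        Map<String, Integer> totals = new HashMap<>();
        for (int docId : docIds) {
            DocVector vector = source.vectorFor(docId);
            if (vector == null) {
                continue; // no term vector stored for this document
            }
            for (int i = 0; i < vector.terms.length; i++) {
                totals.merge(vector.terms[i], vector.freqs[i], Integer::sum);
            }
        }
        return totals;
    }
}
```

This still visits every document in the set, so the cost grows with the set size; it only avoids re-analyzing the document text, and it requires term vectors to have been stored at index time.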

Mihai Toader answered Sep 29 '22 19:09