I have a Lucene index with the following documents:
doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }
So these 5 documents use 14 different terms:
[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]
The frequency of each term across the whole index:
[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]
Or, for easier reading:
[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]
What I want to know is how to obtain the term frequency vector for a set of documents.
For example:
Set<Document> docs := [ doc2, doc3 ]
termFrequencies = magicFunction(docs);
System.out.println( termFrequencies );
would result in the output:
[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1,
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]
With all zeros removed:
[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]
Notice that the result vector contains only the term frequencies of the set of documents, NOT the overall frequencies of the whole index! The term 'planet' occurs 4 times in the whole index, but the source set of documents contains it only 2 times.
A naive implementation would be to iterate over all documents in the docs set, create a map, and count each term.
But I need a solution that would also work with a document set size of 100,000 or 500,000.
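For reference, the naive approach described above might look like the following sketch. It is plain Java with no Lucene dependency: documents are modeled as simple lists of terms, and the class name NaiveTermCounter and the method countTerms are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: documents are modeled as plain lists of terms,
// not as Lucene Document objects.
final class NaiveTermCounter {

    // Counts how often each term occurs within the given documents only,
    // ignoring every other document in the index.
    static Map<String, Integer> countTerms(List<List<String>> docs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        // prints {gallente=1, dodixie=1, armor=2, planet=2, amarr=1, laser=1}
        System.out.println(countTerms(Arrays.asList(doc2, doc3)));
    }
}
```

The work grows with the total number of terms in the selected documents, which is exactly why this becomes a concern at several hundred thousand documents.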
Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that one could build at index time to obtain such a term vector easily and quickly?
I'm not much of a Lucene expert, so I'm sorry if the solution is obvious or trivial.
Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.
See http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:
org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);
You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update that involves deletes :-).
I believe there is a similar method for Lucene 2.x.x.
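Since that method returns vectors per document, the per-document results still have to be summed to get the frequencies for a set of documents. Below is a minimal sketch of that summing step, assuming each document's vector is available as two parallel arrays. In Lucene 3.x these would come from TermFreqVector.getTerms() and TermFreqVector.getTermFrequencies(), and only for fields that were indexed with term vectors enabled. The class TermVectorMerger and its methods are hypothetical and use plain Java, with no Lucene dependency, so the arrays in main are hand-written stand-ins.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: sums per-document term vectors into one frequency map.
final class TermVectorMerger {
    private final Map<String, Integer> totals = new TreeMap<>();

    // Adds one document's vector: terms[i] occurs freqs[i] times in that document.
    void add(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have equal length");
        }
        for (int i = 0; i < terms.length; i++) {
            totals.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    // Returns the summed frequencies, sorted by term.
    Map<String, Integer> totals() {
        return totals;
    }

    public static void main(String[] args) {
        TermVectorMerger merger = new TermVectorMerger();
        // Stand-ins for the vectors of doc2 and doc3 from the question.
        merger.add(new String[] {"armor", "dodixie", "gallente", "planet"}, new int[] {1, 1, 1, 1});
        merger.add(new String[] {"amarr", "armor", "laser", "planet"}, new int[] {1, 1, 1, 1});
        // prints {amarr=1, armor=2, dodixie=1, gallente=1, laser=1, planet=2}
        System.out.println(merger.totals());
    }
}
```

Note that this still visits every document in the set, so the cost grows with the set size.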