If I write an algorithm that performs a search using Lucene, how can I state its computational complexity? I know that Lucene uses tf*idf scoring, but I don't know how it is implemented. I've found that tf*idf has the following complexity:
O(|D|+|T|)
where D is the set of documents and T the set of all terms.
However, I need someone to check whether this is correct and explain why.
Thank you
Lucene basically uses a Vector Space Model (VSM) with a tf-idf scheme. So, in the standard setting, we determine the K documents of the collection with the highest vector space scores on the query q. Typically, we seek these top K documents ordered by score in decreasing order; for instance, many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
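As a minimal illustration of that top-K selection step (assuming Python; the document scores below are made up), picking the K highest-scoring documents in decreasing order can be done with a heap:

```python
import heapq

# Hypothetical document scores (doc id -> vector space score).
scores = {"d1": 0.12, "d2": 0.87, "d3": 0.45, "d4": 0.66}

K = 2  # e.g. K = 10 for the first page of results

# Top K documents ordered by score in decreasing order.
top_k = heapq.nlargest(K, scores.items(), key=lambda item: item[1])
print(top_k)  # [('d2', 0.87), ('d4', 0.66)]
```

`heapq.nlargest` runs in O(N log K), which is cheaper than fully sorting all N scores when K is small.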
The basic algorithm for computing vector space scores is:
float Scores[N] = 0
Initialize Length[N]
for each query term t
  do calculate w(t,q) and fetch postings list for t (stored in the index)
  for each pair (d, tf(t,d)) in postings list
    do Scores[d] += wf(t,d) × w(t,q)   (dot product)
Read the array Length[d]
for each d
  do Scores[d] = Scores[d] / Length[d]
return Top K components of Scores[]
where:

- Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents.
- tf(t,d) is the term frequency of term t in document d.
- w(t,q) is the weight of term t in the submitted query. Note that the query is treated as a bag of words, so its vector of weights can be computed as if it were another document.
- wf(t,d) is the logarithmic term weighting for a term in the query or document.

As described in "Complexity of vector dot-product", a vector dot product is O(n). Here the dimension is the number of terms in our vocabulary, |T|, where T is the set of terms.
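The pseudocode above can be sketched in Python roughly as follows. The in-memory postings structure, the idf-style query weight, and all names here are assumptions for illustration, not Lucene's actual implementation:

```python
import heapq
import math
from collections import defaultdict

def cosine_score(postings, lengths, query_terms, n_docs, k=10):
    """Vector space scoring following the pseudocode above.

    postings: term -> list of (doc_id, tf) pairs (the postings lists)
    lengths:  doc_id -> normalization factor (e.g. document vector length)
    """
    scores = defaultdict(float)          # float Scores[N] = 0
    for t in query_terms:                # for each query term t
        plist = postings.get(t, [])      # fetch postings list for t
        df = len(plist)                  # document frequency of t
        if df == 0:
            continue
        w_tq = math.log(n_docs / df)     # idf-style query weight w(t,q) (assumed form)
        for d, tf in plist:              # for each pair (d, tf(t,d))
            wf_td = 1 + math.log(tf)     # logarithmic term weighting wf(t,d)
            scores[d] += wf_td * w_tq    # accumulate the dot product
    for d in scores:                     # normalize by document length
        scores[d] /= lengths[d]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Toy corpus: 3 documents, 2 terms.
postings = {"lucene": [(0, 3), (2, 1)], "search": [(0, 1), (1, 2), (2, 2)]}
lengths = {0: 2.0, 1: 1.5, 2: 1.8}
top = cosine_score(postings, lengths, ["lucene", "search"], n_docs=3, k=2)
print(top)
```

Note that, as in the pseudocode, each document is touched once per query term that occurs in it, which is where the per-term postings traversal in the complexity argument below comes from.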
So, the time complexity of this algorithm is:
O(|Q| · |D| · |T|) = O(|D| · |T|)

since we consider |Q| fixed, where Q is the set of words in the query (whose size is small; on average a query has between 2 and 3 terms) and D is the set of all documents.
However, for a search, these sets are bounded and indexes don't tend to grow very often. As a result, searches using VSM are really fast. (When T and D are large, the search becomes slow and one has to find an alternative approach.)