I am new to IR techniques.
I am looking for a Java-based API or tool that does the following.
Kindly let me know how Lucene can be helpful to me.
Regards Yuvi
idf(t, D) = log(N / n), where N is the number of documents in the data set and n is the number of documents in the data set that contain the term t. Finally, TF-IDF is calculated as the product of the above two values.
The TF-IDF of a term is calculated by multiplying its TF and IDF scores. In plain English, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document, measured by TF, is balanced by rarity across documents, measured by IDF.
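As a rough illustration of the TF times IDF product described above, here is a minimal plain-Java sketch that does not use Lucene. It makes several assumptions that are not in the thread: documents are already tokenized into lists of strings, TF is taken as the term count divided by the document length, the logarithm is natural, and IDF is set to 0 for a term that appears in no document. The class and method names (TfIdfSketch, tf, idf, tfIdf) are made up for the example.

```java
import java.util.List;

final class TfIdfSketch {

    /** Term frequency: occurrences of the term divided by document length (assumed definition). */
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** idf(t, D) = log(N / n). Returns 0 when no document contains the term (a choice made here). */
    static double idf(String term, List<List<String>> docs) {
        long n = docs.stream().filter(d -> d.contains(term)).count();
        if (n == 0) {
            return 0.0;
        }
        return Math.log((double) docs.size() / n);
    }

    /** TF-IDF of one term in one document of the collection. */
    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        return tf(term, doc) * idf(term, docs);
    }

    public static void main(String[] args) {
        // Toy corpus, invented for this example.
        List<List<String>> docs = List.of(
                List.of("lucene", "is", "a", "search", "library"),
                List.of("solr", "is", "built", "on", "lucene"),
                List.of("tf", "idf", "is", "a", "weighting", "scheme"));

        // Score a few terms against the second document.
        for (String term : List.of("is", "lucene", "solr")) {
            System.out.printf("%s -> %.4f%n", term, tfIdf(term, docs.get(1), docs));
        }
    }
}
```

With this definition, a term that occurs in every document (such as "is" in the toy corpus) gets an IDF of log(1) = 0, so its TF-IDF is 0 no matter how often it appears in a single document.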
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False ), where n is the total number of documents in the document set and df(t) is the document frequency of t; the ...
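The variant quoted above, idf(t) = log [ n / df(t) ] + 1, can be sketched in Java as follows. This is only an illustration under stated assumptions: tf(t, d) is taken as the raw count of the term in the document, the logarithm is natural, documents are pre-tokenized lists of strings, and a term with df(t) = 0 is scored 0 here to avoid dividing by zero. The class name UnsmoothedTfIdf and its methods are hypothetical and are not part of any library.

```java
import java.util.Collections;
import java.util.List;

final class UnsmoothedTfIdf {

    /** idf(t) = log(n / df(t)) + 1, with n the number of documents and df the document frequency. */
    static double idf(int n, int df) {
        return Math.log((double) n / df) + 1.0;
    }

    /** tf-idf(t, d) = tf(t, d) * idf(t), with tf taken as the raw term count (assumption). */
    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        int tf = Collections.frequency(doc, term);
        int df = 0;
        for (List<String> d : docs) {
            if (d.contains(term)) {
                df++;
            }
        }
        if (df == 0) {
            return 0.0; // term unseen in the collection; choice made for this sketch
        }
        return tf * idf(docs.size(), df);
    }

    public static void main(String[] args) {
        // Toy corpus, invented for this example.
        List<List<String>> docs = List.of(
                List.of("coke", "is", "a", "drink"),
                List.of("coke", "zero", "is", "a", "coke", "variant"));
        System.out.println(tfIdf("coke", docs.get(1), docs));
        System.out.println(tfIdf("zero", docs.get(1), docs));
    }
}
```

Because of the added 1, a term that occurs in every document has an IDF of 1 rather than 0, so it still contributes its raw count to the score instead of being dropped entirely.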
TF-IDF is used by search engines to better understand the content that is undervalued. For example, when you search for “Coke” on Google, Google may use TF-IDF to figure out if a page titled “COKE” is about: a) Coca-Cola. b) Cocaine.
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than to the whole document. If you just want the TF-IDF of each term in a document, maybe use this method, without ever touching Lucene. If you want to create a search engine, you need to do a bit more (such as extracting text from the given URLs, whose corresponding documents would probably not contain raw text). If this is the case, consider using Solr.