This is more of a theory question than a practical one. I'm working on a project which is quite a simple catalog of links. The whole model is similar to the Dmoz or Yahoo catalog, except that each entry has certain additional attributes.
I have a hierarchical taxonomy working across all entries with a many-to-many relationship. All entries are now sorted into these categories, and everything seems to work fine. Now, what use is a catalog if there's no search option?
Here's a little bit more detail about my models: Each entry has a title, description, URL and several social profiles: YouTube, Twitter, Flickr and a couple of others. Each entry could have a logo attached to it, and a hidden field for tags. Also, the title and description are stored in three different languages. So basically I'd like the search results to be:
I've tried Sphinx and am currently working with Lucene, but it seems that I'm not getting the search right in theory. I hope it makes sense that filled entries should appear higher than the others, but I can't really figure out the scores. I wouldn't like irrelevant entries to appear on top just because a single word matches somewhere in the description, since titles are more relevant.
So my question is: are there any books, techniques or even other search engines (if Sphinx and Lucene are not good enough) that you would recommend for this? Not only would I like to get full control over search results and their ranking, but I'd also like to give my visitors correct and relevant information.
Links to cool articles are appreciated too!
And no, I'm not trying to rebuild Google :)
Thanks :)
Excellent book: Lucene in Action (2nd edition)
When we started with Lucene we had the first edition; it really takes you through everything you need, step by step. Highly recommended. The 2nd edition is updated for the latest and greatest version (3.x.x).
The TF-IDF algorithm works very well on (larger) texts, but with a record-like structure it may backfire: because scores are normalized by field length, documents with only a few terms are considered more "relevant" than ones with many terms. With Lucene you can get it to work, but you'll have to get your hands dirty.
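To see why short records win, here is a minimal Python sketch of a TF-IDF score with length normalization. The formula is only loosely modeled on Lucene's classic scoring (square-root term frequency, logarithmic IDF, one over the square root of the field length); it is not Lucene's actual implementation, and the sample documents and the names `tokenize` and `tfidf_score` are made up for illustration.

```python
import math


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def tfidf_score(query_term, doc_tokens, all_docs):
    """Score a single term against a single document."""
    freq = doc_tokens.count(query_term)
    if freq == 0:
        return 0.0
    # Number of documents that contain the term at least once.
    doc_freq = sum(1 for d in all_docs if query_term in d)
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(len(all_docs) / (doc_freq + 1))
    # Length normalization: the longer the document, the smaller the factor.
    length_norm = 1.0 / math.sqrt(len(doc_tokens))
    return tf * idf * length_norm


# Hypothetical catalog entries: one nearly empty, one well filled.
docs = {
    "sparse": tokenize("photo links"),
    "filled": tokenize(
        "photo sharing community with galleries tutorials and daily photo contests"
    ),
    "other": tokenize("cooking recipes and kitchen tips"),
}
corpus = list(docs.values())
scores = {name: tfidf_score("photo", tokens, corpus) for name, tokens in docs.items()}

for name, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Even though the filled entry mentions "photo" twice, the two-word entry scores higher, because the length factor penalizes the longer text more than the extra occurrence rewards it.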
What you'll basically have to do is boost your title field, so it becomes more relevant. You may also change the scoring mechanism to assign higher scores for documents that have more information.
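As a rough illustration of both ideas, the Python sketch below weights matches per field and then multiplies by a bonus for how complete the entry is. This is a toy scorer, not Lucene code: the boost values, the `completeness_weight` parameter, the field names and the sample entries are all assumptions chosen for the example, and in Lucene you would express the same thing through field boosts and a custom scoring component.

```python
# Hypothetical weights: a title match counts five times a description match.
FIELD_BOOSTS = {"title": 5.0, "tags": 2.0, "description": 1.0}
# Fields that count towards how "filled" an entry is.
OPTIONAL_FIELDS = ["description", "logo", "youtube", "twitter", "flickr"]


def completeness(entry):
    """Fraction of optional fields that are non-empty, between 0 and 1."""
    filled = sum(1 for field in OPTIONAL_FIELDS if entry.get(field))
    return filled / len(OPTIONAL_FIELDS)


def score(entry, query, completeness_weight=0.5):
    """Boosted field matches, scaled up for more complete entries."""
    terms = query.lower().split()
    text_score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        tokens = (entry.get(field) or "").lower().split()
        matches = sum(1 for term in terms if term in tokens)
        text_score += boost * matches
    if text_score == 0.0:
        return 0.0  # completeness never rescues an entry that does not match
    return text_score * (1.0 + completeness_weight * completeness(entry))


def search(entries, query):
    """Return titles of matching entries, best score first."""
    scored = [(score(entry, query), entry) for entry in entries]
    hits = [(s, entry) for s, entry in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [entry["title"] for _, entry in hits]


ENTRIES = [
    {"title": "Photo Club"},
    {"title": "Photo Gallery", "description": "A community gallery",
     "logo": "logo.png", "twitter": "gallery", "flickr": "gallery"},
    {"title": "Cooking Tips", "description": "Includes one photo per recipe",
     "logo": "logo.png", "youtube": "tips", "twitter": "tips", "flickr": "tips"},
    {"title": "Gardening"},
]

results = search(ENTRIES, "photo")
print(results)
```

Making the completeness bonus a multiplier rather than an additive term is a design choice: it lets a filled entry outrank a bare one with the same match, without letting a fully filled entry that matches only in its description climb above a title match.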
Have fun. If you can't figure it out, there is excellent support on the Lucene mailing list.