Searching Techniques Recommendations

This is more of a theory question than a practical one. I'm working on a project that is a fairly simple catalog of links. The whole model is similar to the Dmoz or Yahoo catalog, except that each entry has certain additional attributes.

I have a hierarchical taxonomy with a many-to-many relationship to entries. All entries are now sorted into these categories, and everything seems to work fine. Now, what use is a catalog if there's no search option?

Here's a little bit more detail about my models: Each entry has a title, description, URL and several social profiles: YouTube, Twitter, Flickr and a couple of others. Each entry could have a logo attached to it, and a hidden field for tags. Also, the title and description are stored in three different languages. So basically I'd like the search results to be:

Relevant (including taxonomy)
Possibly ones with logos
Possibly ones with 100% filled out profiles

I've tried Sphinx and am currently working with Lucene, but it seems that I'm not getting the search right in theory. I hope it makes sense that filled-out entries should appear higher than the others, but I can't really figure out the scores. I wouldn't like irrelevant entries to appear on top just because a single word matches somewhere in the description, since titles are more relevant.

So my question is: are there any books, techniques or even other search engines (if Sphinx and Lucene are not good enough) that you would recommend for this? Not only would I like full control over search results and their ranking, but I'd also like to give my visitors correct and relevant information.

Links to cool articles are appreciated too!

And no, I'm not trying to rebuild Google :)

Thanks :)

asked Oct 29 '10 08:10

kovshenin

1 Answer

Excellent book: Lucene in Action (2nd edition)

When we started with Lucene we had the first edition, and it really takes you through everything you need, step by step. Highly recommended. The 2nd edition is updated for the latest and greatest version (3.x.x).

The TF-IDF algorithm works very well on (larger) texts, but with a record-like structure it may backfire: because scores are normalized by field length, documents with only a few terms are considered more "relevant" than ones with many terms. With Lucene you can get it to work, but you'll have to get your hands dirty.
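To make the length effect concrete, here is a toy sketch in plain Python. It is not Lucene code; the formula is only loosely modelled on a classic TF-IDF score with length normalization, and the function name `tfidf_score` and the sample documents are made up for illustration.

```python
import math

def tfidf_score(query_term, doc_tokens, all_docs):
    """Toy TF-IDF score with length normalization (illustrative only)."""
    freq = doc_tokens.count(query_term)
    if freq == 0:
        return 0.0
    # Number of documents that contain the term at least once.
    df = sum(1 for d in all_docs if query_term in d)
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(len(all_docs) / (df + 1))
    # Longer documents are penalized: this is what hurts "rich" records.
    length_norm = 1.0 / math.sqrt(len(doc_tokens))
    return tf * idf * length_norm

# Hypothetical catalog descriptions, already tokenized.
sparse = "photo sharing".split()
rich = "photo sharing community with galleries tutorials and weekly contests".split()
docs = [sparse, rich, "news portal".split()]

sparse_score = tfidf_score("photo", sparse, docs)
rich_score = tfidf_score("photo", rich, docs)
print(sparse_score > rich_score)
```

Both descriptions mention "photo" exactly once, yet the two-word one scores higher purely because it is shorter. For a catalog where fuller entries should win, that is the opposite of what you want.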

What you'll basically have to do is boost your title field so that matches in it count for more than matches elsewhere. You may also change the scoring mechanism to assign higher scores to documents that have more information.

Have fun. If you can't figure it out, there is excellent support on the Lucene mailing list.
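As a rough illustration of both ideas, the sketch below combines per-field boosts with a "completeness" multiplier. It is plain Python rather than Lucene; the field names, the boost values, the `completeness_weight` parameter and the sample entries are all assumptions chosen for the example, not recommended settings.

```python
# Assumed weights: a title match counts three times as much as a description match.
FIELD_BOOSTS = {"title": 3.0, "tags": 2.0, "description": 1.0}
# Assumed set of optional fields that make an entry "complete".
PROFILE_FIELDS = ("logo", "youtube", "twitter", "flickr")

def text_score(query, entry):
    """Sum of term matches per field, weighted by that field's boost."""
    terms = query.lower().split()
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        tokens = entry.get(field, "").lower().split()
        score += boost * sum(tokens.count(t) for t in terms)
    return score

def completeness(entry):
    """Fraction of optional profile fields that are filled in (0.0 to 1.0)."""
    filled = sum(1 for f in PROFILE_FIELDS if entry.get(f))
    return filled / len(PROFILE_FIELDS)

def final_score(query, entry, completeness_weight=0.5):
    # Multiplicative, so an entry with no text match still scores 0.
    return text_score(query, entry) * (1.0 + completeness_weight * completeness(entry))

# Hypothetical entries.
entries = [
    {"title": "City News", "description": "local stories and one photo each day"},
    {"title": "Photo Spot", "description": "pictures"},
    {"title": "Photo Hub", "description": "a place for pictures",
     "logo": "hub.png", "twitter": "@hub"},
]

ranked = sorted(entries, key=lambda e: final_score("photo", e), reverse=True)
print([e["title"] for e in ranked])
```

Applying completeness as a multiplier rather than adding it means a well-filled but irrelevant entry cannot climb above a relevant one with no matching text at all. In Lucene the same effect would be achieved through field boosts and a custom scoring component rather than code like this.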

answered Sep 22 '22 04:09

Matthijs Bierman

Tags: search, full-text-search, lucene, search-engine, sphinx
