Reverse Search Best Practices?

I'm building an app that needs reverse searches. By this I mean that users of the app will enter search parameters and save them; then, whenever a new object is entered into the system, if it matches a user's saved search parameters, that user is sent a notification, etc.

I am having a hard time finding solutions for this type of problem.

I am using Django, and I am thinking of building the searches with Q objects and pickling them, as outlined here: http://www.djangozen.com/blog/the-power-of-q

The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and run each one against this single new object to see whether it matches... This doesn't seem ideal - has anyone tackled such a problem before?
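For concreteness, here's a minimal sketch of what I have in mind (Item, SavedSearch, and the field names are hypothetical):

    import pickle
    from django.db.models import Q

    from myapp.models import Item, SavedSearch  # hypothetical models

    # Saving a search: Q objects can be pickled and stored in a binary/text field.
    def save_search(owner, q):
        SavedSearch.objects.create(owner=owner, pickled_query=pickle.dumps(q))

    # e.g. save_search(request.user, Q(title__icontains="django") & Q(price__lt=100))

    # Matching one new object: pin each saved query to the new row's pk,
    # so the database only ever examines that single record.
    def matching_searches(new_item):
        for saved in SavedSearch.objects.all():
            q = pickle.loads(saved.pickled_query)
            if Item.objects.filter(q, pk=new_item.pk).exists():
                yield saved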

edub asked Mar 12 '10 08:03


2 Answers

At the database level, many databases offer triggers, which fire automatically when rows are inserted or updated.

Another approach is to have timed jobs that periodically fetch all items from the database whose last-modified date falls after the last run; these then get filtered and alerts issued. You can perhaps push some of the filtering into the database query itself. However, this is a bit trickier if notifications also need to be sent when items are deleted.
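A rough sketch of such a job, assuming a hypothetical Item model with a 'modified' timestamp and a notification hook:

    from datetime import datetime

    from myapp.models import Item  # hypothetical model with a 'modified' field

    last_run = datetime.utcnow()  # in practice, persist this between runs

    def poll_and_notify():
        global last_run
        cutoff, last_run = last_run, datetime.utcnow()
        for item in Item.objects.filter(modified__gte=cutoff):
            for saved in matching_searches(item):  # helper sketched in the question
                notify(saved.owner, item)          # hypothetical notification hook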

You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
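In Django, the natural place for such an application-level trigger is a post_save signal handler (again, Item and notify() are hypothetical):

    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import Item  # hypothetical model

    @receiver(post_save, sender=Item)
    def on_new_item(sender, instance, created, **kwargs):
        if created:
            for saved in matching_searches(instance):  # helper sketched in the question
                notify(saved.owner, instance)          # hypothetical notification hook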

A nice way for the triggers and the alerts to communicate is through message queues - brokers such as RabbitMQ and other AMQP implementations will scale with your site.
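A sketch of that hand-off with the pika AMQP client: the save-side trigger publishes only the new object's id, and a separate worker consumes ids and evaluates the saved searches (queue name and payload shape are illustrative):

    import json
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="new_items", durable=True)

    # Producer: called from the save trigger.
    def publish_new_item(item_id):
        channel.basic_publish(exchange="", routing_key="new_items",
                              body=json.dumps({"item_id": item_id}))

    # Consumer: a separate worker process runs the saved searches.
    def on_message(ch, method, properties, body):
        item_id = json.loads(body)["item_id"]
        # ... load the item, run the saved searches, send alerts ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="new_items", on_message_callback=on_message)
    channel.start_consuming()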

Will answered Sep 28 '22 06:09


The amount of effort you should put into solving this problem is directly related to the number of stored queries you are dealing with.

Over 20 years ago we handled stored queries by treating them as minidocs and indexing them on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", which built a list of possibly interesting searches to run; only those searches were then run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
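The core of that "database of queries" is just an inverted index over the queries themselves; in illustrative Python it looks something like this:

    from collections import defaultdict

    # Inverted index: term -> ids of the stored queries that use that term.
    queries_by_term = defaultdict(set)

    def index_query(query_id, must_terms, may_terms):
        for term in set(must_terms) | set(may_terms):
            queries_by_term[term].add(query_id)

    def candidate_queries(doc_terms):
        # Treat the new doc's term list as a query against the query index.
        candidates = set()
        for term in set(doc_terms):
            candidates |= queries_by_term[term]
        return candidates  # only these get executed as full-on queries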

One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.

Update for comment:

Short answer: I don't know for sure.

Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10 million new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)

I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
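For example, keeping a high-water mark of the last id processed (names are illustrative, reusing the hypothetical Item model from above):

    last_seen_id = 0  # in practice, persist this between runs

    def items_to_check():
        global last_seen_id
        batch = list(Item.objects.filter(pk__gt=last_seen_id).order_by("pk"))
        if batch:
            last_seen_id = batch[-1].pk
        return batch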

For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.

Peter Rowell answered Sep 28 '22 07:09