Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

General strategy for complex multi-stage searches

I have an application with allows for a certain entity to be searched upon based on several different criteria (somewhere in the order of 20 different methods in total). I want to be able to combine the results of several searches in order to produce a single result set.

For example:

results = (entities from search 1 AND entities from search 2) OR (entities from search 3)

Let us assume that the searches are complex enough in nature such that combining them into a single logical query is not possible (due to complex relationships that need to be queried, etc).

Let us also assume that the number of entities involved (likely) makes any sort of in-memory strategy infeasible.

My initial thoughts were something along the lines of:

1) Perform the searches seperately, obtain a list of matching "entity ids" from each of them, and then perform a "root-level" search based upon these.

For example:

select * from entity e
where 
(e.Id in (search 1 id list) AND e.Id in(search 2 id list))
OR e.Id in (search 3 id list)

2) Perform an outer query that selects the entity based upon the results returned by my (complex) subqueries.

For example:

select * from entity e
where (e.Id in (select e1.id from entity e1 where ...) AND e.Id in (select e2.id from entity e2 where...))
OR e.Id in (select e3.id from entity e3 where...)

Obviously these examples are drastically simplified for illustration purposes; the individual queries will be much more involved, and the combination of them will be arbitrary (I've just illustrated a representative example here).

I'd be very interested in hearing suggestions for how others have handled this situation. I'm certainly open to any possibilities that I haven't explored above.

For reference, this is a .NET application making use of an NHibernate ORM backed by a SQL Server 2008 R2 database.

I've already decided upon using either hql or native sql for this as ICriteria or Linq do not provide the flexibility needed for performing the individual queries nor the combining operations required.

like image 657
DanP Avatar asked Oct 14 '22 21:10

DanP


1 Answers

I've done this by keeping search performance counters in a table. Basically monitoring the average percentage of rows that the search filters and the run time.

I then create a performance figure based on TotalNumberOfRowsToSearch * Percent_Not_Matched / RunTimeInSeconds This figure is a direct correlation of rows per second it can filter out. Averaged over thousands of runs, it is a rather good prediction.

I then run each query in order with highest performance figure one first.

If you're doing a logical AND on the total result, run each subsequent query only on the results of the previous query.

If you're doing a logical OR, run each subsequent query only on the results NOT IN the combined previous search results.

By doing it this way, your query will change based on indexes and types of data.

If you want a less dynamic solution, simply calculate performance figures for each part of the search and use the better performing ones first. Remember a query that runs in 55ms but matches 99% of the results is not as useful as one that runs in 1 second and matches 1% of the results, so be wary that results may go against your initial ideas.

Just look out for the divide by 0 error when calculating performance figures.

like image 196
John Avatar answered Oct 31 '22 13:10

John