The best way to organize a search service that can find data in a big database according to a set of filters

We have a big dataset (about 191 million records, and it will grow). Every record contains the values of 11 filters (datetime and integer values) and some additional data (cost). For example:

Departure City = 1
Arrival City = 5
Country Id = 7
Check In Date = 2013-05-05
    ... etc

Cost 1250
    ... etc

We have a search interface with 11 filters. In every filter the user can choose one value, a set of values, or all values.

Every filter has a different set of possible values; the number can vary from 4 to 5000 values.

The search result must be sorted by ascending cost, with paging (50 results per page).

Every search query must be completed within 100 ms; we usually expect 50-70 requests/sec (200 at maximum).

The data will change often, but the speed of data updates has a lower priority than search, so that process can be slow.

What is the best way to organise such a search engine? Data in memory (we tried some tree algorithms), Map-Reduce (Hadoop?), OLAP?

UPDATE: What do you think about an in-memory solution? The records could be loaded into main memory in a structure that is good for searching and sorting. Which structure is best?

In the production environment, the client will be able to supply appropriate hardware for a good solution.

In general, we have a .NET solution, so this module must be compatible with it.

Sir Hally asked Dec 01 '25


2 Answers

[TrollModeOn] I had a problem... tried to solve it with a NoSQL solution, now I have 2 problems. [/TrollModeOff]

As it seems to me, a NoSQL solution isn't a good fit for handling this much filtering. I'd start from an SQL-based solution. E.g. if we have MS SQL Server, we can use user-defined table types for the filters, something like:

CREATE TYPE [FilterTable] AS TABLE(
    [id] [int] NOT NULL   --or any datatype needed
)

After that you can pass the table type as a parameter to the filtering stored procedure (or do it with an SQL query), like:

CREATE PROCEDURE [SomeFilterProcedureName]
    @Filter1 FilterTable READONLY,
    @Filter2 FilterTable READONLY
    ....

And your query would be kind of:

SELECT
    field1,
    field2,
    field3
FROM MyTable t
WHERE
    -- an empty table-valued parameter means "no filtering on this column";
    -- TVPs can never be NULL, so they can't be tested with IS NULL
    (NOT EXISTS (SELECT 1 FROM @Filter1) OR t.field1 IN (SELECT id FROM @Filter1))
    AND (NOT EXISTS (SELECT 1 FROM @Filter2) OR t.field2 IN (SELECT id FROM @Filter2))
    ....
ORDER BY
    whatever

So basically you check whether each parameter contains any values; if it does, you filter the column according to the filter-parameter data, and an empty parameter means "all values". A .NET call example is sketched below.
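
Here is a minimal sketch of calling such a procedure from ADO.NET with a table-valued parameter. The FilterTable type and SomeFilterProcedureName come from the snippets above; the dbo schema, the connection string and the departure-city example are my assumptions:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class FilterSearch
{
    // Builds a one-column DataTable matching the FilterTable type.
    // An empty table means "all values" for that filter.
    static DataTable ToFilterTable(IEnumerable<int> ids)
    {
        var table = new DataTable();
        table.Columns.Add("id", typeof(int));
        foreach (var id in ids)
            table.Rows.Add(id);
        return table;
    }

    public static void RunSearch(string connectionString, IEnumerable<int> departureCities)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SomeFilterProcedureName", connection))
        {
            command.CommandType = CommandType.StoredProcedure;

            var filter1 = command.Parameters.AddWithValue("@Filter1", ToFilterTable(departureCities));
            filter1.SqlDbType = SqlDbType.Structured;
            filter1.TypeName = "dbo.FilterTable"; // the user-defined table type
            // ... add @Filter2 and the remaining filters the same way

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // read field1, field2, field3 ...
                }
            }
        }
    }
}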

An RDBMS does excellent work storing, finding, filtering and sorting huge amounts of data, but you'll need to tune it the right way to make it fast, e.g. you will need to set up your indexes correctly. You can also cache results for some period, but make sure you build the cache key correctly from the varying parameters (see the sketch below).
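
To illustrate the cache-key point, one possible scheme (just a sketch, the normalization and names are mine) is to sort each filter's selected values, and the filters themselves, before joining them, so the same logical selection always produces the same key regardless of input order:

using System;
using System.Collections.Generic;
using System.Linq;

static class SearchCacheKey
{
    // Builds a deterministic cache key from varying filter parameters:
    // values are sorted per filter and filters are sorted by name.
    public static string Build(IDictionary<string, IReadOnlyCollection<int>> filters, int page)
    {
        var parts = filters
            .OrderBy(f => f.Key, StringComparer.Ordinal)
            .Select(f => f.Key + ":" + string.Join(",", f.Value.OrderBy(v => v)));
        return string.Join("|", parts) + "|page:" + page;
    }
}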

If your DB server isn't good enough to handle 200 queries per second, you might want to build a cluster, or keep several DB servers with the same data and use some kind of DB load balancer.

UPD: this is too big to place in the comments.

In the worst case the user can select "All" for all 11 filters, and we have to sort 192 million records to find the 20-100 with the lowest cost.

All filters, lowest cost? Isn't that the same as: SELECT TOP(20) * FROM someTableName ORDER BY cost?

  1. DB locks. Work more on your indexes and queries.
  2. Sorting. OK, you've got 100 million records that fit the filters. How are you going to sort them? Quicksort, merge sort, bubble sort? Or maybe stackoverflowSort? Do you know which algorithm you have to choose? First, the DBMS knows, and it chooses the best algorithm for the case because it has statistics; second, the data is of course stored pre-ordered in indexes. So a 100-million-record sort operation will kill a NoSQL solution but will work fine on an RDBMS.
  3. High load. Isn't that what we're talking about? In your case it's not a real high-load scenario. There are companies with 100-150 million monthly active users, huge databases and thousands of queries per second, and yes, they use an RDBMS: dozens of servers, sharding, balancing, and it works well.
Sergio answered Dec 03 '25


An in-memory solution may be feasible. Since you need to store 12 values x 200M records, you'll need about 20 GB of RAM net (assuming 8 bytes per value). You'll need to optimize (storing 1/2/4-byte values where possible and disabling memory alignment). Practically, you'll probably need a 64 GB or stronger machine.
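
As a rough illustration of that sizing argument, here is a packed record layout in C#; the field names and widths are my assumptions based on the filters listed in the question:

using System.Runtime.InteropServices;

// With Pack = 1 there is no padding, so the record size is the sum of the
// field widths; narrow integer types are used where a filter's value range
// (4 to ~5000 distinct values) allows it.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FlightRecord
{
    public ushort DepartureCity;   // up to ~5000 values -> 2 bytes
    public ushort ArrivalCity;     // 2 bytes
    public byte   CountryId;       // small value set -> 1 byte
    public ushort CheckInDate;     // days since a fixed epoch -> 2 bytes
    // ... the remaining filters, each as narrow as its value range allows
    public int    Cost;            // 4 bytes
}

// A single FlightRecord[] of ~200M elements is one large allocation, which
// avoids per-record object overhead and GC pressure.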

One thing you can't afford is data structures that require tons of small memory allocations. Even if you store the data in a single huge buffer, you'll probably need many small allocations for tree-structured indexes.

There is another reason why trees are not so good for your problem: since the user may select a set of values for each filter, you'll need to traverse the tree for every combination. This can be a huge number of tree traversals.

How about a simpler solution? Select the 2 filters that divide the dataset into the maximal number of groups (these would probably be the filters with ~5000 values). Use a 2D array. In each non-empty cell, store an array of structs holding the remaining 10 values (9 filters + cost). You can sort these arrays by the 3rd most dominant filter.

Upon a user query, determine the relevant cells in the 2D array and check your input against each of the records in the relevant cells (which are sorted by the 3rd most dominant filter). For most cells, you'll have far fewer than 1000 records to check against.
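
A minimal sketch of that bucket idea, reusing the FlightRecord struct sketched above. Treating DepartureCity/ArrivalCity as the two most selective filters and CheckInDate as the third is an assumption, and the final sort by cost plus paging is left to the caller:

using System;
using System.Collections.Generic;
using System.Linq;

// Records are grouped by the two most selective filters; each bucket is a
// flat array sorted by the third most dominant filter. A query touches only
// the buckets matching the first two filters and scans their records.
class BucketIndex
{
    // Key = (filter1 value, filter2 value); value = that cell's records.
    private readonly Dictionary<(ushort, ushort), FlightRecord[]> _cells;

    public BucketIndex(IEnumerable<FlightRecord> records)
    {
        _cells = records
            .GroupBy(r => (r.DepartureCity, r.ArrivalCity))
            .ToDictionary(
                g => g.Key,
                g => g.OrderBy(r => r.CheckInDate).ToArray()); // 3rd-filter sort
    }

    public IEnumerable<FlightRecord> Query(
        IReadOnlyCollection<ushort> filter1Values,   // empty = "all"
        IReadOnlyCollection<ushort> filter2Values,   // empty = "all"
        Func<FlightRecord, bool> remainingFilters)
    {
        IEnumerable<ushort> keys1 = filter1Values.Count > 0
            ? filter1Values
            : _cells.Keys.Select(k => k.Item1).Distinct().ToList();
        IEnumerable<ushort> keys2 = filter2Values.Count > 0
            ? filter2Values
            : _cells.Keys.Select(k => k.Item2).Distinct().ToList();

        foreach (var k1 in keys1)
            foreach (var k2 in keys2)
                if (_cells.TryGetValue((k1, k2), out var cell))
                    foreach (var record in cell)      // a binary search on the
                        if (remainingFilters(record)) // 3rd filter could narrow
                            yield return record;      // this scan further
    }
}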

Depending on your data distribution, you can save some memory by using a sparse matrix instead of a 2D array. Some .NET sparse matrix implementations are available online.

Lior Kogan answered Dec 03 '25


