Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which DB to choose for finding best matching records?

Tags:

sql

mysql

I'm storing an object in a database described by a lot of integer attributes. The real object is a little bit more complex, but for now let's assume that I'm storing cars in my database. Each car has a lot of integer attributes to describe the car (ie. maximum speed, wheelbase, maximum power etc.) and these are searchable by the user. The user defines a preferred range for each of the objects and since there are a lot of attributes there most likely won't be any car matching all the attribute ranges. Therefore the query has to return a number of cars sorted by the best match.

At the moment I implemented this in MySQL using the following query:

SELECT *, SQRT( POW((a < min_a)*(min_a - a) + (a > max_a)*(a - max_a), 2) +
                POW((b < min_b)*(min_b - b) + (b > max_b)*(b - max_b), 2) +
                ... ) AS match
WHERE a < (min_a - max_allowable_deviation) AND a > (max_a + max_allowable_deviation) AND ...
ORDER BY match ASC

where a and b are attributes of the object and min_a, max_a, min_b and max_b are user defined values. Basically the match is the square root of the sum of the squared differences between the desired range and the real value of the attribute. A value of 0 meaning a perfect match.

The table contains a couple of million records and the WHERE clausule is only introduced to limit the number of records the calculation is performed on. An index is placed on all of the queryable records and the query takes like 500ms. I'd like to improve this number and I'm looking into ways to improve this query.

Furthermore I am wondering whether there would be a different database better suited to perform this job. Moreover I'd very much like to change to a NoSQL database, because of its more flexible data scheme options. I've been looking into MongoDB, but couldn't find a way to solve this problem efficiently (fast).

Is there any database better suited for this job than MySQL?

like image 877
Ewout Kleinsmann Avatar asked Jul 24 '11 00:07

Ewout Kleinsmann


People also ask

Which DB is best for transactions?

“Distributed SQL” is required where millions of transactions should be handled in a globally distributed database. Multi-Master clustering and Multi-Node data warehousing are required (OLAP). A multi-model database is required, i.e., one database to handle structured, semi-structured, graph, and columnar data.

How do I find matching records in SQL?

(INNER) JOIN : Returns records that have matching values in both tables. LEFT (OUTER) JOIN : Returns all records from the left table, and the matched records from the right table. RIGHT (OUTER) JOIN : Returns all records from the right table, and the matched records from the left table.

Which database is best for millions of records?

Oracle Database Oracle has provided high-quality database solutions since the 1970s. The most recent version of Oracle Database was designed to integrate with cloud-based systems, and it allows you to manage massive databases with billions of records.


1 Answers

Take a look at R-trees. (The pages on specific variants go in to a lot more detail and present pseudo code). These data structures allow you to query by a bounding rectangle, which is what your problem of searching by ranges on each attribute is.

Consider your cars as points in n-dimensional space, where n is the number of attributes that describe your car. Then given a n ranges, each describing an attribute, the problem is the find all the points contained in that n-dimensional hyperrectangle. R-trees support this query efficiently. MySQL implements R-trees for their spatial data types, but MySQL only supports two-dimensional space, which is insufficient for you. I'm not aware of any common databases that support n-dimensional R-trees off the shelf, but you can take some database with good support for user-defined tree data structures and implement R-trees yourself on top of that. For example, you can define a structure for an R-tree node in MongoDB, with child pointers. You will then implement the R-tree algorithms in your own code while letting MongoDB take care of storing the data.

Also, there's this C++ header file implementing of an R-tree, but currently it's only an in-memory structure. Though if your data set is only a few million rows, it seems feasible to just load this memory structure upon startup and update it whenever new cars are added (which I assume is infrequent).

like image 168
Kui Tang Avatar answered Nov 15 '22 03:11

Kui Tang