I have an application with a pretty large dataset (let's say cars) with around 2 million active rows of data. Each "car" has a multitude of attributes (columns) like price, mileage, year, brand, model, fuel type and so on.
Now on the /show page for each car in my web application I need to produce a list of the top 10 most "similar" cars. As I never "know" whether a car is a very common or a very rare kind of car (before actually doing the db query), I have designed a pattern where I hardly do any filtering (WHERE clauses) in the "similar cars" query. Instead I use a lot of ORDER BY clauses, combined with CASE WHEN statements based on the data of the car currently in view. Let's say a user looks at a Ford Focus, 2010, 30,000 km, Gasoline, 12,490 EUR car from around Düsseldorf. Then I would do something like:
SELECT "cars".*
FROM de."cars"
WHERE ("cars"."id" != 24352543)
AND "cars"."sales_state" = 'onsale'
AND (cars.is_disabled IS NOT TRUE)
ORDER BY
CASE WHEN ABS(cars.price - 12490) < cars.price * 0.2 THEN 1
     WHEN ABS(cars.price - 12490) < cars.price * 0.4 THEN 2
     WHEN ABS(cars.price - 12490) < cars.price * 0.6 THEN 3
     ELSE 4
END,
CASE WHEN fuel_type = 'Gasoline' THEN 0 ELSE 1 END,
ABS(cars.price - 12490),
CASE WHEN ST_Distance( ST_GeographyFromText( 'SRID=4326;POINT(' || cars.longitude || ' ' || cars.latitude || ')' ), ST_GeographyFromText('SRID=4326;POINT(12.172130 48.162990)') ) <= 30000
     THEN 1
     WHEN ST_Distance( ST_GeographyFromText( 'SRID=4326;POINT(' || cars.longitude || ' ' || cars.latitude || ')' ), ST_GeographyFromText('SRID=4326;POINT(12.172130 48.162990)') ) <= 100000
     THEN 2
     ELSE 3
END,
ABS(cars.year - 2010),
ABS(cars.km - 30000)
LIMIT 10
In reality there are even more ordering clauses.
Now this is handy because no matter how "easy" it is to find 10 "relevant" cars similar to the current car, the query is always going to return something. The trouble is: it's slow and, as far as I know, almost impossible to index. Doing this on 2 million records, even on a well-tuned, fast, dedicated PostgreSQL 11 server (300 GB RAM, 10 SSDs in RAID 10, 32 cores), still takes around 2-4 seconds, time I don't have. I need it down to < 200 ms.
I have been scratching my head over approaches to solving this, but as I'm inexperienced with problems like this at scale, I am unsure which approach would pay off best. Some of the ideas I have:
Pre-filtering (WHERE) on some column (e.g. starting by limiting the data to a subset of prices) to reduce the dataset. Then, if results are returned, great; else do another, slightly broader query, and so on and so on.

Given the possible SQL complexity and diversity (many different filter patterns) and the timings you mention (250 ms), I would force the SQL to follow a plan that is as simple and effective as possible, by breaking the filters down one at a time.
I do that by working through the (each time different) set of filters in a loop, starting from the filters I judge most important, selecting PKs and then joining those PKs into every subsequent step.
This way you have a chance of getting a good time for any combination of filters, plus you can know about 0 results pretty fast.
In more detail: first focus on the item you are searching for, which I believe is car.id. So you need a set of car.id values for an arbitrary combination of filters. Let's say you have 20 possible filters. Each filter leads to a set of car.id values. Some filters work directly on the table where car.id lives; others may need joins to 1-3 other tables. All filters together, however, may need 10-15 joins. The fewer tables joined, the better the chance of getting a good plan.
Assume you have 3 filters set, say filters 2, 7 and 14. Joining e.g. 12 tables and filtering with those 3 filters may or may not be efficient, and even if it is, another combination will not be. So what I propose is (pseudo-code):
procedure / table function get_car_ids as
  for each optional filter 1 to 20
    if the filter is set
      select car.id from cars (plus any joins the filter needs)
        where filter = filter.value
          and car.id in (car.id values found so far)
      if count(car.id) = 0 then stop and return no results
    end if
  end for
  return the car.id values collected
Optionally you can specify the order in which the filters are processed. If you know that, out of a set of 5-6 filters, at least one of them is used in 99% of searches, then putting those first will narrow the car.id values down to few or none within the first 5 selects at most.
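To make that loop concrete, here is a minimal PL/pgSQL sketch of the idea. The function name, the temp table and the two hard-coded filters (price band and fuel type) are only placeholders; a real version would cover however many of your 20 optional filters are actually set:

-- Sketch: narrow a set of car ids one filter at a time,
-- bailing out as soon as the set becomes empty.
CREATE OR REPLACE FUNCTION get_car_ids(p_price_min numeric,
                                        p_price_max numeric,
                                        p_fuel_type text)
RETURNS TABLE (car_id bigint) LANGUAGE plpgsql AS
$$
BEGIN
    DROP TABLE IF EXISTS candidate_ids;

    -- the first (hopefully most selective) filter seeds the candidate set
    CREATE TEMP TABLE candidate_ids ON COMMIT DROP AS
        SELECT id FROM de.cars
        WHERE price BETWEEN p_price_min AND p_price_max;

    IF NOT EXISTS (SELECT 1 FROM candidate_ids) THEN
        RETURN;  -- 0 results known early
    END IF;

    -- every further filter only has to look at the ids found so far
    DELETE FROM candidate_ids c
    WHERE NOT EXISTS (SELECT 1
                      FROM de.cars k
                      WHERE k.id = c.id
                        AND k.fuel_type = p_fuel_type);

    RETURN QUERY SELECT id::bigint FROM candidate_ids;
END;
$$;

In practice the filters (and whatever joins each of them needs) would be applied dynamically, e.g. with EXECUTE, in whatever order you judge most selective.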
You cannot get that exact query that fast, because it has to perform a top-N sort on all the query results, which will be slow even if you crank up work_mem.
The ORDER BY clause is not indexable as it is.
If you are a bit more flexible about your query, maybe you could try something like this:
First query:
WITH priced_cars AS (
SELECT cars.*
FROM de.cars
WHERE (cars.id != 24352543)
AND cars.sales_state = 'onsale'
AND (cars.is_disabled IS NOT TRUE)
AND cars.price BETWEEN 12490*5/6 AND 12490*5/4
)
SELECT * FROM priced_cars
ORDER BY
CASE WHEN fuel_type = 'Gasoline' THEN 0 ELSE 1 END,
ABS(price - 12490),
CASE
WHEN ST_Distance( ST_GeographyFromText( 'SRID=4326;POINT(' || longitude || ' ' || latitude || ')' ), ST_GeographyFromText('SRID=4326;POINT(12.172130 48.162990)') ) <= 30000
THEN 1
WHEN ST_Distance( ST_GeographyFromText( 'SRID=4326;POINT(' || longitude || ' ' || latitude || ')' ), ST_GeographyFromText('SRID=4326;POINT(12.172130 48.162990)') ) <= 100000
THEN 2
ELSE 3
END,
ABS(year - 2010),
ABS(km - 30000)
LIMIT 10;
This query can use an index like this:
CREATE INDEX ON de.cars (price)
WHERE sales_state = 'onsale' AND is_disabled IS NOT TRUE;
That restricts the result to the cars for which your first ORDER BY expression would be 1, but it can be fast since it can use the index.
If you find 10 cars that way, you are done.
Otherwise, run a second query with a WHERE condition on price that corresponds to the next best price criterion; it can again use the same index, but will be slower.
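For example, with the 12490 EUR reference price, that second query's price condition could cover the 40% band from the original ORDER BY (inverting ABS(price - 12490) < price * 0.4 the same way the first query inverted the 20% band), and one way to avoid re-reading the rows the first query already returned is to exclude the narrower band:

AND cars.price BETWEEN 12490*5/7 AND 12490*5/3
AND cars.price NOT BETWEEN 12490*5/6 AND 12490*5/4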
Proceed that way until you have got 10 cars (the last query will have no condition on price and will be as slow as before).
This will be a loss if you end up having to run all four queries because the first three don't find 10 cars, but it should be faster in the other cases.