I have a table in postgresql containing some cars +1000000 records:
+----+--------+------+---------+-----------+-------------+------------+------------+
| id | price | year | mileage | fuel_type | body_type | brand | model |
+----+--------+------+---------+-----------+-------------+------------+------------+
| 1 | 4894 | 2011 | 121842 | "Benzin" | "Sedan" | "Toyota" | "Yaris" |
| 2 | 4989 | 2012 | 33901 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 3 | 4990 | 2013 | 55105 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 3 | 5290 | 2013 | 20967 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 5 | 5594 | 2008 | 121281 | "Benzin" | "Hatchback" | "Mercedes" | "A170" |
| 6 | 4690 | 2012 | 71303 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 7 | 5290 | 2013 | 58300 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 8 | 5890 | 2013 | 35732 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 9 | 5990 | 2013 | 38777 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
| 10 | 6180 | 2013 | 69491 | "Benzin" | "Hatchback" | "VW" | "up!" |
| 11 | 6490 | 2012 | 72900 | "Benzin" | "Sedan" | "Renault" | "Clio III" |
| 12 | 6790 | 2012 | 49541 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 13 | 6790 | 2012 | 46377 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 14 | 6790 | 2012 | 45200 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 15 | 6894 | 2007 | 108840 | "Benzin" | "Sedan" | "VW" | "Golf V" |
| 16 | 6990 | 2009 | 54200 | "Benzin" | "Sedan" | "Renault" | "Mégane" |
| 17 | 6990 | 2012 | 40652 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 18 | 6990 | 2012 | 38080 | "Benzin" | "Sedan" | "Renault" | "Clio III" |
| 19 | 7290 | 2012 | 28600 | "Benzin" | "Hatchback" | "Renault" | "Clio III" |
| 20 | 7290 | 2013 | 52800 | "Benzin" | "Hatchback" | "Renault" | "Twingo" |
+----+--------+------+---------+-----------+-------------+------------+------------+
I would like to create a recommendation engine, that can return the 20 most "similar" matches based on some varying criteria e.g. When a user does a search for: brand = 'Renault' AND price < 60000 AND year > 2010
, I want to present, outside the search result some other, more loose results with other cars, that is similar, but doesn't necessarily match all the search criteria exact.
I have tried making some rule based code in ruby, that does something like:
Based on this code, I generate an SQL query with where and order by clauses.
The problem however is, that things get's huge, as I have like 20 different columns I would like to optionally take into consideration, base on the initial criteria. Also I want the recommendation to be backward compatible in the sense that I don't want to just do a simple filtering (WHERE
) query, that in cases might end up returning zero matches. Instead I want it do something similar to when you use text similarity algorithms, where you can compare one phrase to all and get a comparison score for all of them which you can then sort by.
I'm super puzzled about how I could implement this, in an approach this is NOT defining 1000 rules and if/then statements to generate an SQL query. Is there some other technique I could use, or maybe another technology than postgresql?
Calculate weighted deviation for each numerical property:
deviation = abs(actual_value- expected_value)* property_weight
Apply a simplified calculation for textual properties:
deviation = (actual_value <> expected_value)::int* property_weight
Recommend positions in ascending order of sum of deviations.
Example. We are looking for Renault Twingo Hatchback from 2012 with mileage 50000 and price 6000:
select *,
abs(price- 6000)* 100+
abs(year- 2012)* 10000+
abs(mileage- 50000)* 1+
(body_type <> 'Hatchback')::int* 40000+
(brand <> 'Renault')::int* 100000+
(model <> 'Twingo')::int* 50000
as recommendation
from cars
order by recommendation
limit 10;
id | price | year | mileage | fuel_type | body_type | brand | model | recommendation
----+-------+------+---------+-----------+-----------+---------+----------+----------------
9 | 5990 | 2013 | 38777 | Benzin | Hatchback | Renault | Twingo | 22223
8 | 5890 | 2013 | 35732 | Benzin | Hatchback | Renault | Twingo | 35268
7 | 5290 | 2013 | 58300 | Benzin | Hatchback | Renault | Twingo | 89300
4 | 5290 | 2013 | 20967 | Benzin | Hatchback | Renault | Twingo | 110033
3 | 4990 | 2013 | 55105 | Benzin | Hatchback | Renault | Twingo | 116105
2 | 4989 | 2012 | 33901 | Benzin | Hatchback | Renault | Twingo | 117199
12 | 6790 | 2012 | 49541 | Benzin | Hatchback | Renault | Clio III | 129459
13 | 6790 | 2012 | 46377 | Benzin | Hatchback | Renault | Clio III | 132623
14 | 6790 | 2012 | 45200 | Benzin | Hatchback | Renault | Clio III | 133800
20 | 7290 | 2013 | 52800 | Benzin | Hatchback | Renault | Twingo | 141800
(10 rows)
You can easily calibrate the algorithm by changing the weights of properties.
To get more sophistcated approximations of textual properties you can assign numeric values to the properites in auxiliary tables like this:
create table models(id serial primary key, model text, value integer);
insert into models (model, value) values
('Twingo', 10),
('Clio III', 11), -- Clio is more similar to Twingo than to Laguna
('Laguna', 18)
--- etc
and use the values as numeric properties in the main query, e.g.:
select cars.*,
abs(price- 6000)* 100+
abs(year- 2012)* 10000+
abs(mileage- 50000)* 1+
(body_type <> 'Hatchback')::int* 40000+
(brand <> 'Renault')::int* 100000+
abs(models.value- 10)* 50000 -- 10 is a numeric value for Twingo
as recommendation
from cars
join models using(model)
order by recommendation
limit 10;
A note on optimization. If you can rigidly define the scope of any property, place it in the WHERE clause for better performance. For example, if the query cannot return a brand other than the desired one then it makes no sense to calculate the deviation of this property:
select *,
abs(price- 6000)* 100+
abs(year- 2012)* 10000+
abs(mileage- 50000)* 1+
(body_type <> 'Hatchback')::int* 40000+
(model <> 'Twingo')::int* 50000
as recommendation
from cars
where brand = 'Renault' -- !
order by recommendation
limit 10;
Apply machine learning techniques.
MADlib http://madlib.incubator.apache.org/ is an extension to Postgres that gives you an ability to work with various machine learning algorithms right inside the database. It is worth to learn and try.
Start with linear regression for your vectors and then also try random forests and other algorithms and compare what works better in your case (the trick to evaluate the algorithm's quality is simple: you take all your the data you have, use 70-80% of it to train, and then use the remainder to get estimations from the trained engine -- and then use some function to calculate the deviation error, usually people use mean square error approach).
Also, I can recommend an awesome Stanford online course, with online book and lectures published on Youtube (all free!): http://mmds.org/. Various modern approaches to building recommendation engines are described very well in it.
Ideally, you can cache the "text part" (every text related to the row) in a column with type tsvector, this will give you the ability to perform full text search and even give a "weight" to each word so that it is more important when performing the search.
For example, you can give more weight to brand name so that the results will be sorted with that in mind, showing all those of the same brand. Let's assume you have a tsvector column named "fulltext": 'Clio:1A Renault:2B,4C Benzin:5D'::tsvector;
, you can search for it with a tsquery
like 'Renault & Clio'::tsquery;
and it will give results for every Renault and every Clio available, but it will put on top Clio
first and Renault
next. Be aware that if by any chance a Mercedes Clio
exists, it will show that too.
The documentation is very explicit and with some examples, I recommend digging into it.
That being said, the database won't do the work for you in this case. If a Clio is a close match only because they have the same brand (Renault), yes it will do the trick. However if you use (mentally) other arguments like the size of the car, if it's a city-car and such things, you are the only one who can design that algorithm. For example the price range part is not something that any full text search will do for you, you actively have to check if a number is included and eventually sort (unless the number is an exact match).
In the end, your job will be exactly that, create a function that is "smart" based on the user input and will define a complex query that you can run against the database. It's a long process but definitely doable. Try to be smart but not too much and in any case the tsvector
will cover you for all text columns, which will reduce a lot the amount of columns you have.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With