I have a cities table which looks like this.
|id| Name |
|1 | Paris |
|2 | London |
|3 | New York|
I have a tags table which looks like this.
|id| tag |
|1 | Europe |
|2 | North America |
|3 | River |
and a cities_tags table:
|id| city_id | tag_id |
|1 | 1 | 1 |
|2 | 1 | 3 |
|3 | 2 | 1 |
|4 | 2 | 3 |
|5 | 3 | 2 |
|6 | 3 | 3 |
How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3)
I have found the Jaccard index but I'm unsure as how best to implement this.
You question about How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3) and based on your provided data set there is only one thing to relate that is the common tags between the cities so the cities which shares the common tags would be the closest one below is the subquery which finds the cities (other than which is provided to find its closest cities) that shares the common tags
SELECT * FROM `cities` WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )
I assume you will input one of the city id or name to find their closest one in my case "Paris" has the id one
SELECT tag_id FROM `cities_tags` WHERE city_id=1
It will find all the tags id which paris has then
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )
It will fetch all the cities except paris that has the some same tags that paris also has
Here is your Fiddle
While reading about the Jaccard similarity/index found some stuff to understand about the what actualy the terms is lets take this example we have two sets A & B
Set A={A, B, C, D, E}
Set B={I, H, G, F, E, D}
Formula to calculate the jaccard similarity is JS=(A intersect B)/(A union B)
A intersect B = {D,E}= 2
A union B ={A, B, C, D, E,I, H, G, F} =9
JS=2/9 =0.2222222222222222
Now move towards your scenario
Paris has the tag_ids 1,3 so we make the set of this and call our Set P ={Europe,River}
London has the tag_ids 1,3 so we make the set of this and call our Set L ={Europe,River}
New York has the tag_ids 2,3 so we make the set of this and call our Set NW ={North America,River}
Calculting the JS Paris with London JSPL = P intersect L / P union L , JSPL = 2/2 = 1
Calculting the JS Paris with New York JSPNW = P intersect NW / P union NW ,JSPNW = 1/3 = 0.3333333333
Here is the query so far which calcluates the perfect jaccard index you can see the below fiddle example
SELECT a.*,
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index
FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` ,
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT GROUP_CONCAT(tag_id SEPARATOR ',') FROM `cities_tags` WHERE city_id= 1)AS parisset
FROM `cities_tags`
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC
In above query i have the i have derived the result set to two subselects in order get my custom calculated aliases
You can add the filter in above query not to calculate the similarity with itself
SELECT a.*,
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index
FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` ,
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT GROUP_CONCAT(tag_id SEPARATOR ',') FROM `cities_tags` WHERE city_id= 1)AS parisset
FROM `cities_tags`
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC
So the result shows Paris is closely related to London and then related to New York
Jaccard Similarity Fiddle
select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c
inner join
(
select city_id, count(*) as val
from cities_tags
where tag_id in (select tag_id from cities_tags where city_id=1)
and not city_id in (1)
group by city_id
) as cnt
on c.id=cnt.city_id
order by jaccard_index desc
This query is statically referring to city_id=1
, so you'll have to make that a variable in both the where tag_id in
clause, and the not city_id in
clause.
If I understood the Jaccard index properly, then it also returns that value ordered by the 'most closely related'. The results in our example look like this:
|name |jaccard_index |
|London |0.6667 |
|New York |0.3333 |
With a better understanding of how to implement Jaccard Index:
After reading a bit more on wikipedia about the Jaccard Index, I've come up with a better way implement a query for our example dataset. Essentially, we will be comparing our chosen city with each other city in the list independently, and using the count of common tags divided by the count of distinct total tags selected between the two cities.
select c.name,
case -- when this city's tags are a subset of the chosen city's tags
when not_in.cnt is null
then -- then the union count is the chosen city's tag count
intersection.cnt/(select count(tag_id) from cities_tags where city_id=1)
else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1))
end as jaccard_index
-- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c
inner join
(
-- select the count of tags for each city that match our chosen city
select city_id, count(*) as cnt
from cities_tags
where tag_id in (select tag_id from cities_tags where city_id=1)
and city_id!=1
group by city_id
) as intersection
on c.id=intersection.city_id
left join
(
-- select the count of tags for each city that are not in our chosen city's tag list
select city_id, count(tag_id) as cnt
from cities_tags
where city_id!=1
and not tag_id in (select tag_id from cities_tags where city_id=1)
group by city_id
) as not_in
on c.id=not_in.city_id
order by jaccard_index desc
The query is a bit lengthy, and I don't know how well it will scale, but it does implement a true Jaccard Index, as requested in the question. Here are the results with the new query:
+----------+---------------+
| name | jaccard_index |
+----------+---------------+
| London | 1.0000 |
| New York | 0.3333 |
+----------+---------------+
Edited again to add comments to the query, and take into account when the current city's tags are a subset of the chosen city's tags
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With