
Query for best match to a string with SPARQL?

I have a list of movie titles and want to look them up in DBpedia for meta information such as "director". But I have trouble identifying the correct movie with SPARQL, because the titles sometimes don't match exactly.

How can I get the best match for a movie title from DBpedia using SPARQL?

Some problematic examples:

  • My List: "Die Hard: with a Vengeance" vs. DBpedia: "Die Hard with a Vengeance"
  • My List: "Hachi" vs. DBpedia: "Hachi: A Dog's Tale"

My current approach is to query the DBpedia endpoint for all movies, filter them by checking for the individual tokens of the title (without punctuation), order by title, and return the first result. E.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "die") && 
      contains(lcase(str(?title)),"hard")
   )
}
ORDER BY (?title)
LIMIT 1
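
Writing one contains() test per token by hand gets tedious for a whole list of titles. As a minimal sketch, assuming the queries are assembled by a client-side script, the same query could be generated from a title. The helper names `tokenize` and `build_token_filter_query` are hypothetical, and sending the query to the endpoint is left out:

```python
import re


def tokenize(title):
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", title.lower())


def build_token_filter_query(title, limit=1):
    """Build the token-filter query shown above for an arbitrary title.

    Hypothetical helper: it only assembles the query text and does not
    contact any endpoint.
    """
    tokens = tokenize(title)
    if not tokens:
        raise ValueError("title contains no usable tokens")
    # One contains() test per token, joined with && as in the manual query.
    conditions = " &&\n      ".join(
        f'contains(lcase(str(?title)), "{token}")' for token in tokens
    )
    return (
        "SELECT ?resource ?title ?director WHERE {\n"
        "   ?resource foaf:name ?title .\n"
        "   ?resource rdf:type schema:Movie .\n"
        "   ?resource dbo:director ?director .\n"
        f"   FILTER (\n      {conditions}\n   )\n"
        "}\n"
        "ORDER BY (?title)\n"
        f"LIMIT {limit}"
    )


query = build_token_filter_query("Die Hard: with a Vengeance")
print(query)
```

Because the tokens consist only of word characters, they cannot contain a quote that would break out of the string literal in the FILTER.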

This approach is very slow and also sometimes fails, e.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "hachi") 
   )
}
ORDER BY (?title)
LIMIT 10

where the correct result is in second place:

  resource                                          title                        director
  http://dbpedia.org/resource/Chachi_420            "Chachi 420"@en              http://dbpedia.org/resource/Kamal_Haasan
  http://dbpedia.org/resource/Hachi:_A_Dog's_Tale   "Hachi: A Dog's Tale"@en     http://dbpedia.org/resource/Lasse_Hallström    
  http://dbpedia.org/resource/Hachiko_Monogatari    "Hachikō Monogatari"@en      http://dbpedia.org/resource/Seijirō_Kōyama
  http://dbpedia.org/resource/Thachiledathu_Chundan "Thachiledathu Chundan"@en   http://dbpedia.org/resource/Shajoon_Kariyal
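
The result above shows the weakness of substring matching: "hachi" also occurs inside "Chachi" and "Thachiledathu". One possible workaround, which is an assumption and not something the endpoint does, is to fetch a handful of candidates and re-rank them on the client by whole-token overlap, preferring shorter titles on ties. The function names below are hypothetical:

```python
import re


def tokenize(title):
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", title.lower())


def rank_candidates(query_title, candidate_titles):
    """Sort candidates: most whole-token hits first, then shortest title."""
    wanted = set(tokenize(query_title))

    def sort_key(candidate):
        hits = len(wanted & set(tokenize(candidate)))
        return (-hits, len(candidate))

    return sorted(candidate_titles, key=sort_key)


# Titles taken from the result table above.
candidates = [
    "Chachi 420",
    "Hachi: A Dog's Tale",
    "Hachikō Monogatari",
    "Thachiledathu Chundan",
]
best = rank_candidates("Hachi", candidates)[0]
print(best)  # Hachi: A Dog's Tale
```

This only helps when the correct movie is among the fetched rows, so the LIMIT in the query has to be large enough to include it.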

Any ideas on how to solve this problem? Or, even better: how can I query for the best matches to a string with SPARQL in general?

Thanks!

asked Jul 30 '16 07:07 by dynobo


1 Answer

I adapted the regex approach mentioned in the comments and came up with a solution that works pretty well, better than anything I could get with bif:contains:

   SELECT ?resource ?title ?match (strlen(str(?title)) AS ?lenTitle) (strlen(str(?match)) AS ?lenMatch)

   WHERE {
      ?resource foaf:name ?title .
      ?resource rdf:type schema:Movie .
      ?resource dbo:director ?director .
      bind( replace(LCASE(CONCAT('x',?title)), "^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$", "$1$2$3") as ?match ) 
   }

   ORDER BY DESC(?lenMatch) ASC(?lenTitle)

   LIMIT 5
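
The pattern in this query is hard-coded for "die", "hard" and "with". As a rough sketch, assuming the query is built by a client-side script, the pattern and the replacement string could be generated from any title. The names `build_match_pattern` and `local_match` are hypothetical, and `local_match` only approximates replace() using Python's `re` module so that the scoring can be tried without an endpoint:

```python
import re


def tokenize(title):
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", title.lower())


def build_match_pattern(title, max_tokens=3):
    """Return (pattern, replacement) in the style of the query above."""
    tokens = tokenize(title)[:max_tokens]
    if not tokens:
        raise ValueError("title contains no usable tokens")
    # The first token must follow the artificial 'x' prefix directly;
    # every later token may be preceded by arbitrary text.
    pattern = "^x(" + tokens[0] + ")*"
    for token in tokens[1:]:
        pattern += "(?:.*?(" + token + "))*"
    pattern += ".*$"
    replacement = "".join(f"${i}" for i in range(1, len(tokens) + 1))
    return pattern, replacement


def local_match(pattern, title):
    """Approximate the ?match binding: concatenate the captured tokens."""
    found = re.match(pattern, "x" + title.lower())
    if found is None:
        return ""
    return "".join(group or "" for group in found.groups())


pattern, replacement = build_match_pattern("Die Hard: with a Vengeance")
print(pattern)      # ^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$
print(replacement)  # $1$2$3
print(local_match(pattern, "Die Hard with a Vengeance"))  # diehardwith
print(local_match(pattern, "Die Hard"))                   # diehard
```

The length of the concatenated tokens plays the role of ?lenMatch: a title containing more of the wanted tokens yields a longer match and therefore sorts first.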

It's not perfect, so I'm still open to suggestions.

answered Nov 03 '22 22:11 by dynobo