How to remove duplicates in a SPARQL query

Tags:

sparql

dbpedia

I wrote this query, which returns a list of couples that meet a particular condition (run at http://live.dbpedia.org/sparql):

SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
    select DISTINCT ?actor ?person2 (count (?film) as ?cnt) 
    where { 
        ?film    dbo:starring ?actor .
        ?actor dbo:spouse ?person2. 
        ?film    dbo:starring ?person2.
    }
    order by ?actor
}
FILTER (?cnt >9)
}

The problem is that some rows are duplicated. For example:

http://dbpedia.org/resource/George_Burns http://dbpedia.org/resource/Gracie_Allen 12

http://dbpedia.org/resource/Gracie_Allen http://dbpedia.org/resource/George_Burns 12

How can I remove these duplicates? I tried adding a gender condition on ?actor, but that damaged the current result.

NASRIN asked Apr 01 '16 04:04


2 Answers

Natan Cox's answer shows the typical way to exclude this kind of pseudo-duplicate. The results aren't actually duplicates, because in one row, e.g., George Burns is the ?actor, and in the other he is the ?person2. In many cases, you can add a filter that requires the two values to be ordered, and that will remove the duplicate cases. E.g., when you have data like:

:a :likes :b .
:a :likes :c .

and you search for

select ?x ?y where { 
  :a :likes ?x, ?y .
}

you can add filter(?x < ?y) to enforce an ordering between ?x and ?y, which will remove these pseudo-duplicates. However, in this case it's a bit trickier, since ?actor and ?person2 aren't found using the same criteria. If DBpedia contains

:PersonB dbo:spouse :PersonA

but not

:PersonA dbo:spouse :PersonB

then the simple filter won't work, because you'll never find the triple where the subject PersonA is less than the object PersonB. So in this case, you also need to modify your query a bit to make the criteria symmetric:

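select distinct ?actor ?spouse (count(distinct ?film) as ?count) {
  ?film dbo:starring ?actor, ?spouse .
  # match the spouse link in either direction
  ?actor dbo:spouse|^dbo:spouse ?spouse .
  # keep only one ordering of each pair
  filter(?actor < ?spouse)
}
group by ?actor ?spouse
# distinct: the alternative path can match twice when both directions are declared
having (count(distinct ?film) > 9)
order by ?actor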

(This query also shows that you don't need a subquery here, you can use having to "filter" on aggregate values.) But the important part is using the property path dbo:spouse|^dbo:spouse to find a value for ?spouse such that either ?actor dbo:spouse ?spouse or ?spouse dbo:spouse ?actor. This makes the relationship symmetric, so that you're guaranteed to get all the pairs, even if the relationship is only declared in one direction.
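
One portability caveat, offered as a sketch rather than a definitive fix: the SPARQL 1.1 operator table does not define < for IRIs, so filter(?actor < ?spouse) relies on the endpoint (Virtuoso, in DBpedia's case) accepting it as an extension. On a stricter engine, comparing the string forms with STR() should give the same effect. The variant below assumes the standard dbo: namespace and keeps the question's threshold of more than 9 films. It counts distinct films because the alternative path can bind the same film twice when the spouse link is declared in both directions.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?actor ?spouse (COUNT(DISTINCT ?film) AS ?count) {
  # both people star in the same film
  ?film dbo:starring ?actor, ?spouse .
  # spouse link declared in either direction
  ?actor dbo:spouse|^dbo:spouse ?spouse .
  # compare the IRIs as strings so only one ordering of each pair survives
  FILTER(STR(?actor) < STR(?spouse))
}
GROUP BY ?actor ?spouse
HAVING (COUNT(DISTINCT ?film) > 9)
ORDER BY ?actor
```

With this ordering, the George Burns / Gracie Allen pair should appear once, with George_Burns as ?actor, since his IRI sorts first as a string.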

Joshua Taylor answered Oct 27 '22 18:10


These are not actual duplicates, of course, since you can look at each pair from either side. If you want to remove them, the way to do it is to add a filter. It is a bit of a dirty hack, but it keeps only one of the two rows that are the "same".

SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
    select DISTINCT ?actor ?person2 (count (?film) as ?cnt) 
    where { 
        ?film    dbo:starring ?actor .
        ?actor dbo:spouse ?person2. 
        ?film    dbo:starring ?person2.
        FILTER (?actor < ?person2)
    }
    order by ?actor
}
FILTER (?cnt >9)
}

Natan Cox answered Oct 27 '22 19:10