
How to access members of an rdf list with rdflib (or plain sparql)

What is the best way to access the members of an RDF list? I'm using rdflib (Python), but an answer in plain SPARQL is also fine (such an answer can be used through rdfextras, an rdflib helper library).

I'm trying to access the authors of a particular journal article in RDF produced by Zotero (some fields have been removed for brevity):

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:z="http://www.zotero.org/namespaces/export#"
 xmlns:dcterms="http://purl.org/dc/terms/"
 xmlns:bib="http://purl.org/net/biblio#"
 xmlns:foaf="http://xmlns.com/foaf/0.1/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
 xmlns:link="http://purl.org/rss/1.0/modules/link/">
    <bib:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/18273724">
        <z:itemType>journalArticle</z:itemType>
        <dcterms:isPartOf rdf:resource="urn:issn:0954-6634"/>
        <bib:authors>
            <rdf:Seq>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Lee</foaf:surname>
                        <foaf:givenname>Hyoun Seung</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Lee</foaf:surname>
                        <foaf:givenname>Jong Hee</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Ahn</foaf:surname>
                        <foaf:givenname>Gun Young</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Lee</foaf:surname>
                        <foaf:givenname>Dong Hun</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Shin</foaf:surname>
                        <foaf:givenname>Jung Won</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Kim</foaf:surname>
                        <foaf:givenname>Dong Hyun</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
                <rdf:li>
                    <foaf:Person>
                        <foaf:surname>Chung</foaf:surname>
                        <foaf:givenname>Jin Ho</foaf:givenname>
                    </foaf:Person>
                </rdf:li>
            </rdf:Seq>
        </bib:authors>

        <dc:title>Fractional photothermolysis for the treatment of acne scars: a report of 27 Korean patients</dc:title>
        <dcterms:abstract>OBJECTIVES: Atrophic post-acne scarring remains a therapeutically challe *CUT*, erythema and edema. CONCLUSIONS: The 1550-nm erbium-doped FP is associated with significant patient-reported improvement in the appearance of acne scars, with minimal downtime.</dcterms:abstract>
        <bib:pages>45-49</bib:pages>
        <dc:date>2008</dc:date>
        <z:shortTitle>Fractional photothermolysis for the treatment of acne scars</z:shortTitle>
        <dc:identifier>
            <dcterms:URI>
               <rdf:value>http://www.ncbi.nlm.nih.gov/pubmed/18273724</rdf:value>
            </dcterms:URI>
        </dc:identifier>
        <dcterms:dateSubmitted>2010-12-06 11:36:52</dcterms:dateSubmitted>
        <z:libraryCatalog>NCBI PubMed</z:libraryCatalog>
        <dc:description>PMID: 18273724</dc:description>
    </bib:Article>
    <bib:Journal rdf:about="urn:issn:0954-6634">
        <dc:title>The Journal of Dermatological Treatment</dc:title>
        <prism:volume>19</prism:volume>
        <prism:number>1</prism:number>
        <dcterms:alternative>J Dermatolog Treat</dcterms:alternative>
        <dc:identifier>DOI 10.1080/09546630701691244</dc:identifier>
        <dc:identifier>ISSN 0954-6634</dc:identifier>
    </bib:Journal>
</rdf:RDF>
tjb asked Jan 15 '11 22:01



1 Answer

RDF containers are a pain in general and quite annoying to handle. I am posting two solutions, one without SPARQL and one with SPARQL. Personally I prefer the second one, the one that uses SPARQL.

Example 1: without SPARQL

To get all the authors for a given article, as in your case, you could do something like the code below.

I have added comments so that it is self-explanatory. The most important bit is the use of g.triples(triple_pattern): with this graph function you can filter an rdflib Graph and search for the triple patterns you need.

When an rdf:Seq is parsed, predicates of the form:

http://www.w3.org/1999/02/22-rdf-syntax-ns#_1

http://www.w3.org/1999/02/22-rdf-syntax-ns#_2

http://www.w3.org/1999/02/22-rdf-syntax-ns#_3

are created. rdflib does not return them in any guaranteed order, so you need to sort them to traverse the sequence in the right order.
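The index can be read off such a predicate before sorting. The following is only a minimal sketch: seq_index is a hypothetical helper (not part of rdflib), and plain strings stand in for the URIRef objects rdflib would return, which is an assumption made to keep the snippet self-contained. It returns None for anything that is not a membership predicate, so rdf:type can be filtered out with the same function.

```python
import re

# Namespace shared by rdf:type and the rdf:_1, rdf:_2, ... membership predicates
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MEMBER_RE = re.compile(re.escape(RDF_NS) + r"_([1-9][0-9]*)$")


def seq_index(predicate):
    """Return n for an rdf:_n predicate, or None for any other predicate.

    Hypothetical helper for illustration only; it is not an rdflib function.
    """
    match = MEMBER_RE.match(str(predicate))
    return int(match.group(1)) if match else None


# Stand-ins for the predicates found on an rdf:Seq node, in arbitrary order
predicates = [RDF_NS + "_3", RDF_NS + "type", RDF_NS + "_1", RDF_NS + "_2"]

# Keep only membership predicates, then order them by their numeric index
members = [p for p in predicates if seq_index(p) is not None]
ordered = sorted(members, key=seq_index)
print([seq_index(p) for p in ordered])
```

Matching against the namespace avoids hard-coding a character offset, at the cost of a regular expression.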

import rdflib

RDF = rdflib.namespace.RDF

# Parse the file
g = rdflib.Graph()
g.parse("zot.rdf")

# Sanity check that we got something back
print("Number of triples: %d" % len(g))

# A couple of handy namespaces to use later
BIB = rdflib.Namespace("http://purl.org/net/biblio#")
FOAF = rdflib.Namespace("http://xmlns.com/foaf/0.1/")

# Author counter, printed next to each author
i = 0

# Article for which we want the list of authors
article = rdflib.term.URIRef("http://www.ncbi.nlm.nih.gov/pubmed/18273724")

# The outer loop is equivalent to "get the authors container for article x"
for triple in g.triples((article, BIB["authors"], None)):

    # Drop the rdf:type triple, because we only want the membership triples,
    # whose predicates have the form http://www.w3.org/1999/02/22-rdf-syntax-ns#_SEQ_NUMBER
    # where SEQ_NUMBER is the index of the element in the rdf:Seq
    list_triples = [t for t in g.triples((triple[2], None, None)) if t[1] != RDF['type']]

    # Sort the authors by the predicate of the triple; order in a sequence matters.
    # "http://www.w3.org/1999/02/22-rdf-syntax-ns#_435"[44:] returns "435",
    # and since we want numeric order we use int(x[1][44:]) (x[1] is the predicate)
    authors_sorted = sorted(list_triples, key=lambda x: int(x[1][44:]))

    # Iterate over the author bnodes and fetch surname and given name
    for author_bnode in authors_sorted:
        # Reset per author so a missing value is not carried over from the previous one
        author_surname = author_name = None
        for x in g.triples((author_bnode[2], FOAF['surname'], None)):
            author_surname = x[2]
        for y in g.triples((author_bnode[2], FOAF['givenname'], None)):
            author_name = y[2]
        print("author(%s): %s %s" % (i, author_name, author_surname))
        i += 1

This example shows how to do this without using SPARQL.

Example 2: With SPARQL

Here is exactly the same example, but using SPARQL.

# Reuses g, RDF, FOAF and BIB from Example 1.
# The two register calls are for rdflib 3.x with rdfextras installed;
# rdflib 4 and later ship a SPARQL engine, so they can be dropped there.
rdflib.plugin.register('sparql', rdflib.query.Processor,
                       'rdfextras.sparql.processor', 'Processor')
rdflib.plugin.register('sparql', rdflib.query.Result,
                       'rdfextras.sparql.query', 'SPARQLQueryResult')

query = """
SELECT ?seq_index ?name ?surname WHERE {
     <http://www.ncbi.nlm.nih.gov/pubmed/18273724> bib:authors ?seq .
     ?seq ?seq_index ?seq_bnode .
     ?seq_bnode foaf:givenname ?name .
     ?seq_bnode foaf:surname ?surname .
}
"""
# Sort the rows numerically by the index at the end of the ?seq_index predicate
for row in sorted(g.query(query, initNs=dict(rdf=RDF, foaf=FOAF, bib=BIB)),
                  key=lambda x: int(x[0][44:])):
    print("Author(%s) %s %s" % (row[0][44:], row[1], row[2]))

As this shows, we still have to do the sorting ourselves because the library doesn't handle it. In the query, the variable seq_index holds the predicate that carries the sequence position, and that is the value the lambda function sorts on.
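The int(...) conversion in the sort key matters once a sequence has ten or more members. The sketch below illustrates this with made-up rows: the tuples and the GivenN/SurnameN values are placeholders standing in for query results, not data from the Zotero export, and position is a hypothetical helper.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Placeholder rows shaped like the query results: (seq_index, name, surname)
rows = [(RDF_NS + "_%d" % n, "Given%d" % n, "Surname%d" % n)
        for n in (10, 2, 1, 11, 3)]


def position(row):
    """Numeric position taken from the rdf:_n predicate in the first column."""
    return int(row[0][len(RDF_NS) + 1:])


# Sorting on the raw predicate string compares characters, so _10 lands before _2
by_string = [position(r) for r in sorted(rows, key=lambda r: r[0])]

# Sorting on the parsed integer gives the intended sequence order
by_number = [position(r) for r in sorted(rows, key=position)]

print(by_string)
print(by_number)
```

With the seven authors in the question both orderings would agree, which is why the difference is easy to miss.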

Manuel Salvadores answered Sep 20 '22 13:09
