
Why is rdflib so slow?

Tags: rdf, sparql, rdflib

I have a large RDF file:

  • size: 470 MB
  • lines: almost 6 million
  • unique triple subjects: about 650,000
  • total triples: about 4,200,000

I loaded the RDF data into the Berkeley DB backend of rdflib via:

import rdflib

graph = rdflib.Graph("Sleepycat")  # Berkeley DB-backed persistent store
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")

It took many hours to complete on my notebook. The computer isn't particularly powerful (Intel B980 CPU, 4 GB of RAM, no SSD) and the dataset is large - but even so, many hours for this task seems rather long. Maybe part of the time goes into indexing / building the store's internal data structures?

What is really irritating is the time it takes for the following queries to complete:

SELECT (COUNT(DISTINCT ?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 667,445)

took over 20 minutes and

SELECT (COUNT(?s) as ?c)
WHERE {
    ?s ?p ?o
}

(Result: 4,197,399)

took over 25 minutes.

In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of the time, given appropriate indexing.

So my questions are:

Why is rdflib so slow (especially for queries)?

Can I tune / optimize the database, like I can with indexes in an RDBMS?

Is another (free and "compact") triple store better suited for data of this size, performance-wise?

Johann Gottfried asked Nov 16 '22



1 Answer

I experienced similarly slow behavior with rdflib. For me, a possible solution lay in changing the underlying graph storage to oxrdflib, which improved SPARQL query speed drastically.

see: https://pypi.org/project/oxrdflib/

achiminator answered Nov 22 '22
