I have a large RDF file.
I loaded the RDF data into the Berkeley DB ("Sleepycat") backend of rdflib via:
import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=True)
graph.parse("authorities-geografikum_lds.rdf")
graph.close()  # flush and close the Berkeley DB store
It took many hours to complete on my notebook. The machine isn't very powerful (Intel B980 CPU, 4 GB of RAM, no SSD) and the file is large, but many hours still seems excessive for this task. Maybe part of the time is spent indexing and optimizing the data structures?
What is really irritating is the time it takes for the following queries to complete:
SELECT (COUNT(DISTINCT ?s) AS ?c)
WHERE {
?s ?p ?o
}
(Result: 667,445)
took over 20 minutes and
SELECT (COUNT(?s) AS ?c)
WHERE {
?s ?p ?o
}
(Result: 4,197,399)
took over 25 minutes.
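For reference, here is a minimal sketch of how such a count can be executed through rdflib, assuming the same "store" directory created above:

import rdflib

graph = rdflib.Graph("Sleepycat")
graph.open("store", create=False)  # reopen the existing Berkeley DB store

# counting queries like the ones above go through graph.query()
result = graph.query("""
    SELECT (COUNT(?s) AS ?c)
    WHERE { ?s ?p ?o }
""")
for row in result:
    print(row.c)

graph.close()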
In my experience, a relational DBMS filled with comparable data would finish a corresponding query in a small fraction of that time, given appropriate indexing.
So my questions are:
Why is rdflib so slow (especially for queries)?
Can I tune / optimize the database, as I can with indexes in an RDBMS?
Is another (free and "compact") triple store better suited for data of this size, performance-wise?
I experienced similarly slow behavior with RDFLib. For me, a possible solution was to change the underlying graph store to oxrdflib, which sped up SPARQL queries drastically.
see: https://pypi.org/project/oxrdflib/
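A minimal sketch of that switch, assuming oxrdflib is installed (pip install oxrdflib); it registers an rdflib store plugin named "Oxigraph", so essentially only the Graph constructor changes:

import rdflib

# oxrdflib plugs the Oxigraph engine into rdflib as a store named "Oxigraph"
graph = rdflib.Graph(store="Oxigraph")
graph.parse("authorities-geografikum_lds.rdf")

result = graph.query("""
    SELECT (COUNT(DISTINCT ?s) AS ?c)
    WHERE { ?s ?p ?o }
""")
for row in result:
    print(row.c)

The graph above lives in memory; the project page linked above also describes persisting the store to disk.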