
Querying large RDF Datasets out of memory

I want to download two or more datasets to my machine and start a SPARQL endpoint for each. I tried Fuseki, which is part of the Jena project. However, it loads the whole dataset into memory, which is undesirable for large datasets like DBpedia, especially since I intend to run other things alongside it (multiple SPARQL endpoints and a federated query system over them).

Just to give you a heads up, I intend to link multiple datasets using SILK and query them with the FEDX federated query system. If you would recommend changing any of the systems I'm using, or can give me a tip, that would be great. It would also be a great help if you could suggest a dataset that fits this project.

user2467278 asked Jun 09 '13 02:06


2 Answers

Jena's Fuseki can use TDB as a storage mechanism, and TDB stores its data on disk. The TDB documentation on caching on 32-bit and 64-bit Java systems discusses how the file contents are mapped into memory. I do not believe that TDB/Fuseki loads the entire dataset into memory; that would not be feasible for large datasets, yet TDB can handle rather large ones. I think you should consider using tdbloader to create a TDB store; then you can point Fuseki to it.

There's an example of setting up a TDB store in this answer. In there, the query is performed with tdbquery, but according to the Running a Fuseki server section of the documentation, all you will need to do to start Fuseki with the same TDB store is use the --loc=DIR option:

  • --loc=DIR
    Use an existing TDB database. Create an empty one if it does not exist.
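
The two steps above (bulk-load with tdbloader, then serve with Fuseki) can be sketched as a small helper that only builds the command lines. This is a minimal sketch, not a definitive setup: it assumes the `tdbloader` and `fuseki-server` scripts are on the PATH, and the file names, directories, dataset names, and the one-port-per-dataset scheme are hypothetical placeholders for the "one endpoint per dataset" goal in the question.

```python
import shlex


def tdb_commands(data_file, tdb_dir, dataset_path, port=3030):
    """Build the two command lines: load a file into TDB, then serve that store.

    Assumes `tdbloader` and `fuseki-server` are on the PATH (an assumption,
    adjust to your installation).
    """
    # Step 1: bulk-load the RDF file into an on-disk TDB store.
    load = ["tdbloader", f"--loc={tdb_dir}", data_file]
    # Step 2: start Fuseki over the existing store instead of an in-memory dataset.
    serve = ["fuseki-server", f"--port={port}", f"--loc={tdb_dir}", dataset_path]
    return load, serve


if __name__ == "__main__":
    # Hypothetical datasets; each gets its own store and its own port.
    datasets = [
        ("dbpedia.nt", "/data/tdb/dbpedia", "/dbpedia"),
        ("yago.nt", "/data/tdb/yago", "/yago"),
    ]
    for offset, (data_file, tdb_dir, name) in enumerate(datasets):
        load, serve = tdb_commands(data_file, tdb_dir, name, port=3030 + offset)
        print(shlex.join(load))
        print(shlex.join(serve))
```

Printing the commands rather than running them keeps the sketch safe to try; pass the lists to `subprocess.run` once the paths match your machine.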

Joshua Taylor answered Nov 09 '22 22:11


As Joshua said, Jena's Fuseki uses TDB, so it can store very large ontologies without using a lot of resources. For example, you can load the Yago2 taxonomy into it and use only about 600MB of RAM. You do not need to embed Fuseki in your Java project; you can run it from the command line and query it from inside your project.

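Start it from the Windows command line as follows. Note that --file loads the file into an in-memory dataset; to serve an on-disk TDB store, use the --loc=DIR option instead: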

java -jar c:\your_ontology_directory\fuseki-server.jar ^
  --file=your_ontology.rdf /your_namespace

Then you can run a SPARQL query against it with any GET/POST application (even in your browser):

http://localhost:3030/your_namespace/sparql?query=SELECT * { ?s ?p ?o }
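
A browser will percent-encode that query string for you, but a program has to do it itself. As a minimal sketch, assuming the same endpoint URL as above and only the Python standard library, the request URL can be built like this (the `fetch` helper is illustrative and needs a running server, so it is defined but not called):

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def sparql_get_url(endpoint, query):
    """Return the GET URL for a SPARQL query, with the query percent-encoded."""
    return endpoint + "?" + urlencode({"query": query})


def fetch(url):
    """Send the request and return the response body (requires a running endpoint)."""
    request = Request(url, headers={"Accept": "application/sparql-results+xml"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


url = sparql_get_url(
    "http://localhost:3030/your_namespace/sparql",
    "SELECT * { ?s ?p ?o }",
)
print(url)
# prints http://localhost:3030/your_namespace/sparql?query=SELECT+%2A+%7B+%3Fs+%3Fp+%3Fo+%7D
```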

The results are, by default, returned in XML format.

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="s"/>
    <variable name="p"/>
    <variable name="o"/>
  </head>
  <results>
    <result>
      <binding name="s">
        <uri>http://yago-knowledge/resource/wordnet_gulag_103467887</uri>
      </binding>
      <binding name="p">
        <uri>http://www.w3.org/2000/01/rdf-schema#subClassOf</uri>
      </binding>
      <binding name="o">
        <uri>http://yago-knowledge/resource/wordnet_prison_camp_104005912</uri>
      </binding>
    </result>
    …
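
To use results like these inside a program, the XML can be turned into one dictionary per row. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `parse_results` name and the shortened sample document are illustrative assumptions, and the sketch keeps only the text of each bound value (it ignores datatypes and language tags).

```python
import xml.etree.ElementTree as ET

# Namespace used by the SPARQL Query Results XML Format.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def parse_results(xml_text):
    """Return a list of {variable: value} dicts, one per <result> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            value = binding[0]  # the single child: <uri>, <literal> or <bnode>
            row[binding.get("name")] = value.text
        rows.append(row)
    return rows


# Shortened sample with a single result, modelled on the output above.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="s"/><variable name="p"/><variable name="o"/></head>
  <results>
    <result>
      <binding name="s"><uri>http://yago-knowledge/resource/wordnet_gulag_103467887</uri></binding>
      <binding name="p"><uri>http://www.w3.org/2000/01/rdf-schema#subClassOf</uri></binding>
      <binding name="o"><uri>http://yago-knowledge/resource/wordnet_prison_camp_104005912</uri></binding>
    </result>
  </results>
</sparql>"""

rows = parse_results(SAMPLE)
print(rows[0]["p"])
# prints http://www.w3.org/2000/01/rdf-schema#subClassOf
```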

Ali R answered Nov 09 '22 20:11