How to reduce the size of the TDB-backed Jena Dataset?

I am working with a simple Jena dataset, which has only a single ~30 MB RDF file imported. As part of the application, I am trying to let users query the default graph (or a named graph) and insert the resulting triples into a new named graph. For this, I use a CONSTRUCT query to produce the result triples, put them into a new model via QueryExecution.execConstruct(), and add that model to the dataset. This appears to work: the dataset gets a new graph node, and the TDB database folder grows on disk.
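Roughly, a minimal sketch of that workflow looks like this (the dataset path and graph URI are simplified placeholders; package names are from current Apache Jena, the older com.hp.hpl.jena packages are analogous):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.tdb.TDBFactory;

    public class ConstructIntoNamedGraph {
        public static void main(String[] args) {
            // Open (or create) the TDB-backed dataset on disk
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");

            // Build the result triples from the default graph with CONSTRUCT
            String query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }";
            try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
                Model result = qe.execConstruct();
                // Store the result as a new named graph
                dataset.addNamedModel("http://example.org/graphs/result", result);
            }
            dataset.close();
        }
    }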

The problem comes up when I try to remove a named graph from the dataset. Using Dataset's removeNamedModel("graphName") method, I remove the model from the dataset. Subsequent queries against that graph name confirm it has been removed. However, the TDB database folder stays the same size on disk, even after syncing and quitting.
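The removal step, again with placeholder names (TDB.sync is the non-transactional way to flush pending changes to disk):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb.TDB;
    import org.apache.jena.tdb.TDBFactory;

    public class RemoveNamedGraph {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/path/to/tdb");

            // Remove the named graph; later queries against this name come back empty
            dataset.removeNamedModel("http://example.org/graphs/result");

            // Flush pending changes to disk (non-transactional usage)
            TDB.sync(dataset);
            dataset.close();
        }
    }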

At first I thought the database was simply marking the deleted graph's space as free so it could be overwritten as new data came in, but this doesn't seem to be the case. If I delete a named graph and replace it immediately afterwards in the same program run, the folder doesn't seem to grow; but if I add a new named graph and delete it in the same run, the folder gets bigger and the removal doesn't free up the disk space. After a few runs the database folder is five or ten times its original size without holding any more data.
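To measure this between runs, I just sum the file sizes in the TDB directory; a quick sketch of that check (the directory path is a placeholder):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class TdbDirSize {
        // Sum the sizes of all regular files under the TDB database directory
        static long dirSize(Path dir) throws IOException {
            try (Stream<Path> files = Files.walk(dir)) {
                return files.filter(Files::isRegularFile)
                            .mapToLong(p -> p.toFile().length())
                            .sum();
            }
        }

        public static void main(String[] args) throws IOException {
            System.out.println(dirSize(Paths.get("/path/to/tdb")) + " bytes");
        }
    }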

Any insight or help would be great, thanks.

asked Jun 18 '12 by paul

1 Answer

You may get more insight by asking on the Jena mailing list ([email protected]) but I will try to answer. You may also wish to take a look at the TDB Architecture page on the website.

TDB stores data by building what it calls a Node Table, which maps RDF nodes to 64-bit integer IDs and vice versa. It then builds separate indexes over these integer IDs, which allow it to perform the various database scans needed to answer SPARQL queries.

Adding data potentially adds entries to both of these structures (the Node Table and the indexes), but removing data only removes entries from the indexes. Thus the Node Table will continue to grow over time even as you delete old data, because nothing is ever deleted from it.

The practical reasons behind this are twofold:

  1. The integer IDs partly encode file offsets, so the ID-to-Node lookup is a fast file scan. As a result, you can't delete parts of the Node Table as data is deleted without rewriting all of the Node IDs; i.e., the Node Table in the ID -> Node direction is a sequential file (which helps make inserts very fast).
  2. When data is deleted, you can't know whether a Node is used multiple times without doing a complete database scan, so you can't tell whether a Node Table entry is safe to delete in the first place (see the sketch after this list). The only viable way to do this would be to implement a full reference-counting scheme, which would in and of itself add complexity to the system and slow down adds and deletes.
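To make point 2 concrete, here is a toy sketch (using an in-memory dataset and placeholder URIs for brevity): the same URI node appears in two graphs, so when one graph is dropped, the store can't know the node is still needed by the other graph without scanning or counting references.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.RDFS;

    public class SharedNodeExample {
        public static void main(String[] args) {
            Dataset ds = DatasetFactory.create();

            // The same RDF node (a URI) is referenced from two named graphs
            Model g1 = ModelFactory.createDefaultModel();
            g1.createResource("http://example.org/thing").addProperty(RDFS.label, "in graph one");
            Model g2 = ModelFactory.createDefaultModel();
            g2.createResource("http://example.org/thing").addProperty(RDFS.label, "in graph two");

            ds.addNamedModel("http://example.org/g1", g1);
            ds.addNamedModel("http://example.org/g2", g2);

            // Dropping g1 removes its index entries, but in TDB the Node Table entry
            // for http://example.org/thing must stay: g2 still uses it, and the store
            // cannot know that without a full scan or a reference-counting scheme.
            ds.removeNamedModel("http://example.org/g1");
        }
    }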

Disclaimer - I'm a committer on the Jena project but have never done any work personally on the TDB component so this reflects my best understanding and may not be completely accurate.

answered Sep 20 '22 by RobV