My web searches didn't turn up anything useful, so maybe no one has done this yet. While I have done some processing of Freebase dumps and some work with RDF and ArangoDB, my experience is still very limited, and I'd like to hear opinions/suggestions/experiences on the topic.
A few things I'm wondering about:
- Has anyone ever imported a Freebase dump into ArangoDB?
- Is there a tool to help accomplish this?
- What would be a strategy to manually do this?
- Or maybe it's just a bad idea and shouldn't be done?
Some of the challenges I'd be expecting are:
- No ordering guarantee in the RDF data (AFAIK). Say I'm interested in a certain person: some information referenced by the /people/person instance may appear in the dump before the person itself, so I have to go through the dump a second time to find the referenced information
- In terms of storing the data, one could make a collection per type and add references between them, or save all properties in the top-level type one is interested in (per the schema, a /people/person includes /common/topic; from an OO perspective Freebase does multiple inheritance, which my language of choice (Java) does not support)
- One would likely have to go through the dump at least twice: once to collect and store the entities and their properties, and another time to add the graph edges between them (a minimal sketch of such a two-pass approach follows this list)
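To make the two-pass idea concrete, here is a minimal Java sketch. It assumes the tab-separated layout of the Freebase RDF dump (subject, predicate, object, trailing dot) and the `type.object.type` / `people.person` IRIs shown in the constants; verify both against your dump version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

public class TwoPassScan {

    // Assumed IRIs; check them against your dump version.
    static final String TYPE_PREDICATE = "<http://rdf.freebase.com/ns/type.object.type>";
    static final String PERSON_TYPE = "<http://rdf.freebase.com/ns/people.person>";

    public static void main(String[] args) throws IOException {
        Path dump = Path.of(args[0]); // the gzipped N-Triples dump

        // Pass 1: collect the subjects (MIDs) declared as /people/person.
        Set<String> subjects = new HashSet<>();
        try (BufferedReader in = open(dump)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length >= 3
                        && TYPE_PREDICATE.equals(parts[1])
                        && PERSON_TYPE.equals(parts[2])) {
                    subjects.add(parts[0]);
                }
            }
        }

        // Pass 2: gather all properties of those subjects, no matter where
        // they appear in the dump (this is what fixes the ordering problem).
        Map<String, List<String[]>> properties = new HashMap<>();
        try (BufferedReader in = open(dump)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length >= 3 && subjects.contains(parts[0])) {
                    properties.computeIfAbsent(parts[0], k -> new ArrayList<>())
                              .add(new String[] { parts[1], parts[2] });
                }
            }
        }
        System.out.println("persons found: " + subjects.size());
    }

    static BufferedReader open(Path gz) throws IOException {
        return new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gz)), StandardCharsets.UTF_8));
    }
}
```

Holding all property lists in memory won't scale to the full dump, of course; that's exactly where spilling partial objects to disk (as in the update below) comes in.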
Update
Currently, I go through the dump several times. The steps are roughly as follows:
- Split the 28 GB gzip (250 GB uncompressed) into smaller gzip files of 5M lines each, which results in about 550 files (a sketch of this step follows the list)
- Go through each of the files, look for the triples that declare a certain type, and store the subjects (Freebase namespace + MID) in one file per type I'm interested in
- (a) Go through each of the files again; since I now know the MIDs, I can assemble the full objects. These are held in memory as much as possible but persisted to disk, one JSON file per object (we can't be sure whether an object is complete until the entire dump has been processed)
- Go through all the files on disk and load them into ArangoDB
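For reference, the splitting step could look like the following in Java. This is a minimal sketch; the chunk size and the file naming are my choices, not requirements.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Split one large gzipped dump into re-gzipped chunks of LINES_PER_CHUNK lines.
public class SplitDump {
    static final int LINES_PER_CHUNK = 5_000_000;

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0])), StandardCharsets.UTF_8))) {
            int chunk = 0;
            long linesInChunk = 0;
            BufferedWriter out = nextChunk(chunk);
            String line;
            while ((line = in.readLine()) != null) {
                if (linesInChunk == LINES_PER_CHUNK) {
                    out.close(); // finish the current chunk, start the next one
                    out = nextChunk(++chunk);
                    linesInChunk = 0;
                }
                out.write(line);
                out.newLine();
                linesInChunk++;
            }
            out.close();
        }
    }

    static BufferedWriter nextChunk(int n) throws IOException {
        String name = String.format("chunk-%04d.nt.gz", n);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), StandardCharsets.UTF_8));
    }
}
```

On a Unix system, GNU split should achieve the same with something like `zcat dump.gz | split -l 5000000 --filter='gzip > $FILE.gz'`.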
It works, but it's slow, and it strikes me as inefficient to go through the dump this many times. And there will be more passes through the dump: during/after step (a) we discover many more entities that are related to the core entities I'm interested in.
And making millions of requests to the Freebase API won't be much better either.
So that's a bit of background on why I'm interested in this topic; a pre-made solution would be nice.
A similar thing has been done with data from Wikipedia in this project. I'm not aware of such a project for a Freebase dump, but a Freebase dump should be very similar to a Wikipedia dump, shouldn't it? The steps you would need to take are the following:
- Convert the data from Freebase into JSON, in the form in which you would like to store it in your ArangoDB instance (see the sketch below).
- Use arangoimp to do the import.
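A minimal sketch of the conversion step, assuming the per-subject property maps from the earlier passes and using Jackson for serialization (my choice, not a requirement). The `_key` handling and the arangoimp flags in the trailing comment are from the ArangoDB docs as I remember them, so verify them against your version:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToJsonl {
    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();

        // Stand-in for the per-subject property maps assembled earlier;
        // hardcoded here only to keep the sketch self-contained.
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("name", "Example Person");
        props.put("type", List.of("/people/person"));
        Map<String, Map<String, Object>> assembled = Map.of("m.0abc12", props);

        // One JSON document per line, which arangoimp can bulk-load.
        try (BufferedWriter out = Files.newBufferedWriter(Path.of("persons.json"))) {
            for (Map.Entry<String, Map<String, Object>> e : assembled.entrySet()) {
                Map<String, Object> doc = new LinkedHashMap<>(e.getValue());
                // Use the MID as document key; ArangoDB keys must not
                // contain '/', so the dotted form (m.0abc12) is used.
                doc.put("_key", e.getKey());
                out.write(mapper.writeValueAsString(doc));
                out.newLine();
            }
        }
        // Then import with something like (verify flags for your version):
        //   arangoimp --file persons.json --type json --collection persons --create-collection true
    }
}
```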