Extracting a subset of Freebase data for faster development iteration

Question

I have downloaded the 250 GB dump of Freebase data, but I don't want to iterate my development against the full data set. I want to extract a small subset of the data (maybe a small domain, or some 10 personalities and their information). This small subset will make my iterations faster and easier.

What's the best approach to partition the Freebase data? Is there any subset download provided by Google/Freebase?

Shawn Simister · Accepted Answer

This is feedback that we've gotten from many people using the data dumps. We're looking into how best to create such subsets. One approach would be to get all the data for a single domain like Film.

Here's how you'd get every RDF triple from the /film domain:

zgrep '\s<http://rdf\.freebase\.com/ns/film\.' freebase-rdf-{date}.gz | gzip > freebase-films.gz

The tricky part is that this subset won't contain the names, images or descriptions which you most likely also want. So you'll need to get those like this:

zgrep '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz

Then you'll possibly want to filter that subset down to only topic data about films (match only triples that start with the same /m ID) and concatenate that to the film subset.
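The filtering and concatenation step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a script from the Freebase team: the function name filter_by_subjects and the output file names are made up, and it assumes the dump is tab-separated with the subject (the /m ID) in the first field.

```shell
# Hypothetical helper (not from the original answer): print, gzip-compressed,
# only those triples of TOPICS.gz whose subject also occurs in SUBSET.gz.
# Assumes tab-separated triples with the subject in the first field.
# Usage: filter_by_subjects SUBSET.gz TOPICS.gz > OUT.gz
filter_by_subjects() {
    ids=$(mktemp) || return 1
    # Collect the distinct subjects of the subset.
    gzip -dc "$1" | cut -f1 | sort -u > "$ids"
    # While reading the ID file (NR == FNR), remember each subject;
    # then pass through only the topic triples whose subject was remembered.
    gzip -dc "$2" |
        awk -F'\t' 'NR == FNR { seen[$1] = 1; next } ($1 in seen)' "$ids" - |
        gzip
    rm -f "$ids"
}

# Example with the file names used above (the output names are placeholders):
#   filter_by_subjects freebase-films.gz freebase-topics.gz > freebase-film-topics.gz
#   cat freebase-films.gz freebase-film-topics.gz > freebase-film-subset.gz
```

Plain cat is enough for the last step because concatenated gzip files form a valid gzip stream. Note that awk keeps every subject ID of the subset in memory, so this fits a single domain better than the whole dump.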

It's all pretty straightforward to script this with regular expressions, but it's a lot more work than it should be. We're working on a better long-term solution.

Fredrik · Answer

I wanted to do a similar thing and I came up with the following command line.

gunzip -c freebase-rdf-{date}.gz | awk 'BEGIN { prev_1 = "" } { if (prev_1 != $1) { print "\n" } print $0; prev_1 = $1 }' | awk 'BEGIN { RS = "" } $0 ~ /type\.object\.type.*\/film\.film>/' > freebase-films.txt

It will give you all the triples for every subject that has the type film. (It assumes that the dump is sorted by subject, so that all triples for a subject are adjacent.)

After this you can simply grep for the predicates that you need.
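As an illustration of that last step, here is a minimal sketch that keeps only the triples with certain predicates. The helper name keep_predicates is made up for this example, and it assumes tab-separated triples with the predicate in the second field. It compares whole predicate strings exactly instead of using a regular expression, so no escaping of dots is needed.

```shell
# Hypothetical helper: print only the triples whose predicate (field 2)
# is exactly one of the given arguments.
# Usage: keep_predicates PREDICATE... < triples.txt
keep_predicates() {
    awk -F'\t' '
        BEGIN {
            # Treat every argument as a wanted predicate, then blank it out
            # so that awk reads the triples from standard input instead.
            for (i = 1; i < ARGC; i++) { want[ARGV[i]] = 1; ARGV[i] = "" }
        }
        ($2 in want)
    ' "$@"
}

# Example (the predicate URI follows the namespace pattern used above):
#   keep_predicates '<http://rdf.freebase.com/ns/type.object.name>' < freebase-films.txt
```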

Yevhen Tienkaiev · Answer

Just one remark on the accepted answer: the command for topics didn't work for me. The pattern uses extended regular expression syntax (the alternation in parentheses), so grep needs the -E option:

zgrep -E '\s<http://rdf\.freebase\.com/ns/(type\.object|common\.topic)' freebase-rdf-{date}.gz | gzip > freebase-topics.gz

Tags: freebase

Asked by nizam.sp
