Storage format in HDFS

How does HDFS store data?

I want to store huge files in a compressed fashion.

E.g.: I have a 1.5 GB file, with the default replication factor of 3.

It requires (1.5)*3 = 4.5 GB of space.

I believe currently no implicit compression of data takes place.

Is there a technique to compress the file and store it in HDFS to save disk space?

asked Jun 01 '12 by Uno

1 Answer

HDFS stores any file in a number of 'blocks'. The block size is configurable on a per-file basis, but has a default value (like 64/128/256 MB).

So given a 1.5 GB file and a block size of 128 MB, Hadoop would break the file up into 12 blocks (12 × 128 MB = 1.5 GB). Each block is also replicated a configurable number of times.
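For illustration only (not part of the original answer), here is a minimal Java sketch that lists the blocks HDFS actually created for a file, using the standard FileSystem.getFileBlockLocations API; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/bigfile.dat");   // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation corresponds to one HDFS block and lists the
            // datanodes holding its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }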

If your data compresses well (like text files), then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5 GB file compresses to 500 MB, it would be stored as 4 blocks.
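As a rough sketch of that 'compress, then store' approach (not the asker's code, and both paths are invented), the following uses Hadoop's GzipCodec to compress a local file on the fly while writing it into HDFS:

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressedUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem local = FileSystem.getLocal(conf);
            FileSystem hdfs = FileSystem.get(conf);

            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            Path src = new Path("file:///tmp/input.txt");   // hypothetical local file
            Path dst = new Path("/data/input.txt.gz");      // keep the .gz extension

            try (InputStream in = local.open(src);
                 OutputStream out = codec.createOutputStream(hdfs.create(dst))) {
                // Stream-copy the local file, compressing on the fly.
                IOUtils.copyBytes(in, out, conf);
            }
        }
    }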

However, one thing to consider when using compression is whether the compression method supports splitting the file - that is, can you seek to an arbitrary position in the file and recover the compressed stream (GZip, for example, does not support splitting; BZip2 does).
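If you want to check splittability programmatically, a small illustrative sketch (again, not from the original answer): in Hadoop's codec API, splittable codecs implement SplittableCompressionCodec, which BZip2Codec does and GzipCodec does not.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SplittableCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            List<Class<? extends CompressionCodec>> codecs =
                    Arrays.asList(GzipCodec.class, BZip2Codec.class);
            for (Class<? extends CompressionCodec> cls : codecs) {
                CompressionCodec codec = ReflectionUtils.newInstance(cls, conf);
                // Splittable codecs implement the SplittableCompressionCodec interface.
                boolean splittable = codec instanceof SplittableCompressionCodec;
                System.out.println(cls.getSimpleName() + " splittable=" + splittable);
            }
        }
    }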

Even if the method doesn't support splitting, Hadoop will still store the file in a number of blocks, but you'll lose some of the benefit of 'data locality': a non-splittable file has to be read as a single input split, while its blocks will most probably be spread around your cluster.

In your MapReduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files, for example), so you don't have to worry about whether the input / output needs to be decompressed / compressed.
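As an illustration of that extension-based detection (the file names below are just examples), CompressionCodecFactory maps a file name to one of the registered codecs, or returns null if no extension matches:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecLookup {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            for (String name : new String[] {"logs.txt.gz", "logs.txt.bz2", "logs.txt"}) {
                // getCodec inspects the file extension, just like the input formats do.
                CompressionCodec codec = factory.getCodec(new Path(name));
                System.out.println(name + " -> "
                        + (codec == null ? "no codec" : codec.getClass().getSimpleName()));
            }
        }
    }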

Hope this makes sense

EDIT: Some additional info in response to comments:

When writing to HDFS as output from a MapReduce job, see the API for FileOutputFormat, in particular the following methods (a short sketch follows the list):

  • setCompressOutput(Job, boolean)
  • setOutputCompressorClass(Job, Class)
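
A minimal, hypothetical job-setup sketch using those two calls to gzip the job's output (job name and paths are invented, and mapper/reducer setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compressed-output");

            FileInputFormat.addInputPath(job, new Path("/data/in"));    // hypothetical paths
            FileOutputFormat.setOutputPath(job, new Path("/data/out"));

            // Compress the job's output files with gzip.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }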

When uploading files to HDFS, yes, they should be pre-compressed, with the file extension associated with that compression type (out of the box, Hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file).

answered by Chris White