Storage format in HDFS

How does HDFS store data?

I want to store huge files in a compressed fashion.

E.g.: I have a 1.5 GB file, with the default replication factor of 3.

It requires (1.5)*3 = 4.5 GB of space.

I believe currently no implicit compression of data takes place.

Is there a technique to compress the file and store it in HDFS to save disk space?

asked Jun 01 '12 by Uno

1 Answer

HDFS stores any file in a number of 'blocks'. The block size is configurable on a per-file basis, but has a default value (like 64/128/256 MB).

So given a 1.5 GB file and a block size of 128 MB, Hadoop would break the file up into 12 blocks (12 × 128 MB = 1.5 GB). Each block is also replicated a configurable number of times.
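For illustration only (not part of the original answer), here is a minimal Java sketch that lists the blocks HDFS actually created for a file, using the standard FileSystem.getFileBlockLocations API; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/bigfile.dat");   // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation corresponds to one HDFS block and lists the
            // datanodes holding its replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }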

If your data compresses well (like text files), then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5 GB file compresses to 500 MB, it would be stored as 4 blocks.
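As a rough sketch of that 'compress, then store' approach (not the asker's code, and both paths are invented), the following uses Hadoop's GzipCodec to compress a local file on the fly while writing it into HDFS:

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressedUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem local = FileSystem.getLocal(conf);
            FileSystem hdfs = FileSystem.get(conf);

            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            Path src = new Path("file:///tmp/input.txt");   // hypothetical local file
            Path dst = new Path("/data/input.txt.gz");      // keep the .gz extension

            try (InputStream in = local.open(src);
                 OutputStream out = codec.createOutputStream(hdfs.create(dst))) {
                // Stream-copy the local file, compressing on the fly.
                IOUtils.copyBytes(in, out, conf);
            }
        }
    }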

However, one thing to consider when using compression is whether the compression method supports splitting the file - that is, can you seek to an arbitrary position in the file and recover the compressed stream (GZip, for example, does not support splitting; BZip2 does).
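If you want to check splittability programmatically, a small illustrative sketch (again, not from the original answer): in Hadoop's codec API, splittable codecs implement SplittableCompressionCodec, which BZip2Codec does and GzipCodec does not.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SplittableCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            List<Class<? extends CompressionCodec>> codecs =
                    Arrays.asList(GzipCodec.class, BZip2Codec.class);
            for (Class<? extends CompressionCodec> cls : codecs) {
                CompressionCodec codec = ReflectionUtils.newInstance(cls, conf);
                // Splittable codecs implement the SplittableCompressionCodec interface.
                boolean splittable = codec instanceof SplittableCompressionCodec;
                System.out.println(cls.getSimpleName() + " splittable=" + splittable);
            }
        }
    }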

Even if the method doesn't support splitting, Hadoop will still store the file in a number of blocks, but you'll lose some of the benefit of 'data locality': a non-splittable file has to be read as a single input split, while its blocks will most probably be spread around your cluster.

In your MapReduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files, for example), so you don't have to worry about whether the input / output needs to be decompressed / compressed.
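As an illustration of that extension-based detection (the file names below are just examples), CompressionCodecFactory maps a file name to one of the registered codecs, or returns null if no extension matches:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecLookup {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            for (String name : new String[] {"logs.txt.gz", "logs.txt.bz2", "logs.txt"}) {
                // getCodec inspects the file extension, just like the input formats do.
                CompressionCodec codec = factory.getCodec(new Path(name));
                System.out.println(name + " -> "
                        + (codec == null ? "no codec" : codec.getClass().getSimpleName()));
            }
        }
    }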

Hope this makes sense

EDIT: Some additional info in response to comments:

When writing to HDFS as output from a MapReduce job, see the API for FileOutputFormat, in particular the following methods (a short sketch follows the list):

  • setCompressOutput(Job, boolean)
  • setOutputCompressorClass(Job, Class)
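
A minimal, hypothetical job-setup sketch using those two calls to gzip the job's output (job name and paths are invented, and mapper/reducer setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compressed-output");

            FileInputFormat.addInputPath(job, new Path("/data/in"));    // hypothetical paths
            FileOutputFormat.setOutputPath(job, new Path("/data/out"));

            // Compress the job's output files with gzip.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }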

When uploading files to HDFS, yes, they should be pre-compressed, with the file extension associated with that compression type (out of the box, Hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file).

answered by Chris White