 

Is Snappy splittable or not splittable?

Tags:

hadoop

snappy

According to this Cloudera post, Snappy IS splittable.

For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.

But according to *Hadoop: The Definitive Guide*, Snappy is NOT splittable. (The book's table of compression formats lists "No" in Snappy's "Splittable" column.)

There is also conflicting information on the web: some sources say it is splittable, others say it is not.

Asked Sep 03 '15 by moon


People also ask

Which compression format is Splittable?

BZip2 is splittable in Hadoop. It provides a very good compression ratio, but compression is very CPU-intensive, so it is not optimal in terms of CPU time and performance. LZO is splittable in Hadoop: leveraging the external hadoop-lzo library, you get splittable compressed LZO files.

Is snappy better than GZip?

GZip compresses data about 30% more than Snappy, but reading GZip data uses roughly 2x the CPU of reading Snappy data. LZO focuses on fast decompression at low CPU usage, with higher compression at the cost of more CPU. For longer-term or static storage, GZip compression is still the better choice.

What is snappy compression Parquet?

By default, Big SQL uses SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT… SELECT commands, then SNAPPY compression is enabled by default.

Which compression is best in Hive?

The Snappy-compressed RCFile gave the best compression ratio (less than 2/5 that of any of the uncompressed formats) as well as the least disk space usage (figure 8).


2 Answers

Both are correct, but at different levels.

According to the Cloudera blog http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/:

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can't be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

This means that if a whole text file is compressed with Snappy then the file is NOT splittable. But if each record inside the file is compressed with Snappy then the file could be splittable, for example in Sequence files with block compression.

To be clearer, this is not the same:

<START-FILE>
  <START-SNAPPY-BLOCK>
     FULL CONTENT
  <END-SNAPPY-BLOCK>
<END-FILE>

as

<START-FILE>
  <START-SNAPPY-BLOCK1>
     RECORD1
  <END-SNAPPY-BLOCK1>
  <START-SNAPPY-BLOCK2>
     RECORD2
  <END-SNAPPY-BLOCK2>
  <START-SNAPPY-BLOCK3>
     RECORD3
  <END-SNAPPY-BLOCK3>
<END-FILE>

Snappy blocks are NOT splittable, but files made of Snappy blocks are splittable.
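The record-level layout above can be sketched as a simple length-prefixed container. Python's standard library has no Snappy codec, so zlib stands in for Snappy here purely to illustrate the framing; the point is that each record is compressed independently, so a reader can start decoding at any block boundary:

```python
import struct
import zlib

def write_container(records):
    """Compress each record independently and length-prefix it,
    mimicking block compression in a container format such as a
    Sequence File. (zlib stands in for Snappy here.)"""
    out = bytearray()
    for rec in records:
        block = zlib.compress(rec)
        out += struct.pack(">I", len(block))  # 4-byte big-endian block length
        out += block
    return bytes(out)

def read_from(container, offset):
    """Decode all records starting at any block boundary -- this is
    what makes the container splittable even though each individual
    compressed block is not."""
    records = []
    while offset < len(container):
        (length,) = struct.unpack_from(">I", container, offset)
        offset += 4
        records.append(zlib.decompress(container[offset:offset + length]))
        offset += length
    return records

data = [b"RECORD1", b"RECORD2", b"RECORD3"]
container = write_container(data)

# A "split" can begin at the second block's offset and still decode cleanly:
first_len = struct.unpack_from(">I", container, 0)[0]
print(read_from(container, 4 + first_len))  # [b'RECORD2', b'RECORD3']
```

A MapReduce input format does essentially the same thing at a larger scale: split points are aligned to block boundaries, and each task decompresses only its own blocks.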

Answered Sep 20 '22 by RojoSam


All splittable codecs in Hadoop must implement org.apache.hadoop.io.compress.SplittableCompressionCodec. Looking at the Hadoop source code as of version 2.7, org.apache.hadoop.io.compress.SnappyCodec does not implement this interface, so we know it is not splittable.
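That interface check can be sketched in Python. The Hadoop class names in the comments are real, but the Python classes below are stand-ins invented here only to illustrate the pattern of deciding splittability from the interface a codec implements (in Java the equivalent check would be `SplittableCompressionCodec.class.isAssignableFrom(codecClass)`):

```python
# Stand-in sketch of Hadoop's codec hierarchy -- not the real classes.
class CompressionCodec:  # org.apache.hadoop.io.compress.CompressionCodec
    pass

class SplittableCompressionCodec(CompressionCodec):  # the splittable marker interface
    pass

class BZip2Codec(SplittableCompressionCodec):  # implements the interface -> splittable
    pass

class SnappyCodec(CompressionCodec):  # does not implement it -> not splittable
    pass

def is_splittable(codec_cls):
    """A codec is splittable iff it implements SplittableCompressionCodec."""
    return issubclass(codec_cls, SplittableCompressionCodec)

print(is_splittable(BZip2Codec))   # True
print(is_splittable(SnappyCodec))  # False
```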

Answered Sep 21 '22 by qwwqwwq