How to avoid creation of .crc files when parquet files are created

Tags:

parquet

I am using the Parquet framework to write Parquet files. I create the Parquet writer with this constructor:

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// SchemaField and writeSupport(...) are my own helpers, defined elsewhere.
public class ParquetBaseWriter<T extends HashMap> extends ParquetWriter<T> {
    public ParquetBaseWriter(Path file, HashMap<String, SchemaField> mySchema,
                             CompressionCodecName compressionCodecName, int blockSize,
                             int pageSize) throws IOException {
        super(file, ParquetBaseWriter.<T>writeSupport(mySchema),
                compressionCodecName, blockSize, pageSize, DEFAULT_IS_DICTIONARY_ENABLED, false);
    }
}

Each time a Parquet file is created, a corresponding .crc file also gets created on disk. How can I avoid the creation of that .crc file? Is there a flag I have to set?

Thanks

Neha asked Oct 13 '14

People also ask

What is a .crc file in Spark?

Why do we need .crc and _SUCCESS files? Spark worker nodes write data in parallel, and these files act as checksums for validation and as job-completion markers. Writing everything to a single file would defeat the purpose of distributed computing, and that approach may fail if the resulting file is too large. A sketch of suppressing the _SUCCESS marker follows below.
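For illustration, here is a minimal sketch (my own, not from this page) of turning the _SUCCESS marker off in a Spark job via the standard Hadoop output-committer setting; the app name and master are placeholders:

import org.apache.spark.sql.SparkSession;

public class MarkerSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("marker-sketch").master("local[*]").getOrCreate();

        // Standard Hadoop committer setting: skip writing the _SUCCESS marker.
        spark.sparkContext().hadoopConfiguration()
                .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");

        // ... write a DataFrame as Parquet here; no _SUCCESS file will appear ...
        spark.stop();
    }
}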

Is Parquet already compressed?

If you've read about the Parquet format, you'll know that Parquet already compresses and encodes your data efficiently, employing delta encoding, run-length encoding, dictionary encoding, etc. A general-purpose codec such as Snappy or gzip can still be applied on top of those encodings; see the sketch below.
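To make that concrete, here is a hypothetical invocation of the ParquetBaseWriter from the question, choosing the on-disk codec that is layered on top of Parquet's encodings. The output path and the buildSchema() helper are placeholders of mine, not from the question:

import java.util.HashMap;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CodecSketch {
    public static void main(String[] args) throws Exception {
        HashMap<String, SchemaField> mySchema = buildSchema(); // placeholder helper
        // Snappy is applied per page, on top of dictionary/RLE/delta encoding.
        try (ParquetBaseWriter<HashMap> writer = new ParquetBaseWriter<>(
                new Path("file:///tmp/data.parquet"), mySchema,
                CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,    // 128 MB row groups by default
                ParquetWriter.DEFAULT_PAGE_SIZE)) {  // 1 MB pages by default
            // ... write records here; close() flushes the footer ...
        }
    }
}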

How are Parquet files created?

Parquet files are composed of a header, row groups, and a footer. Within each row group, values from the same column are stored together. This structure is well optimized both for fast query performance and for low I/O (minimizing the amount of data scanned). The sketch below shows the row groups being read back from the footer.
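As an illustration (my own sketch using the org.apache.parquet API, not something from this page; the file name is a placeholder), the footer can be read back to list each row group and its size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.from(new Path("data.parquet"), conf))) {
            // The footer holds the metadata for every row group (block).
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                System.out.println("row group: " + block.getRowCount()
                        + " rows, " + block.getTotalByteSize() + " bytes");
            }
        }
    }
}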

Is Parquet more compressed than CSV?

Parquet files take up much less disk space than CSVs (compare stored size on Amazon S3) and are faster to scan (less data needs to be read per query).


1 Answer

See this Google Groups discussion about the .crc files: https://groups.google.com/a/cloudera.org/forum/#!topic/cdk-dev/JR45MsLeyTE

TL;DR: the .crc files don't add any overhead in the NameNode (NN) namespace. They're not HDFS data files; they are checksum metadata files kept in the data directories. You will only see them on your local filesystem, i.e. when you use a "file:///" URI.
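If you do write through the local filesystem and want to suppress those side files, one commonly suggested workaround (my own sketch, not part of this answer; it relies on Hadoop's FileSystem cache handing the writer the same instance) is to disable checksum output before creating the writer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NoCrcSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("file:///tmp/data.parquet"); // placeholder output path

        // The local filesystem is a ChecksumFileSystem; telling it not to write
        // checksums suppresses the .crc side files. FileSystem instances are
        // cached per URI, so a writer created afterwards sees the same setting.
        FileSystem fs = out.getFileSystem(conf);
        fs.setWriteChecksum(false);

        // ... now create the ParquetBaseWriter against `out` as in the question ...
    }
}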

Idan Fischman answered Oct 06 '22