I am using the Parquet framework to write Parquet files. I create the Parquet writer with this constructor:
public class ParquetBaseWriter<T extends HashMap> extends ParquetWriter<T> {

    public ParquetBaseWriter(Path file, HashMap<String, SchemaField> mySchema,
                             CompressionCodecName compressionCodecName, int blockSize,
                             int pageSize) throws IOException {
        super(file, ParquetBaseWriter.<T>writeSupport(mySchema),
              compressionCodecName, blockSize, pageSize, DEFAULT_IS_DICTIONARY_ENABLED, false);
    }
}
Each time a Parquet file is created, a .crc file corresponding to it also gets created on disk. How can I avoid the creation of that .crc file? Is there a flag or something that I have to set?
Thanks
The first question is why we need the CRC and _SUCCESS files at all. Spark worker nodes write data in parallel, and these files act as checksums and job-completion markers for validation. Writing everything to a single file defeats the purpose of distributed computing, and that approach may fail if the resulting file is too large.
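If it is the _SUCCESS marker that bothers you, it can be switched off through the output-committer configuration. A minimal sketch, assuming you write through Spark's Java API (the SparkSession variable and paths below are just examples, not from the original question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("parquet-demo").getOrCreate();

// Tell the FileOutputCommitter not to create the _SUCCESS marker file.
spark.sparkContext().hadoopConfiguration()
     .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");

Dataset<Row> df = spark.read().json("input.json"); // example input path
df.write().parquet("output.parquet");              // no _SUCCESS file is written

Note that this only suppresses the marker file; the per-file .crc checksums are handled by the Hadoop filesystem layer, as discussed below.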
If you have read about the Parquet format, you know that Parquet already applies clever compression and encoding to your data, using delta encoding, run-length encoding, dictionary encoding, and so on.
Parquet files are composed of a header, row groups, and a footer. Within each row group, the values of the same column are stored together. This structure is well optimized both for fast query performance and for low I/O, since it minimizes the amount of data that has to be scanned.
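You can inspect that structure yourself through the footer metadata. A minimal sketch using parquet-hadoop's ParquetFileReader (the path "data.parquet" is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class InspectParquet {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("data.parquet"); // example path

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            ParquetMetadata footer = reader.getFooter();
            // Each BlockMetaData describes one row group and its column chunks.
            for (BlockMetaData rowGroup : footer.getBlocks()) {
                System.out.println("row group: " + rowGroup.getRowCount()
                        + " rows, " + rowGroup.getColumns().size() + " column chunks");
            }
        }
    }
}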
Parquet files take up much less disk space than CSVs and are faster to scan, so queries read far less data.
See this Google Groups discussion about the .crc files: https://groups.google.com/a/cloudera.org/forum/#!topic/cdk-dev/JR45MsLeyTE
TL;DR: the .crc files don't add any overhead in the NameNode (NN) namespace. They are not HDFS data files; they are metadata files kept in the data directories. You will see them in your local filesystem if you use the "file:///" URI.
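If you still want to keep them off your local disk, one common workaround is to disable checksum writing at the Hadoop FileSystem level, or to use RawLocalFileSystem, which never creates checksum files. This is a sketch of that idea, not an option exposed by ParquetWriter itself, and whether option 1 takes effect depends on Hadoop's FileSystem cache handing the same instance to the writer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path file = new Path("file:///tmp/example.parquet"); // example local path

// Option 1: ask the (Checksum)FileSystem not to write .crc files.
FileSystem fs = file.getFileSystem(conf);
fs.setWriteChecksum(false);

// Option 2: override the file:// implementation with the raw local
// filesystem, which skips checksum files entirely.
conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");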