How to avoid creation of .crc files when parquet files are created

Tags:

parquet

I am using the Parquet framework to write Parquet files. I create the Parquet writer with this constructor:

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// SchemaField and writeSupport(...) are my own helpers, defined elsewhere.
public class ParquetBaseWriter<T extends HashMap> extends ParquetWriter<T> {
    public ParquetBaseWriter(Path file, HashMap<String, SchemaField> mySchema,
                             CompressionCodecName compressionCodecName, int blockSize,
                             int pageSize) throws IOException {
        super(file, ParquetBaseWriter.<T>writeSupport(mySchema),
                compressionCodecName, blockSize, pageSize, DEFAULT_IS_DICTIONARY_ENABLED, false);
    }
}

Each time a Parquet file is created, a corresponding .crc file also gets created on disk. How can I avoid the creation of that .crc file? Is there a flag I have to set?

Thanks

Neha asked Oct 13 '14

People also ask

What is a .crc file in Spark?

Why do we need .crc and _SUCCESS files? Spark worker nodes write data in parallel, and these files act as checksums for validation and as job-completion markers. Writing everything to a single file would defeat the purpose of distributed computing, and that approach may fail if the resulting file is too large. A sketch of suppressing the _SUCCESS marker follows below.
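For illustration, here is a minimal sketch (my own, not from this page) of turning the _SUCCESS marker off in a Spark job via the standard Hadoop output-committer setting; the app name and master are placeholders:

import org.apache.spark.sql.SparkSession;

public class MarkerSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("marker-sketch").master("local[*]").getOrCreate();

        // Standard Hadoop committer setting: skip writing the _SUCCESS marker.
        spark.sparkContext().hadoopConfiguration()
                .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");

        // ... write a DataFrame as Parquet here; no _SUCCESS file will appear ...
        spark.stop();
    }
}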

Is Parquet already compressed?

If you've read about the Parquet format, you'll know that Parquet already compresses and encodes your data efficiently, employing delta encoding, run-length encoding, dictionary encoding, etc. A general-purpose codec such as Snappy or gzip can still be applied on top of those encodings; see the sketch below.
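To make that concrete, here is a hypothetical invocation of the ParquetBaseWriter from the question, choosing the on-disk codec that is layered on top of Parquet's encodings. The output path and the buildSchema() helper are placeholders of mine, not from the question:

import java.util.HashMap;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CodecSketch {
    public static void main(String[] args) throws Exception {
        HashMap<String, SchemaField> mySchema = buildSchema(); // placeholder helper
        // Snappy is applied per page, on top of dictionary/RLE/delta encoding.
        try (ParquetBaseWriter<HashMap> writer = new ParquetBaseWriter<>(
                new Path("file:///tmp/data.parquet"), mySchema,
                CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,    // 128 MB row groups by default
                ParquetWriter.DEFAULT_PAGE_SIZE)) {  // 1 MB pages by default
            // ... write records here; close() flushes the footer ...
        }
    }
}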

How are Parquet files created?

Parquet files are composed of a header, row groups, and a footer. Within each row group, values from the same column are stored together. This structure is well optimized both for fast query performance and for low I/O (minimizing the amount of data scanned). The sketch below shows the row groups being read back from the footer.
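As an illustration (my own sketch using the org.apache.parquet API, not something from this page; the file name is a placeholder), the footer can be read back to list each row group and its size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (ParquetFileReader reader = ParquetFileReader.open(
                HadoopInputFile.from(new Path("data.parquet"), conf))) {
            // The footer holds the metadata for every row group (block).
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                System.out.println("row group: " + block.getRowCount()
                        + " rows, " + block.getTotalByteSize() + " bytes");
            }
        }
    }
}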

Is Parquet more compressed than CSV?

Parquet files take up much less disk space than CSVs (compare stored size on Amazon S3) and are faster to scan (less data needs to be read per query).


1 Answer

See this Google Groups discussion about the .crc files: https://groups.google.com/a/cloudera.org/forum/#!topic/cdk-dev/JR45MsLeyTE

TL;DR: the .crc files don't add any overhead in the NameNode (NN) namespace. They're not HDFS data files; they are checksum metadata files kept in the data directories. You will only see them on your local filesystem, i.e. when you use a "file:///" URI.
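If you do write through the local filesystem and want to suppress those side files, one commonly suggested workaround (my own sketch, not part of this answer; it relies on Hadoop's FileSystem cache handing the writer the same instance) is to disable checksum output before creating the writer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NoCrcSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("file:///tmp/data.parquet"); // placeholder output path

        // The local filesystem is a ChecksumFileSystem; telling it not to write
        // checksums suppresses the .crc side files. FileSystem instances are
        // cached per URI, so a writer created afterwards sees the same setting.
        FileSystem fs = out.getFileSystem(conf);
        fs.setWriteChecksum(false);

        // ... now create the ParquetBaseWriter against `out` as in the question ...
    }
}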

Idan Fischman answered Oct 06 '22