 

How to output multiple S3 files in Parquet

Tags:

hadoop

parquet

Writing Parquet data can be done with something like the following. But if I want to write more than just one file, and specifically to output multiple S3 files so that reading a single column does not read all of the S3 data, how can this be done?

    // file is a Path and schema an Avro Schema defined elsewhere;
    // i is some int value.
    AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(file, schema);

    GenericData.Record record = new GenericRecordBuilder(schema)
            .set("name", "myname")
            .set("favorite_number", i)
            .set("favorite_color", "mystring")
            .build();
    writer.write(record);
    writer.close();

For example, what if I want to partition by a column value, so that all the records with a favorite_color of red go into one file and those with blue into another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are mentions of Spark, which uses something like

df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])

But I can find no equivalent to partitionBy in plain Java with Hadoop.

asked Feb 04 '17 by user782220

People also ask

Does S3 support Parquet?

Amazon S3 inventory gives you a flat-file list of your objects and metadata. You can get the S3 inventory in CSV, ORC, or Parquet format.

Can parquet file be split?

The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

What is parquet file in S3?

Parquet is a columnar storage file format, similar to ORC (Optimized Row Columnar), and is available to any project in the Hadoop ecosystem regardless of the choice of data processing framework, data model, or programming language.


1 Answer

In a typical MapReduce application, the number of output files equals the number of reduce tasks in your job. So if you want multiple output files, set the number of reduce tasks accordingly:

job.setNumReduceTasks(N);

or alternatively via the system property:

-Dmapreduce.job.reduces=N
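
The default HashPartitioner already spreads keys across those N reducers, but if you need a specific record-to-file mapping, say all red records in one file and all blue ones in another, you can also set a custom Partitioner. The following is a minimal sketch, not part of the original answer; it assumes the map output key is the favorite_color value as a Text:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes records to reducers by favorite_color, so each color lands
    // on its own reducer and therefore in its own output file.
    public class ColorPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text color, Text value, int numReduceTasks) {
            // Illustrative fixed mapping: red -> reducer 0, blue -> reducer 1.
            // Anything else is hashed; in a real job you would keep those
            // slots disjoint so other colors don't mix into the red/blue files.
            switch (color.toString()) {
                case "red":  return 0;
                case "blue": return 1 % numReduceTasks;
                default:
                    return (color.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            }
        }
    }

Register it on the job alongside the reducer count:

    job.setPartitionerClass(ColorPartitioner.class);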

I don't think it is possible to have one column per file with the Parquet format. Internally, a Parquet file is split first into row groups, and only within each row group is the data then split by column.

Parquet format
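
If the goal is the partitionBy-style layout from the question without MapReduce, one approach is to keep one Parquet writer per partition value and route each record to the writer for its favorite_color, producing a Hive-style directory layout. This is a hedged sketch only; the PartitionedWriter class, the part-0.parquet file name, and the s3a base path are illustrative assumptions, not something from the original answer:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class PartitionedWriter {
        private final Schema schema;
        private final String baseDir; // e.g. "s3a://bucket/my_file" (assumed path)
        private final Map<String, ParquetWriter<GenericRecord>> writers = new HashMap<>();

        PartitionedWriter(Schema schema, String baseDir) {
            this.schema = schema;
            this.baseDir = baseDir;
        }

        void write(GenericRecord record) throws IOException {
            String color = record.get("favorite_color").toString();
            ParquetWriter<GenericRecord> writer = writers.get(color);
            if (writer == null) {
                // Hive-style partition directory: .../favorite_color=red/part-0.parquet
                Path path = new Path(baseDir + "/favorite_color=" + color + "/part-0.parquet");
                writer = AvroParquetWriter.<GenericRecord>builder(path)
                        .withSchema(schema)
                        .build();
                writers.put(color, writer);
            }
            writer.write(record);
        }

        void close() throws IOException {
            for (ParquetWriter<GenericRecord> w : writers.values()) {
                w.close();
            }
        }
    }

Readers that understand Hive-style partitioning (Spark, Hive, Presto/Athena) can then prune on favorite_color and read only the matching directories.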

answered Oct 09 '22 by andresp