How to output multiple S3 files in Parquet

Tags: hadoop, parquet

Parquet data can be written with something like the code below. But how can I write to more than one file, and in particular split the output across multiple S3 files, so that reading a single column does not require reading all of the S3 data?

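    // Fragment: assumes `file` (org.apache.hadoop.fs.Path), `schema`
    // (org.apache.avro.Schema) and the value `i` are defined elsewhere.
    AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(file, schema);

    GenericData.Record record = new GenericRecordBuilder(schema)
            .set("name", "myname")
            .set("favorite_number", i)
            .set("favorite_color", "mystring")
            .build();
    writer.write(record);
    writer.close(); // flushes buffered data and writes the Parquet footer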

For example, what if I want to partition by a column value, so that all records with a favorite_color of red go to one file and those with blue go to another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are references to Spark, using something like

    df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])

But I can find no equivalent to partitionBy in plain Java with Hadoop.

asked Feb 04 '17 00:02

user782220

1 Answer

In a typical MapReduce application, the number of output files is the same as the number of reducers in your job. So if you want multiple output files, set the number of reducers accordingly:

    job.setNumReduceTasks(N);

or alternatively set the configuration property on the command line (this is picked up when the driver runs through ToolRunner / GenericOptionsParser):

    -Dmapreduce.job.reduces=N
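
To see what the reducer count does and does not buy you, here is a minimal plain-Java sketch of how keys get spread over N reducers. It uses the same formula as Hadoop's default HashPartitioner, but with String.hashCode() standing in for a Writable key such as Text, so the reducer indices in a real job would differ. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: how hash partitioning spreads keys across N reducers (one output file each). */
final class ReducerAssignmentSketch {

    /** Same formula as Hadoop's default HashPartitioner; String is a stand-in for a Writable key. */
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Groups keys by the reducer index they would be sent to. */
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            byReducer.computeIfAbsent(partitionFor(key, numReduceTasks), r -> new ArrayList<>())
                     .add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> colors = List.of("red", "blue", "red", "green", "blue");
        assign(colors, 3).forEach((reducer, keys) ->
                System.out.println("reducer " + reducer + " <- " + keys));
    }
}
```

A given color always lands on the same reducer, but nothing stops two colors from sharing one, so N reducers gives you N files rather than one file per color. For a strict one-value-per-file split you would likely need a custom Partitioner or something like MultipleOutputs.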

I don't think it is possible to have one column per file with the Parquet format. Internally, a Parquet file is first split into row groups, and only within each row group is the data then split by column.

[Figure: Parquet format (https://i.stack.imgur.com/Wa2Fm.jpg)]
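
If the goal is the Spark-style partitionBy layout rather than one column per file, one option is to do the routing yourself: bucket records by the column value and write each bucket under a column=value directory, which is the layout Spark's partitionBy produces. The sketch below only shows the routing, using plain maps as stand-ins for GenericRecord. The class name, the bucket URI and the part-00000.parquet file name are assumptions, and in real code each path would presumably get its own AvroParquetWriter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: route records to one output path per distinct value of a column. */
final class PartitionByColumnSketch {

    /** Builds a Hive-style partition path, e.g. base/favorite_color=red/part-00000.parquet. */
    static String partitionPath(String baseDir, String column, String value) {
        // The file name is a hypothetical placeholder, not something a library generates here.
        return baseDir + "/" + column + "=" + value + "/part-00000.parquet";
    }

    /** Buckets records by output path; each bucket would be written to its own file. */
    static Map<String, List<Map<String, String>>> bucketByColumn(
            List<Map<String, String>> records, String baseDir, String column) {
        Map<String, List<Map<String, String>>> buckets = new LinkedHashMap<>();
        for (Map<String, String> record : records) {
            String path = partitionPath(baseDir, column, record.get(column));
            buckets.computeIfAbsent(path, p -> new ArrayList<>()).add(record);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Plain maps stand in for Avro GenericRecord instances.
        List<Map<String, String>> records = List.of(
                Map.of("name", "a", "favorite_color", "red"),
                Map.of("name", "b", "favorite_color", "blue"),
                Map.of("name", "c", "favorite_color", "red"));
        // "s3a://my-bucket/people" is a hypothetical location.
        bucketByColumn(records, "s3a://my-bucket/people", "favorite_color")
                .forEach((path, rows) ->
                        System.out.println(path + " <- " + rows.size() + " record(s)"));
    }
}
```

Engines that understand this directory layout, such as Spark or Hive, can then skip whole directories when a query filters on favorite_color, which is where the cost saving comes from.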

answered Oct 09 '22 04:10

andresp
