
How to write Avro to multiple output directories using Spark

Hi, there is a topic about writing text data to multiple output directories in one Spark job using MultipleTextOutputFormat:

Write to multiple outputs by key Spark - one Spark job

Is there a similar way to write Avro data to multiple directories?

What I want is to write the data from an Avro file to different directories based on the timestamp field, so that records whose timestamps fall on the same day go to the same directory.

Tom asked Oct 30 '22 16:10



1 Answer

The AvroMultipleOutputs class simplifies writing Avro output data to multiple outputs.

  • Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own Schema and OutputFormat.

  • Case two: writing data to different files whose paths are provided by the user.

AvroMultipleOutputs supports counters; they are disabled by default. The counters group is the AvroMultipleOutputs class name, and the counter names are the same as the output names. They count the number of records written to each output name.
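For the per-day layout asked about in the question, case two is the relevant one: each record needs a base output path derived from its timestamp. The following is a minimal sketch of that derivation only, using the standard library. It assumes the timestamp field holds epoch milliseconds and that days are cut in UTC; the class name DayPartitioner, the method names, and the "/part" file prefix are hypothetical. In a real job the resulting string would be the kind of value passed as the base output path when writing through AvroMultipleOutputs.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: turns a record timestamp into a per-day base output path.
final class DayPartitioner {
    // Assumption: days are cut in UTC. Change the zone if local days are wanted.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("uuuu-MM-dd").withZone(ZoneOffset.UTC);

    // Assumption: the timestamp field is epoch milliseconds.
    // Returns a relative path such as "<day>/part", so each day gets its own directory.
    static String baseOutputPath(long epochMillis) {
        return DAY.format(Instant.ofEpochMilli(epochMillis)) + "/part";
    }

    // Groups timestamps by the directory they would be written to.
    static Map<String, List<Long>> groupByDay(List<Long> timestamps) {
        Map<String, List<Long>> byPath = new TreeMap<>();
        for (long ts : timestamps) {
            byPath.computeIfAbsent(baseOutputPath(ts), k -> new ArrayList<>()).add(ts);
        }
        return byPath;
    }

    public static void main(String[] args) {
        List<Long> sample = Arrays.asList(0L, 86_399_999L, 86_400_000L);
        // Print each target path with the timestamps that land in it.
        groupByDay(sample).forEach((path, ts) -> System.out.println(path + " -> " + ts));
    }
}
```

Keeping the path derivation in one small pure function makes the day boundary easy to unit test separately from the Spark or Hadoop job.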

Also have a look at:

  • MultipleOutputer
  • MultipleOutputsFormatTest, which has a code example with a unit test case. For some reason MultipleOutputs does not work with Avro, but the near-identical AvroMultipleOutputs does. These obviously related classes have no common ancestor, so they are combined under the MultipleOutputer type class, which at least allows for future extension.
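The idea of putting two unrelated but near-identical writers behind one type can be sketched as follows. This is not the linked project's code: Java has no type classes, so the sketch uses a plain adapter interface instead, and TextOutputsStub and AvroOutputsStub are hypothetical stand-ins for the real Hadoop and Avro classes so that the example runs with the standard library alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text writer; records what would be written.
final class TextOutputsStub {
    final List<String> written = new ArrayList<>();
    void write(String key, String value, String basePath) {
        written.add(basePath + ":" + key + "=" + value);
    }
    void close() { written.add("closed"); }
}

// Hypothetical stand-in for an Avro writer; same methods, no shared supertype.
final class AvroOutputsStub {
    final List<String> written = new ArrayList<>();
    void write(String key, String value, String basePath) {
        written.add(basePath + ":" + key + "=" + value);
    }
    void close() { written.add("closed"); }
}

// Common interface that callers program against.
interface MultipleOutputer {
    void write(String key, String value, String basePath);
    void close();

    // Adapter for the text stand-in.
    static MultipleOutputer of(TextOutputsStub out) {
        return new MultipleOutputer() {
            public void write(String k, String v, String p) { out.write(k, v, p); }
            public void close() { out.close(); }
        };
    }

    // Adapter for the Avro stand-in.
    static MultipleOutputer of(AvroOutputsStub out) {
        return new MultipleOutputer() {
            public void write(String k, String v, String p) { out.write(k, v, p); }
            public void close() { out.close(); }
        };
    }
}
```

Code that writes records then depends only on MultipleOutputer, and supporting another writer later means adding one more adapter.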
Ram Ghadiyaram answered Nov 15 '22 10:11
