JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)


I need to convert messages that arrive as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. Everything I have found so far does the conversion with Hive, Pig, or Spark. I need to do it in plain Java, without any of those tools.

505

asked Oct 04 '16 17:10

vijju

1 Answer

To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.

To convert your JSON, you need to convert the records to in-memory Avro objects and pass those to Parquet. You don't need to write an Avro file first and then convert that file to Parquet.

Conversion to Avro objects is already done for you: see Kite's JsonUtil, which is also available as a ready-to-use file reader (JSONFileReader). The conversion method needs an Avro schema, but you can use the same library to infer one from the JSON data.
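
To make the inference step more concrete, here is a minimal sketch in plain Java of how an Avro schema could be inferred from a sample of records. It is not Kite's implementation. The class SchemaInferenceSketch and its methods are hypothetical, the records are assumed to be already parsed into Maps, field names are assumed to be valid Avro names, and only flat records with primitive values are handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Kite, Avro, or Parquet.
final class SchemaInferenceSketch {

    // Maps an already-parsed JSON value to an Avro primitive type name.
    static String avroType(Object value) {
        if (value == null) return "null";
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Number) return "double";
        if (value instanceof String) return "string";
        // Nested objects and arrays are out of scope for this sketch.
        throw new IllegalArgumentException("Unsupported value: " + value);
    }

    // Renders one type as "t" and several types as an Avro union ["a","b"].
    static String typeJson(List<String> types) {
        if (types.size() == 1) return "\"" + types.get(0) + "\"";
        return "[\"" + String.join("\",\"", types) + "\"]";
    }

    // Infers a record schema (as Avro schema JSON) from the first sampleSize records.
    static String inferSchema(List<Map<String, Object>> records, String recordName, int sampleSize) {
        int limit = Math.min(sampleSize, records.size());
        // Field name -> distinct types seen, in first-seen order.
        Map<String, List<String>> typesByField = new LinkedHashMap<>();
        for (int i = 0; i < limit; i++) {
            for (Map.Entry<String, Object> e : records.get(i).entrySet()) {
                List<String> types = typesByField.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                String type = avroType(e.getValue());
                if (!types.contains(type)) types.add(type);
            }
        }
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"record\",\"name\":\"").append(recordName).append("\",\"fields\":[");
        boolean first = true;
        for (Map.Entry<String, List<String>> e : typesByField.entrySet()) {
            List<String> types = e.getValue();
            // A field missing from any sampled record is treated as nullable.
            for (int i = 0; i < limit; i++) {
                if (!records.get(i).containsKey(e.getKey()) && !types.contains("null")) types.add("null");
            }
            // Put "null" first in unions, following the usual Avro convention.
            if (types.remove("null")) types.add(0, "null");
            if (!first) json.append(',');
            first = false;
            json.append("{\"name\":\"").append(e.getKey())
                .append("\",\"type\":").append(typeJson(types)).append('}');
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("id", 1L);
        a.put("name", "Ada");
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("id", 2L);
        b.put("name", "Bob");
        b.put("score", 9.5); // absent from the first record, so inferred as nullable
        System.out.println(inferSchema(List.of(a, b), "User", 20));
    }
}
```

The sample size plays the same role as the last argument of the inferSchema call in the answer's code: only the first few records are inspected, so a field that first appears later in the file would be missed by a sketch like this one.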

To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:

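import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData.Record;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.kitesdk.data.spi.JsonUtil;
import org.kitesdk.data.spi.filesystem.JSONFileReader;

// Assumed to exist already: fs (a Hadoop FileSystem), source (Path of the JSON
// input), and outputPath (Path of the Parquet file to create).

// Infer an Avro schema by sampling records from the JSON input.
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
                    fs.open(source), jsonSchema, Record.class)) {

  reader.initialize();

  // Write each in-memory Avro record straight to Parquet.
  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}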

132

answered Sep 30 '22 18:09

blue


Tags: java, json, hadoop, parquet
