
JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

I have a scenario where I need to convert messages, available as JSON objects, to Apache Parquet format using Java. Any sample code or examples would be helpful. From what I have found so far, converting messages to Parquet is usually done with Hive, Pig, or Spark. I need to do the conversion using only Java, without involving any of these.

vijju asked Oct 04 '16 17:10



1 Answer

To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from other formats, like Avro and Thrift. The idea is that Parquet works natively with the objects your applications probably already use.

To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet, but you don't need to convert a file to Avro and then to Parquet.

Conversion to Avro objects is already implemented for you in Kite's JsonUtil, and it is available as a ready-to-use file reader (JSONFileReader). The conversion method needs an Avro schema, but you can use the same library to infer an Avro schema from the JSON data.

To write those records, you just need to use AvroParquetWriter. The whole setup looks like this:

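import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData.Record;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.kitesdk.data.spi.JsonUtil;
import org.kitesdk.data.spi.filesystem.JSONFileReader;

// Assumes fs (a Hadoop FileSystem), source (input Path) and outputPath (output Path) are already defined.
// Infer an Avro schema by sampling records from the JSON input (here, 20 records).
Schema jsonSchema = JsonUtil.inferSchema(fs.open(source), "RecordName", 20);
try (JSONFileReader<Record> reader = new JSONFileReader<>(
                    fs.open(source), jsonSchema, Record.class)) {

  reader.initialize();

  // Write each in-memory Avro record straight to Parquet; no Avro file is created.
  try (ParquetWriter<Record> writer = AvroParquetWriter
      .<Record>builder(outputPath)
      .withConf(new Configuration())
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .withSchema(jsonSchema)
      .build()) {
    for (Record record : reader) {
      writer.write(record);
    }
  }
}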
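
If it helps to picture what "inferring a schema from JSON data" means, here is a minimal sketch using only the JDK. It is not Kite's JsonUtil and not its algorithm; the class SchemaInferenceSketch and the methods avroTypeOf and inferFields are hypothetical names made up for this illustration. It assumes the JSON has already been parsed into a flat Map and only maps each value to the name of the Avro type it would most plausibly get.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical helper, NOT part of Kite, Avro or Parquet.
public class SchemaInferenceSketch {

    // Map one already-parsed JSON value to the name of an Avro type.
    static String avroTypeOf(Object value) {
        if (value == null) return "null";
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "long";
        if (value instanceof Float) return "float";
        if (value instanceof Double) return "double";
        if (value instanceof String) return "string";
        if (value instanceof List) return "array";
        if (value instanceof Map) return "record";
        throw new IllegalArgumentException("Unsupported value: " + value.getClass());
    }

    // Infer field name -> Avro type name for one flat record, keeping field order.
    static Map<String, String> inferFields(Map<String, Object> record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            fields.put(entry.getKey(), avroTypeOf(entry.getValue()));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for one parsed JSON message (assumed sample data).
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 42L);
        record.put("name", "example");
        record.put("active", true);
        record.put("score", 3.5);
        System.out.println(inferFields(record));
        // prints {id=long, name=string, active=boolean, score=double}
    }
}
```

A real inference step also has to look at several records and reconcile them (for example, a field that is sometimes missing becomes nullable), which is presumably why the inferSchema call above takes a sample size.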
blue answered Sep 30 '22 18:09