
Is it possible to read and write Parquet using Java without a dependency on Hadoop and HDFS?

I've been hunting around for a solution to this question.

It appears to me that there is no way to embed reading and writing Parquet format in a Java program without pulling in dependencies on HDFS and Hadoop. Is this correct?

I want to read and write on a client machine, outside of a Hadoop cluster.

I started to get excited about Apache Drill, but it appears that it must run as a separate process. What I need is an in-process ability to read and write a file using the Parquet format.

asked Feb 06 '17 by Jesse


People also ask

Can Java read Parquet files?

Yes. There are libraries that read Parquet files directly into Java objects.
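For example, the parquet-avro module reads records back as Avro GenericRecords. A minimal sketch (the file path is illustrative, and exact builder signatures vary a little between Parquet versions):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // Open a local Parquet file and iterate over its records.
        try (ParquetReader<GenericRecord> reader =
                AvroParquetReader.<GenericRecord>builder(new Path("/tmp/parquet/data.parquet")).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}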

Is Parquet only for Hadoop?

No. You don't need HDFS or a Hadoop cluster to consume Parquet files. There are several ways to read them; Apache Spark is one option.
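For instance, a local-mode Spark session reads Parquet with no cluster at all. A sketch assuming the spark-sql dependency is on the classpath (class name and file path are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkParquetRead {
    public static void main(String[] args) {
        // Local-mode session; no Hadoop or Spark cluster required.
        SparkSession spark = SparkSession.builder()
                .appName("parquet-example")
                .master("local[*]")
                .getOrCreate();

        // Load the Parquet file into a DataFrame and print it.
        Dataset<Row> df = spark.read().parquet("/tmp/parquet/data.parquet");
        df.show();

        spark.stop();
    }
}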

How is Parquet stored in HDFS?

Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression. It is especially good for queries which read particular columns from a “wide” (with many columns) table, since only needed columns are read and IO is minimized.
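In plain Java, parquet-avro exposes this column pruning through a requested projection. A sketch assuming the Pojo schema from the answer below (only the projected columns are decoded; the path and column names are illustrative):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionExample {
    public static void main(String[] args) throws Exception {
        // Projection schema listing only the columns we want back.
        Schema projection = SchemaBuilder.record("Pojo").namespace("com.xx.test")
                .fields()
                .optionalInt("id")
                .optionalString("name")
                .endRecord();

        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        // Only the "id" and "name" column chunks are read and decoded.
        try (ParquetReader<GenericRecord> reader =
                AvroParquetReader.<GenericRecord>builder(new Path("/tmp/parquet/data.parquet"))
                        .withConf(conf)
                        .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record.get("id") + " " + record.get("name"));
            }
        }
    }
}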


1 Answer

You can write the Parquet format outside a Hadoop cluster using the Java Parquet client API (the parquet-avro module). Note that the writer still uses Hadoop classes such as org.apache.hadoop.fs.Path, so the Hadoop client libraries must be on the classpath, but no running Hadoop or HDFS cluster is required.

Here is sample Java code that writes the Parquet format to local disk.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class Test {
    void test() throws IOException {
        // Parse the Avro schema describing the records to be written.
        final String schemaLocation = "/tmp/avro_format.json";
        final Schema avroSchema = new Schema.Parser().parse(new File(schemaLocation));

        // Convert the Avro schema into the equivalent Parquet message type.
        final MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
        final WriteSupport<GenericRecord> writeSupport =
                new AvroWriteSupport(parquetSchema, avroSchema);

        // The target is a plain local path; no HDFS cluster is involved.
        final String parquetFile = "/tmp/parquet/data.parquet";
        final Path path = new Path(parquetFile);

        try (ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<>(
                path, writeSupport, CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE)) {
            final GenericRecord record = new GenericData.Record(avroSchema);
            record.put("id", 1);
            record.put("age", 10);
            record.put("name", "ABC");
            record.put("place", "BCD");
            parquetWriter.write(record);
        }
    }
}

avro_format.json:

{
   "type":"record",
   "name":"Pojo",
   "namespace":"com.xx.test",
   "fields":[
      {
         "name":"id",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"age",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"name",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"place",
         "type":[
            "string",
            "null"
         ]
      }
   ]
}
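For reference: with a typical Maven or Gradle setup, this example needs the org.apache.parquet:parquet-avro artifact plus the Hadoop client jars (which supply classes like org.apache.hadoop.fs.Path). The Hadoop dependency is library-only; nothing has to be running on the machine.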

Hope this helps.

answered Oct 27 '22 by Krishas