create parquet files in java

Tags:

java

parquet

Is there a way to create parquet files from java?

I have data in memory (java classes) and I want to write it into a parquet file, to later read it from apache-drill.

Is there a simple way to do this, like inserting data into a SQL table?

GOT IT

Thanks for the help.

Combining the answers and this link, I was able to create a parquet file and read it back with drill.

Imbar M. asked Sep 27 '16 15:09


People also ask

What is Parquet file in Java?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet libraries are available for multiple languages, including Java, C++, and Python.

How are Parquet files written?

Data in a Parquet file is partitioned into row groups. Each row group consists of column chunks, one for each column in the dataset. The data in each column chunk is then written as a sequence of pages.

Which is faster, Parquet or CSV?

Parquet files are generally fast to load and save. Unlike CSV, they cannot be read directly in a text editor, because the data is stored in a binary, columnar format.


2 Answers

The constructors of ParquetWriter are deprecated (as of 1.8.1), but ParquetWriter itself is not. You can still create a ParquetWriter by extending the abstract Builder class nested inside it.

Here is an example from the Parquet project itself, ExampleParquetWriter:

  public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
      super(file);
    }

    public Builder withType(MessageType type) {
      this.type = type;
      return this;
    }

    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
      this.extraMetaData = extraMetaData;
      return this;
    }

    @Override
    protected Builder self() {
      return this;
    }

    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
      return new GroupWriteSupport(type, extraMetaData);
    }

  }

If you don't want to use Group and GroupWriteSupport (bundled with Parquet, but intended only as an example data-model implementation), you can use the Avro, Protocol Buffers, or Thrift in-memory data models. Here is an example of writing Parquet using Avro:

try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}   

You will need these dependencies:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>

Full example here.

MaxNevermind answered Oct 11 '22 18:10


A few possible ways to do it:

  • Use the Java Parquet library to write Parquet directly from your code.
  • Connect to Hive or Impala using JDBC and insert the data using SQL. Note that inserting rows one by one creates a separate file for each record and ruins performance. You would need to insert many rows at once, which is not trivial, so I don't recommend this approach.
  • Save the data to a delimited text file, then do the following steps in either Hive or Impala:
    • Define a table over the text file to allow Hive/Impala to read the data. Let's call this table text_table. See Impala's Create Table Statement for details.
    • Create a new table with identical columns but specifying Parquet as its file format. Let's call this table parquet_table.
    • Finally, run insert into parquet_table select * from text_table to copy all data from the text file to the Parquet table.
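
For the third option, the "save the data to a delimited text file" step could look roughly like the sketch below. It uses only the Java standard library. The User class, the tab delimiter, and the temp-file location are assumptions made for illustration, not something from the question. Whatever delimiter you pick has to match the one declared for text_table, and the file still has to be copied to wherever Hive/Impala reads it from.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class DelimitedExport {

    // Hypothetical in-memory class standing in for the asker's data.
    public static final class User {
        public final String name;
        public final int age;

        public User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Assumption: tab-delimited. text_table must declare the same delimiter,
    // e.g. ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'.
    public static final char DELIMITER = '\t';

    // Turns one object into one line. There is no quoting or escaping here,
    // so values containing the delimiter or a newline are rejected outright.
    public static String toLine(User u) {
        if (u.name.indexOf(DELIMITER) >= 0 || u.name.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("field contains delimiter or newline: " + u.name);
        }
        return u.name + DELIMITER + u.age;
    }

    // Writes all objects to the target file, one line per object, as UTF-8.
    public static void write(List<User> users, Path target) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (User u : users) {
                out.write(toLine(u));
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("users", ".tsv");
        write(Arrays.asList(new User("alice", 30), new User("bob", 25)), target);
        System.out.println("Wrote " + target);
    }
}
```

Rejecting fields that contain the delimiter keeps the sketch short. Real data would need an escaping scheme that the table definition also understands.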
Zoltan answered Oct 11 '22 18:10