Is there a way to create Parquet files from Java? I have data in memory (Java classes) and I want to write it into a Parquet file, to later read it from Apache Drill. Is there a simple way to do this, like inserting data into a SQL table?

GOT IT: Thanks for the help. Combining the answers and this link, I was able to create a Parquet file and read it back with Drill.


create parquet files in java

2 Answers

ParquetWriter's constructors are deprecated (as of 1.8.1), but ParquetWriter itself is not. You can still create a ParquetWriter by extending the abstract Builder class nested inside it.

Here is an example from the Parquet creators themselves, taken from ExampleParquetWriter:

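import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;

  // Nested inside ExampleParquetWriter; builds a ParquetWriter<Group>.
  public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
      super(file);
    }

    // Schema of the records that will be written.
    public Builder withType(MessageType type) {
      this.type = type;
      return this;
    }

    // Optional key/value metadata stored in the file footer.
    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
      this.extraMetaData = extraMetaData;
      return this;
    }

    @Override
    protected Builder self() {
      return this;
    }

    // Tells the writer how to turn a Group into Parquet records.
    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
      return new GroupWriteSupport(type, extraMetaData);
    }
  }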

If you don't want to use Group and GroupWriteSupport (bundled in Parquet, but intended only as an example of a data-model implementation), you can go with the Avro, Protocol Buffers, or Thrift in-memory data models. Here is an example of writing Parquet using Avro:

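import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// fileToWrite is an org.apache.hadoop.fs.Path, schema is an org.apache.avro.Schema,
// and recordsToWrite is a collection of GenericData.Record built against that schema.
try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    // try-with-resources closes the writer, which finishes writing the file.
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}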

You will need these dependencies:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>

Full example here.


answered Oct 11 '22 18:10

MaxNevermind

A few possible ways to do it:

- Use the Java Parquet library to write Parquet directly from your code.
- Connect to Hive or Impala using JDBC and insert the data using SQL. Note that inserting rows one by one produces a separate file for each record, which ruins performance. You should insert many rows at once, which is not trivial, so I don't recommend this approach.
- Save the data to a delimited text file, then do the following steps in either Hive or Impala:
  - Define a table over the text file to allow Hive/Impala to read the data. Let's call this table text_table. See Impala's Create Table Statement for details.
  - Create a new table with identical columns, but specify Parquet as its file format. Let's call this table parquet_table.
  - Finally, run insert into parquet_table select * from text_table to copy all the data from the text file to the Parquet table.
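For the delimited text file option, a minimal sketch of the Java side could look like the following. It uses only the standard library. The Person class, the people.tsv file name, and the choice of tab as the delimiter are all assumptions made for illustration; whatever delimiter you pick has to match the one declared for text_table.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class DelimitedExport {

    // Hypothetical in-memory class standing in for your own data.
    public static final class Person {
        final long id;
        final String name;
        final int age;

        public Person(long id, String name, int age) {
            this.id = id;
            this.name = name;
            this.age = age;
        }
    }

    // Plain delimited text has no quoting, so tabs and line breaks inside a
    // value are replaced with spaces. This policy is an assumption; null is
    // written as an empty field, which is also an assumption.
    public static String clean(String value) {
        return value == null ? "" : value.replaceAll("[\\t\\r\\n]", " ");
    }

    // One record per line, fields separated by a tab.
    public static String toLine(Person p) {
        return p.id + "\t" + clean(p.name) + "\t" + p.age;
    }

    public static void write(Path target, List<Person> people) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Person p : people) {
                out.write(toLine(p));
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<Person> people = Arrays.asList(
                new Person(1, "Ann", 30),
                new Person(2, "Bob\tJr", 41));
        write(Paths.get("people.tsv"), people);
    }
}
```

The resulting file still has to be placed where Hive or Impala can read it before text_table is defined over it. Writing everything into one file also avoids the one-file-per-record problem mentioned for the JDBC approach.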

answered Oct 11 '22 18:10

Zoltan


Tags:

java

parquet

Imbar M.
