Is there a way to create parquet files from java?
I have data in memory (Java classes) and I want to write it into a Parquet file, to later read it from Apache Drill.
Is there a simple way to do this, like inserting data into a SQL table?
GOT IT
Thanks for the help.
Combining the answers and this link, I was able to create a Parquet file and read it back with Drill.
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages, including Java, C++, and Python.
Each block in a Parquet file is stored in the form of a row group, so the data in a Parquet file is partitioned into multiple row groups. These row groups in turn consist of one or more column chunks, each of which corresponds to a column in the dataset. The data for each column chunk is then written in the form of pages.
Parquet files are very fast to load and save, but you can't read them by eye the way you can read a CSV, because the data is stored in columnar form.
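If you want to see those row groups and column chunks for yourself, the file footer exposes them. Here is a minimal sketch using the parquet-hadoop metadata API; the file path is just a placeholder:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class InspectParquet {
    public static void main(String[] args) throws Exception {
        // Read only the footer, which lists every row group and its column chunks.
        ParquetMetadata footer = ParquetFileReader.readFooter(
                new Configuration(), new Path("/tmp/example.parquet"));
        for (BlockMetaData rowGroup : footer.getBlocks()) {
            System.out.println("row group: " + rowGroup.getRowCount() + " rows");
            for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
                System.out.println("  column chunk " + chunk.getPath()
                        + ": " + chunk.getTotalSize() + " bytes");
            }
        }
    }
}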
ParquetWriter's constructors are deprecated (as of 1.8.1), but ParquetWriter itself is not; you can still create a ParquetWriter by extending the abstract Builder subclass inside it.
Here is an example from the Parquet creators themselves, ExampleParquetWriter:
public static class Builder extends ParquetWriter.Builder<Group, Builder> {
    private MessageType type = null;
    private Map<String, String> extraMetaData = new HashMap<String, String>();

    private Builder(Path file) {
        super(file);
    }

    // Fluent setter for the Parquet schema the writer will use.
    public Builder withType(MessageType type) {
        this.type = type;
        return this;
    }

    public Builder withExtraMetaData(Map<String, String> extraMetaData) {
        this.extraMetaData = extraMetaData;
        return this;
    }

    // Required so the parent builder's fluent methods return this concrete type.
    @Override
    protected Builder self() {
        return this;
    }

    // Supplies the WriteSupport that converts Group records into Parquet pages.
    @Override
    protected WriteSupport<Group> getWriteSupport(Configuration conf) {
        return new GroupWriteSupport(type, extraMetaData);
    }
}
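For illustration, a hedged usage sketch for that builder. It is not part of the original example: it assumes a static factory such as ExampleParquetWriter.builder(Path) to get past the private constructor, and the schema and output path are made up:
// Assumption: ExampleParquetWriter.builder(Path) wraps the private constructor above.
MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int32 id; required binary name (UTF8); }");

try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("/tmp/example.parquet"))
        .withType(schema)
        .build()) {
    // SimpleGroupFactory builds Group records that match the schema above.
    Group group = new SimpleGroupFactory(schema).newGroup()
            .append("id", 1)
            .append("name", "drill");
    writer.write(group);
}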
If you don't want to use Group and GroupWriteSupport (bundled in Parquet, but purposed just as an example of a data-model implementation), you can go with the Avro, Protocol Buffers, or Thrift in-memory data models. Here is an example of writing Parquet using Avro:
try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(fileToWrite)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
    for (GenericData.Record record : recordsToWrite) {
        writer.write(record);
    }
}
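In case it helps, here is a sketch of how schema, fileToWrite, and recordsToWrite might be prepared for the snippet above; the record fields and the output path are made up for illustration:
// Illustrative setup only: field names and the output path are assumptions.
Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"User\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"int\"},"
        + "{\"name\": \"name\", \"type\": \"string\"}]}");

Path fileToWrite = new Path("/tmp/users.parquet");

// One GenericData.Record per row; field names must match the schema.
List<GenericData.Record> recordsToWrite = new ArrayList<>();
GenericData.Record record = new GenericData.Record(schema);
record.put("id", 1);
record.put("name", "drill");
recordsToWrite.add(record);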
You will need these dependencies:
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.8.1</version>
</dependency>
Full example here.
A few possible ways to do it. For example, using Impala (a sketch of the statements follows below):
1. Load your data into a text table, say text_table. See Impala's Create Table Statement for details.
2. Create a Parquet table, say parquet_table.
3. Run insert into parquet_table select * from text_table to copy all data from the text file to the parquet table.
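As a hedged sketch only, the Impala statements for those steps might look like this; the column list and the HDFS location are assumptions, not part of the original answer:
-- Illustrative only: columns and location are made up.
CREATE TABLE text_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/csv_input';

CREATE TABLE parquet_table (id INT, name STRING) STORED AS PARQUET;

INSERT INTO parquet_table SELECT * FROM text_table;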