Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this:
Path path = new Path("userdata1.parquet");
try (ParquetReader<GenericData.Record> reader =
         AvroParquetReader.<GenericData.Record>builder(path)
             .withConf(new Configuration())
             .build()) {
    GenericData.Record record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}
However, I am now trying to access a parquet file through S3 without downloading it first. Is there a way to parse an InputStream directly with the parquet reader?
Yes, recent versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to directly access the S3 filesystem.

The HadoopInputFile path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials access_key and secret_key configured using the properties fs.s3a.access.key and fs.s3a.secret.key.

Additionally, you will need these dependent libraries: the hadoop-common JAR and the aws-java-sdk-bundle JAR.

Read more: Relevant configuration properties
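
As a minimal sketch of how those two properties are wired in (the bucket name, key, and credential values are placeholders; the next answer shows a complete version with pinned dependencies):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

// Placeholder credentials; in practice these come from your secret store
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "ACCESS_KEY");
conf.set("fs.s3a.secret.key", "SECRET_KEY");
InputFile file = HadoopInputFile.fromPath(new Path("s3a://bucket-name/prefix/key"), conf);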
I got it working with the following dependencies:
compile 'org.slf4j:slf4j-api:1.7.5'
compile 'org.slf4j:slf4j-log4j12:1.7.5'
compile 'org.apache.parquet:parquet-avro:1.12.0'
compile 'org.apache.avro:avro:1.10.2'
compile 'com.google.guava:guava:11.0.2'
compile 'org.apache.hadoop:hadoop-client:3.3.0' // keep aligned with hadoop-aws/hadoop-common below
compile 'org.apache.hadoop:hadoop-aws:3.3.0'
compile 'org.apache.hadoop:hadoop-common:3.3.0'
compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'
Example
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

Path path = new Path("s3a://yours3path");
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "KEY");
conf.set("fs.s3a.secret.key", "SECRET");
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
conf.setBoolean("fs.s3a.path.style.access", true);
// Read legacy INT96 timestamps as fixed-length byte arrays
conf.setBoolean(AvroReadSupport.READ_INT96_AS_FIXED, true);

InputFile file = HadoopInputFile.fromPath(path, conf);
try (ParquetReader<GenericRecord> reader =
         AvroParquetReader.<GenericRecord>builder(file).build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}
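
If you specifically want to feed the reader from an InputStream (the literal question) rather than letting s3a fetch the object, note that Parquet needs random access, because the file footer is read first, so a plain stream is not enough by itself. One option, if the object fits in memory, is to buffer it into a byte[] and wrap that in a custom org.apache.parquet.io.InputFile. A minimal sketch, where InMemoryInputFile is my own name for the hypothetical adapter:

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical adapter: backs Parquet's InputFile with an in-memory buffer,
// e.g. the fully-read bytes of an S3 object's InputStream.
public class InMemoryInputFile implements InputFile {
    private final byte[] data;

    public InMemoryInputFile(byte[] data) {
        this.data = data;
    }

    @Override
    public long getLength() {
        return data.length;
    }

    @Override
    public SeekableInputStream newStream() {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        return new DelegatingSeekableInputStream(in) {
            @Override
            public long getPos() {
                // current position = total length minus what is still unread
                return data.length - in.available();
            }

            @Override
            public void seek(long newPos) throws IOException {
                in.reset();       // ByteArrayInputStream resets to offset 0
                in.skip(newPos);  // then skip forward to the target position
            }
        };
    }
}

You would then read the object's bytes once (for example from the stream returned by the AWS SDK's S3Object.getObjectContent()) and pass new InMemoryInputFile(bytes) to AvroParquetReader.builder(...) exactly as in the example above. For objects too large to buffer, the s3a route is the better fit, since it issues ranged reads against S3 instead of holding the whole file in memory.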