Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this:
Path path = new Path("userdata1.parquet");
try (ParquetReader<GenericData.Record> reader =
         AvroParquetReader.<GenericData.Record>builder(path)
             .withConf(new Configuration())
             .build()) {
    GenericData.Record record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}
However, I am now trying to access a parquet file through S3 without downloading it first. Is there a way to parse an InputStream directly with the parquet reader?
Yes, recent versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to directly access the S3 filesystem.

The HadoopInputFile path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials access_key and secret_key configured using the properties fs.s3a.access.key and fs.s3a.secret.key.

Additionally, you will need these dependent libraries: the hadoop-common JAR and the aws-java-sdk-bundle JAR.

Read more: Relevant configuration properties
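
As a minimal sketch of how those two properties are wired in (the bucket name, key, and credential values are placeholders; the next answer shows a complete version with pinned dependencies):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

// Placeholder credentials; in practice these come from your secret store
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "ACCESS_KEY");
conf.set("fs.s3a.secret.key", "SECRET_KEY");
InputFile file = HadoopInputFile.fromPath(new Path("s3a://bucket-name/prefix/key"), conf);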
I got it working with the following dependencies:
compile 'org.slf4j:slf4j-api:1.7.5'
compile 'org.slf4j:slf4j-log4j12:1.7.5'
compile 'org.apache.parquet:parquet-avro:1.12.0'
compile 'org.apache.avro:avro:1.10.2'
compile 'com.google.guava:guava:11.0.2'
compile 'org.apache.hadoop:hadoop-client:3.3.0' // keep aligned with hadoop-aws/hadoop-common below
compile 'org.apache.hadoop:hadoop-aws:3.3.0'
compile 'org.apache.hadoop:hadoop-common:3.3.0'
compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'
Example
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

Path path = new Path("s3a://yours3path");
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "KEY");
conf.set("fs.s3a.secret.key", "SECRET");
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
conf.setBoolean("fs.s3a.path.style.access", true);
// Read legacy INT96 timestamps as fixed-length byte arrays
conf.setBoolean(AvroReadSupport.READ_INT96_AS_FIXED, true);

InputFile file = HadoopInputFile.fromPath(path, conf);
try (ParquetReader<GenericRecord> reader =
         AvroParquetReader.<GenericRecord>builder(file).build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}
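
If you specifically want to feed the reader from an InputStream (the literal question) rather than letting s3a fetch the object, note that Parquet needs random access, because the file footer is read first, so a plain stream is not enough by itself. One option, if the object fits in memory, is to buffer it into a byte[] and wrap that in a custom org.apache.parquet.io.InputFile. A minimal sketch, where InMemoryInputFile is my own name for the hypothetical adapter:

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Hypothetical adapter: backs Parquet's InputFile with an in-memory buffer,
// e.g. the fully-read bytes of an S3 object's InputStream.
public class InMemoryInputFile implements InputFile {
    private final byte[] data;

    public InMemoryInputFile(byte[] data) {
        this.data = data;
    }

    @Override
    public long getLength() {
        return data.length;
    }

    @Override
    public SeekableInputStream newStream() {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        return new DelegatingSeekableInputStream(in) {
            @Override
            public long getPos() {
                // current position = total length minus what is still unread
                return data.length - in.available();
            }

            @Override
            public void seek(long newPos) throws IOException {
                in.reset();       // ByteArrayInputStream resets to offset 0
                in.skip(newPos);  // then skip forward to the target position
            }
        };
    }
}

You would then read the object's bytes once (for example from the stream returned by the AWS SDK's S3Object.getObjectContent()) and pass new InMemoryInputFile(bytes) to AvroParquetReader.builder(...) exactly as in the example above. For objects too large to buffer, the s3a route is the better fit, since it issues ranged reads against S3 instead of holding the whole file in memory.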