
Read parquet data from AWS s3 bucket

I need to read Parquet data from AWS S3. Using the AWS SDK I can get an InputStream like this:

S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();

But the Apache Parquet reader only accepts a local file path, like this:

ParquetReader<Group> reader =
                    ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
                            .withConf(conf)
                            .build();
reader.read()

So I don't know how to parse an input stream as a Parquet file. For CSV files, for example, there is CSVParser, which accepts an input stream.
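One common workaround (not from the original post; the class and method names here are my own) is to buffer the S3 stream into a temporary file and hand that path to the file-based ParquetReader. A minimal sketch of the buffering step, using only the JDK:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class S3ToTempFile {

    // Copy an InputStream (e.g. from S3Object.getObjectContent()) into a
    // temporary file so a file-based reader such as ParquetReader can open it.
    public static Path bufferToTempFile(InputStream in) throws Exception {
        Path tmp = Files.createTempFile("s3-parquet-", ".parquet");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real S3 stream, just to show the call shape.
        byte[] payload = "dummy-bytes".getBytes();
        Path tmp = bufferToTempFile(new ByteArrayInputStream(payload));
        System.out.println(Files.size(tmp) == payload.length);
    }
}
```

The temp file's absolute path can then be passed to `ParquetReader.builder(...)` exactly as in the snippet above. The obvious downside is the extra disk round trip for large objects.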

I know of a solution that uses Spark for this, like this:

SparkSession spark = SparkSession
                .builder()
                .getOrCreate();
Dataset<Row> ds = spark.read().parquet("s3a://bucketName/file.parquet");

But I cannot use spark.

Could anyone suggest a way to read Parquet data from S3?

asked Oct 19 '17 by Alexander



1 Answer

String SCHEMA_TEMPLATE = "{" +
                        "\"type\": \"record\",\n" +
                        "    \"name\": \"schema\",\n" +
                        "    \"fields\": [\n" +
                        "        {\"name\": \"timeStamp\", \"type\": \"string\"},\n" +
                        "        {\"name\": \"temperature\", \"type\": \"double\"},\n" +
                        "        {\"name\": \"pressure\", \"type\": \"double\"}\n" +
                        "    ]" +
                        "}";
String PATH_SCHEMA = "s3a";
Path internalPath = new Path(PATH_SCHEMA, bucketName, folderName);
Schema schema = new Schema.Parser().parse(SCHEMA_TEMPLATE);
Configuration configuration = new Configuration();
AvroReadSupport.setRequestedProjection(configuration, schema);
ParquetReader<GenericRecord> parquetReader =
                AvroParquetReader.<GenericRecord>builder(internalPath)
                        .withConf(configuration)
                        .build();
GenericRecord genericRecord = parquetReader.read();

while (genericRecord != null) {
        Map<String, String> valuesMap = new HashMap<>();
        genericRecord.getSchema().getFields().forEach(field -> valuesMap.put(field.name(), genericRecord.get(field.name()).toString()));
        // process valuesMap here

        genericRecord = parquetReader.read();
}

Gradle dependencies

    compile 'com.amazonaws:aws-java-sdk:1.11.213'
    compile 'org.apache.parquet:parquet-avro:1.9.0'
    compile 'org.apache.parquet:parquet-hadoop:1.9.0'
    compile 'org.apache.hadoop:hadoop-common:2.8.1'
    compile 'org.apache.hadoop:hadoop-aws:2.8.1'
    compile 'org.apache.hadoop:hadoop-client:2.8.1'
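With hadoop-aws on the classpath, the s3a filesystem also needs credentials. One way (not shown in the original answer) is to set them on the Hadoop Configuration before building the reader; the property names come from hadoop-aws, and the key values here are placeholders:

```java
Configuration configuration = new Configuration();
// Property names defined by hadoop-aws; values are placeholders.
configuration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
configuration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
// Optional: a non-default S3 endpoint, e.g. for another region.
// configuration.set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com");
```

Alternatively, credentials can come from the usual AWS environment variables or instance profile, depending on how the credential provider chain is configured.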
answered Oct 18 '22 by Alexander