I need to read Parquet data from AWS S3. With the AWS SDK I can get an InputStream like this:
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();
But the Apache Parquet reader only takes a local file, like this:
ParquetReader<Group> reader =
    ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
        .withConf(conf)
        .build();
reader.read();
So I don't know how to parse an input stream for a Parquet file. For CSV files, for example, there is CSVParser, which takes an input stream.
I know I could use Spark for this, like so:
SparkSession spark = SparkSession
.builder()
.getOrCreate();
Dataset<Row> ds = spark.read().parquet("s3a://bucketName/file.parquet");
But I cannot use spark.
Could anyone suggest a way to read Parquet data from S3?
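One answer to the stream question: since Parquet 1.10, a ParquetReader can be built from an org.apache.parquet.io.InputFile instead of a Hadoop Path. A minimal sketch (assuming parquet-avro 1.10+, and that the object fits in memory; InMemoryInputFile is a hypothetical name, not a library class) that buffers the S3 object and reads it without any s3a/Hadoop filesystem setup:

```java
import java.io.ByteArrayInputStream;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

// Adapts an in-memory byte[] (e.g. the fully-read S3 object content)
// to Parquet's InputFile abstraction, which needs random access.
class InMemoryInputFile implements InputFile {
    private final byte[] data;

    InMemoryInputFile(byte[] data) { this.data = data; }

    @Override
    public long getLength() { return data.length; }

    @Override
    public SeekableInputStream newStream() {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        return new DelegatingSeekableInputStream(in) {
            @Override
            public long getPos() {
                // current position = total length minus what is still unread
                return data.length - in.available();
            }
            @Override
            public void seek(long newPos) {
                in.reset();      // back to offset 0
                in.skip(newPos); // then forward to newPos
            }
        };
    }
}
```

Usage would then be something like: byte[] bytes = inputStream.readAllBytes(); (Java 9+, otherwise copy the stream manually) followed by AvroParquetReader.<GenericRecord>builder(new InMemoryInputFile(bytes)).build(). Note Parquet footers live at the end of the file, which is why a plain forward-only InputStream is not enough and the adapter must support seeking.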
Amazon S3 Inventory gives you a flat-file listing of your objects and their metadata, and the inventory can be produced in CSV, ORC, or Parquet format.
We can always read a Parquet file into a DataFrame in Spark to see its content. Parquet is a columnar format, suited to analytical, write-once-read-many workloads, i.e. read-intensive applications.
String SCHEMA_TEMPLATE = "{\n" +
    "  \"type\": \"record\",\n" +
    "  \"name\": \"schema\",\n" +
    "  \"fields\": [\n" +
    "    {\"name\": \"timeStamp\", \"type\": \"string\"},\n" +
    "    {\"name\": \"temperature\", \"type\": \"double\"},\n" +
    "    {\"name\": \"pressure\", \"type\": \"double\"}\n" +
    "  ]\n" +
    "}";
String PATH_SCHEMA = "s3a";
Path internalPath = new Path(PATH_SCHEMA, bucketName, folderName);
Schema schema = new Schema.Parser().parse(SCHEMA_TEMPLATE);

Configuration configuration = new Configuration();
// request only the projected columns defined in the schema above
AvroReadSupport.setRequestedProjection(configuration, schema);

ParquetReader<GenericRecord> parquetReader =
    AvroParquetReader.<GenericRecord>builder(internalPath).withConf(configuration).build();
GenericRecord genericRecord = parquetReader.read();
while (genericRecord != null) {
    // effectively final copy so the record can be used inside the lambda
    final GenericRecord record = genericRecord;
    Map<String, String> valuesMap = new HashMap<>();
    record.getSchema().getFields()
        .forEach(field -> valuesMap.put(field.name(), record.get(field.name()).toString()));
    // ... use valuesMap ...
    genericRecord = parquetReader.read();
}
Gradle dependencies
compile 'com.amazonaws:aws-java-sdk:1.11.213'
compile 'org.apache.parquet:parquet-avro:1.9.0'
compile 'org.apache.parquet:parquet-hadoop:1.9.0'
compile 'org.apache.hadoop:hadoop-common:2.8.1'
compile 'org.apache.hadoop:hadoop-aws:2.8.1'
compile 'org.apache.hadoop:hadoop-client:2.8.1'
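For the s3a:// path above to resolve, hadoop-aws also needs S3 credentials. One way (a sketch; the fs.s3a.* keys are the standard hadoop-aws property names, the values are placeholders) is to set them on the same Configuration before building the reader:

```java
Configuration configuration = new Configuration();
// Placeholder credentials -- substitute your own, or rely on the
// default AWS credential provider chain / instance profile instead.
configuration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
configuration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
```

The same properties can alternatively be placed in core-site.xml on the classpath.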