How to get list of all columns from a parquet file using s3 select?

Question

I have a parquet file stored in S3 bucket. I want to get the list of all columns of the parquet file. I am using s3 select but it just give me list of all rows wihtout any column headers.

Is there anyway to get all column names from this parquet file without downloading it completely? Since parquet file can be very large, I would not want to download the entire parquet file which is why I am using s3 select to pick first few rows using

select * from S3Object LIMIT 10

I tried to fetch column names explicitly by doing

SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'S3Object'

but it would not work as AWS S3 doesn't support this yet.

Is there any other way to achieve the same?

sergej · Accepted Answer

I have the same problem but unfortunately, my Google-Kung-Fu was not strong enough this time.

I found the following workaround which I don't really like but it works for me:

r = s3.select_object_content(Bucket='...your bucket...', 
                             Key='...your key...', 
                             ExpressionType='SQL', 
                             Expression="select s.* from S3Object s limit 1",
                             InputSerialization={'Parquet': {}}, 
                              OutputSerialization={'JSON': {}})
row = json.loads([rec['Records']['Payload'].decode('utf-8') for rec in r['Payload'] if 'Records' in rec][0])

print("Columns: ", list(row.keys()))

I.e. what the code does is to request the first line of the data, to extract the payload and to load the returned JSON object. The obtained JSON object has the following structure {"Column name": "value", ....}, such that one only has to extract the keys of the JSON object (last line).

One additional problem is that this does not return the types of columns. This is something that I could not solve yet.

UPDATE: I observed that in some situations, some column names were not correct reported. Instead of the real name something like _18, _19 were returned. No idea how to deal with it.

How to get list of all columns from a parquet file using s3 select?

Tags:

java

sql

amazon-s3

parquet

amazon-s3-select

CodeHunter

1 Answers

sergej

Recent Activity

Donate For Us

How to get list of all columns from a parquet file using s3 select?

Tags:

java

sql

amazon-s3

parquet

amazon-s3-select

CodeHunter

1 Answers

sergej

Related questions

Recent Activity

Donate For Us