I have a parquet file stored in S3 bucket. I want to get the list of all columns of the parquet file. I am using s3 select
but it just give me list of all rows wihtout any column headers.
Is there anyway to get all column names from this parquet file without downloading it completely? Since parquet file can be very large, I would not want to download the entire parquet file which is why I am using s3 select
to pick first few rows using
select * from S3Object LIMIT 10
I tried to fetch column names explicitly by doing
SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'S3Object'
but it would not work as AWS S3 doesn't support this yet.
Is there any other way to achieve the same?
I have the same problem but unfortunately, my Google-Kung-Fu was not strong enough this time.
I found the following workaround which I don't really like but it works for me:
r = s3.select_object_content(Bucket='...your bucket...',
Key='...your key...',
ExpressionType='SQL',
Expression="select s.* from S3Object s limit 1",
InputSerialization={'Parquet': {}},
OutputSerialization={'JSON': {}})
row = json.loads([rec['Records']['Payload'].decode('utf-8') for rec in r['Payload'] if 'Records' in rec][0])
print("Columns: ", list(row.keys()))
I.e. what the code does is to request the first line of the data, to extract the payload and to load the returned JSON object. The obtained JSON object has the following structure {"Column name": "value", ....}
, such that one only has to extract the keys of the JSON object (last line).
One additional problem is that this does not return the types of columns. This is something that I could not solve yet.
UPDATE: I observed that in some situations, some column names were not correct reported. Instead of the real name something like _18, _19 were returned. No idea how to deal with it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With