I am trying to fetch a subset of records from a CSV stored in an S3 bucket using the following code:
import boto3
import pandas as pd
from io import StringIO

s3 = boto3.client('s3')
bucket = bucket      # name of the S3 bucket, defined elsewhere
file_name = file     # key of the CSV object, defined elsewhere
sql_stmt = """SELECT S.* FROM s3object S LIMIT 10"""
req = s3.select_object_content(
    Bucket=bucket,
    Key=file_name,
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization={'CSV': {}},
)
records = []
# The response payload is an event stream; collect the record chunks.
for event in req['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'])
    elif 'Stats' in event:
        stats = event['Stats']['Details']
file_str = ''.join(r.decode('utf-8') for r in records)
select_df = pd.read_csv(StringIO(file_str))
df = pd.DataFrame(select_df)
print(df)
This successfully returns the records, but the headers are missing.
I read in "S3 Select CSV Headers" that S3 Select does not return headers at all. So, is it possible to retrieve the headers of a CSV file in S3 in any other way?
With S3 Select, you can use a simple SQL expression to return only the data from the store you're interested in, instead of retrieving the entire object. This means you're dealing with an order of magnitude less data which improves the performance of your underlying applications.
In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it. The procedure for saving the object depends on the browser and operating system that you are using.
With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need.
Athena uses Presto when you run SQL queries on Amazon S3. You can run ANSI-Compliant SQL SELECT statements to query your data in Amazon S3.
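Change InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
to InputSerialization={'CSV': {"FileHeaderInfo": "NONE"}},
Then it will print the full content, including the header.
Explanation:
FileHeaderInfo accepts one of "NONE", "USE", or "IGNORE".
Use the NONE option rather than USE. With NONE, S3 Select treats the first line as an ordinary record instead of a header, so the header row is returned along with the data. Note that with NONE you can refer to columns in the SQL expression only by position (_1, _2, ...), not by header name, and the header row counts toward any LIMIT.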
Here is the reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.select_object_content
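As a rough sketch of the NONE approach, the helper below requests one extra row and splits the header off the front. The function name select_with_header and its parameters are made up for illustration. It assumes the object is UTF-8 CSV whose first line is the header, and that s3_client is a boto3 S3 client (or anything exposing the same select_object_content call).

```python
import csv
from io import StringIO


def select_with_header(s3_client, bucket, key, limit=10):
    """Return (header, rows) for the first `limit` data rows of a CSV object.

    Hypothetical helper: with FileHeaderInfo='NONE' the header line comes
    back as an ordinary record, so we ask for limit + 1 rows and treat the
    first one as the header.
    """
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=f'SELECT * FROM s3object s LIMIT {int(limit) + 1}',
        InputSerialization={'CSV': {'FileHeaderInfo': 'NONE'}},
        OutputSerialization={'CSV': {}},
    )
    # Join the raw bytes before decoding, so a multi-byte character that is
    # split across two Records events still decodes correctly.
    chunks = [
        event['Records']['Payload']
        for event in response['Payload']
        if 'Records' in event
    ]
    text = b''.join(chunks).decode('utf-8')
    rows = list(csv.reader(StringIO(text)))
    if not rows:
        return [], []
    return rows[0], rows[1:]
```

With a real client this could be called as header, rows = select_with_header(boto3.client('s3'), bucket, file_name), and pd.DataFrame(rows, columns=header) would then give a DataFrame with named columns. All values come back as strings, so any type conversion is left to the caller.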
I hope it helps.