Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

S3 Select retrieve headers in the CSV

I am trying to fetch a subset of records from a CSV stored in an S# bucket using the following code:

s3 = boto3.client('s3')
bucket = bucket
file_name = file

sql_stmt = """SELECT S.* FROM s3object S LIMIT 10"""


req = s3.select_object_content(
    Bucket=bucket,
    Key=file,
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization = {'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization = {'CSV': {}},
)

records = []
for event in req['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'])
    elif 'Stats' in event:
        stats = event['Stats']['Details']


file_str = ''.join(r.decode('utf-8') for r in records)

select_df = pd.read_csv(StringIO(file_str))
df = pd.DataFrame(select_df)
print(df)

This successfully yields the records but misses out on headers.

I read here S3 Select CSV Headers that S3 Select does not yield headers at all. So, is it possible to retrieve the headers of a CSV file in S3 in any other way?

like image 619
Sumedha Nagpal Avatar asked Apr 25 '19 22:04

Sumedha Nagpal


People also ask

What is data returned by S3 select?

With S3 Select, you can use a simple SQL expression to return only the data from the store you're interested in, instead of retrieving the entire object. This means you're dealing with an order of magnitude less data which improves the performance of your underlying applications.

How do I retrieve data from AWS S3?

In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it. The procedure for saving the object depends on the browser and operating system that you are using.

Can you query data from S3 bucket?

With Amazon S3 Select, you can use simple structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve just the subset of data that you need.

Does S3 select use Athena?

Athena uses Presto when you run SQL queries on Amazon S3. You can run ANSI-Compliant SQL SELECT statements to query your data in Amazon S3.


1 Answers

Change InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},

TO InputSerialization={'CSV': {"FileHeaderInfo": "NONE"}},

Then, it will print full content, including the header.

Explanation:

FileHeaderInfo accepts one of "NONE" OR "USE" OR "IGNORE".

Use NONE option rather then USE, it will then print header as well, as NONE tells that you need header as well for processing.

Here is reference. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.select_object_content

I hope it helps.

like image 70
Red Boy Avatar answered Sep 25 '22 10:09

Red Boy