Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read excel file from S3 into Pandas DataFrame

I have a SNS notification setup that triggers a Lambda function when a .xlsx file is uploaded to S3 bucket.

The lambda function reads the .xlsx file into Pandas DataFrame.

import os 
import pandas as pd
import json
import xlrd
import boto3

def main(event, context):
    message = event['Records'][0]['Sns']['Message']
    parsed_message = json.loads(message)
    src_bucket = parsed_message['Records'][0]['s3']['bucket']['name']
    filepath = parsed_message['Records'][0]['s3']['object']['key']

    s3 = boto3.resource('s3')
    s3_client = boto3.client('s3')

    obj = s3_client.get_object(Bucket=src_bucket, Key=filepath)
    print(obj['Body'])

    df = pd.read_excel(obj, header=2)
    print(df.head(2))

I get an error as below:

Invalid file path or buffer object type: <type 'dict'>: ValueError
Traceback (most recent call last):
File "/var/task/handler.py", line 26, in main
df = pd.read_excel(obj, header=2)
File "/var/task/pandas/util/_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "/var/task/pandas/util/_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "/var/task/pandas/io/excel.py", line 307, in read_excel
io = ExcelFile(io, engine=engine)
File "/var/task/pandas/io/excel.py", line 376, in __init__
io, _, _, _ = get_filepath_or_buffer(self._io)
File "/var/task/pandas/io/common.py", line 218, in get_filepath_or_buffer
raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <type 'dict'>

How can I resolve this?

like image 601
Raj Avatar asked Jan 14 '19 16:01

Raj


People also ask

How do you read data from Excel file in Python using pandas?

To read an excel file as a DataFrame, use the pandas read_excel() method. You can read the first sheet, specific sheets, multiple sheets or all sheets. Pandas converts this to the DataFrame structure, which is a tabular like structure.

How can I read XLSX file in pandas?

Use pandas. read_excel() function to read excel sheet into pandas DataFrame, by default it loads the first sheet from the excel file and parses the first row as a DataFrame column name. Excel file has an extension . xlsx.

Can pandas read Excel files?

Read an Excel file into a pandas DataFrame. Supports xls , xlsx , xlsm , xlsb , odf , ods and odt file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets. Any valid string path is acceptable.

Can S3 store Excel files?

You can directly read xls file from S3 without having to download or save it locally. xlrd module has a provision to provide raw data to create workbook object.


3 Answers

It is perfectly normal! obj is a dictionnary, have u tried ?

df = pd.read_excel(obj['body'], header=2)
like image 72
Tarik Elkalai Avatar answered Sep 30 '22 04:09

Tarik Elkalai


Pandas now supports s3 URL as a file path so it can read the excel file directly from s3 without downloading it first.

See here for a CSV example - https://stackoverflow.com/a/51777553/52954

like image 32
LiorH Avatar answered Sep 30 '22 06:09

LiorH


try pd.read_excel(obj['Body'].read())

like image 45
Ritman Cronestar Avatar answered Sep 30 '22 04:09

Ritman Cronestar