I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
It seems that there are two possible approaches, both of which work locally in the Docker container:
pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and, when executed in the Lambda function, I get:
OSError: Passed non-file path: s3://mybucket/path/to/myfile
in pyarrow/parquet.py, line 848, while locally I get:
IndexError: list index out of range
in pyarrow/parquet.py, line 714.
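For reference, the approach from that PR looks roughly like this (the bucket path and column name here are placeholders):

import pyarrow.parquet as pq
import s3fs

# s3fs supplies the filesystem object that ParquetDataset uses to reach S3
fs = s3fs.S3FileSystem()

# Path given as bucket/key; the column name is a placeholder
dataset = pq.ParquetDataset("mybucket/path/to/myfile", filesystem=fs)
df = dataset.read(columns=["my_column"]).to_pandas()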
My questions are:
Thanks!
The Lambda function retrieves the source S3 bucket name and the key name of the uploaded object from the event parameter that it receives. The function uses the Amazon S3 getObject API to retrieve the content type of the object.
Read the Parquet file (only the specified columns) into a pandas DataFrame. Convert the DataFrame column with Timestamp datatype to a numeric epoch time, so each record can be stored in DynamoDB. Convert the final DataFrame into a list of dictionaries. Write the records to DynamoDB.
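A minimal sketch of that flow, assuming the deployment package bundles pandas plus s3fs (so pandas can read straight from S3); the column names "id", "event_time", "value" and the table name "my_table" are placeholders:

import json
import urllib.parse
from decimal import Decimal

import boto3
import pandas as pd

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my_table")  # placeholder table name

def lambda_handler(event, context):
    # Bucket and key of the uploaded object come from the S3 event record
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # getObject also returns the object's content type, as described above
    content_type = s3.get_object(Bucket=bucket, Key=key)["ContentType"]
    print(f"Processing {key} ({content_type})")

    # Read only the columns we need into a pandas DataFrame
    df = pd.read_parquet(f"s3://{bucket}/{key}", columns=["id", "event_time", "value"])

    # DynamoDB has no timestamp type, so convert the Timestamp column to epoch seconds
    df["event_time"] = df["event_time"].astype("int64") // 10**9

    # Convert to a list of dicts; round-tripping through JSON turns floats into
    # Decimals, which is what DynamoDB expects
    items = json.loads(df.to_json(orient="records"), parse_float=Decimal)

    # Batch-write the records to DynamoDB
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)

    return {"written": len(items)}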
AWS has a project (AWS Data Wrangler) that supports this, with full Lambda Layers support.
The docs include a step-by-step guide for setting it up.
Code example:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"],
)

# Read
df = wr.s3.read_parquet(path="s3://...")
I was able to accomplish writing parquet files into S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that, to put together all the dependencies, I had to use the exact same Linux that Lambda uses.
Here's how I did it:
Spin up an EC2 instance using the same Amazon Linux image that Lambda runs on.
Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2
Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:
sudo yum list | grep python3
I installed:
python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . (any other dependencies)
copy my Python file into this folder
zip everything and upload it to Lambda
Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, or a package that exceeds 250 MB unzipped. If anyone knows a better way to get dependencies into Lambda, please do share.
Source: Write parquet from AWS Kinesis firehose to AWS S3
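With the dependencies packaged this way, the handler itself stays small. Here's a minimal sketch of the actual write, assuming s3fs is bundled alongside fastparquet; the DataFrame contents and the bucket/key are placeholders:

import pandas as pd
import s3fs
from fastparquet import write

def lambda_handler(event, context):
    # Placeholder data; in practice this would come from the event / processing step
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # s3fs provides the open() hook fastparquet needs to write straight to S3
    fs = s3fs.S3FileSystem()
    write(
        "mybucket/path/to/output.parquet",  # placeholder bucket/key
        df,
        open_with=fs.open,
        compression="GZIP",  # GZIP needs no extra native libs, unlike SNAPPY
    )
    return {"status": "ok"}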