Read Parquet file stored in S3 with AWS Lambda (Python 3)

Tags: python, amazon-s3, aws-lambda, parquet, pyarrow

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy among others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, update the Lambda function and test it.

It seems that there are two possible approaches, both of which work locally in the Docker container:

1. fastparquet with s3fs: Unfortunately the unzipped size of the package is bigger than 256MB, so I can't update the Lambda code with it.
2. pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and when executed with the Lambda function I get either:
   - If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment, OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. Locally, I get IndexError: list index out of range in pyarrow/parquet.py, line 714.
   - If I don't prefix the URI with S3 or S3N: it works locally (I can read the Parquet data). In the Lambda environment, I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848.
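For reference, the pyarrow-with-s3fs approach from the linked pull request can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix for the errors above: the helper names `strip_s3_scheme` and `read_parquet_from_s3` are hypothetical, and it assumes that `pyarrow` and `s3fs` are packaged with the function and that the Lambda role is allowed to read the bucket.

```python
from urllib.parse import urlparse


def strip_s3_scheme(uri):
    """Return 'bucket/key' for 's3://bucket/key' (also s3n/s3a); pass other paths through."""
    parsed = urlparse(uri)
    if parsed.scheme in ("s3", "s3n", "s3a"):
        return parsed.netloc + parsed.path
    return uri


def read_parquet_from_s3(uri):
    """Read a Parquet file or dataset from S3 into a pandas DataFrame."""
    # Imported lazily so the path helper can be used without these packages.
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # credentials come from the environment / IAM role
    # Passing the filesystem explicitly, so pyarrow does not treat the path as local.
    dataset = pq.ParquetDataset(strip_s3_scheme(uri), filesystem=fs)
    return dataset.read().to_pandas()
```

Normalizing the URI to `bucket/key` before handing it to pyarrow is one way to take the prefix question out of the picture while comparing the two environments.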

My questions are:

- Why do I get a different result in my Docker container than I do in the Lambda environment?
- What is the proper way to give the URI?
- Is there an accepted way to read Parquet files in S3 through AWS Lambda?

Thanks!

539

asked Dec 26 '17 22:12

Ptah

2 Answers

AWS has a project, AWS Data Wrangler, that does this and has full Lambda Layers support.

Its documentation includes a step-by-step guide for setting it up.

Code example:

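import awswrangler as wr
import pandas as pd

# Example DataFrame (placeholder data; the partition column must exist in it)
df = pd.DataFrame({"value": [1, 2], "PARTITION_COL_NAME": ["a", "b"]})

# Write
wr.s3.to_parquet(
    df=df,  # wr.s3.to_parquet takes the DataFrame as `df`
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")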
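
Inside a Lambda function, these calls might be wired to an S3 event roughly as follows. This is a sketch under assumptions: the helper `s3_uris_from_event` and the `processed/` output prefix are hypothetical, the function is assumed to be triggered by S3 object-created events, and the AWS Data Wrangler layer is assumed to be attached. It uses the `df=` keyword of `wr.s3.to_parquet`.

```python
from urllib.parse import unquote_plus


def s3_uris_from_event(event):
    """Build s3:// URIs from the records of an S3 event notification."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        uris.append(f"s3://{bucket}/{key}")
    return uris


def lambda_handler(event, context):
    # Imported here so the helper above can be tested without the layer.
    import awswrangler as wr

    uris = s3_uris_from_event(event)
    for uri in uris:
        df = wr.s3.read_parquet(path=uri)
        # ... process df here ...
        bucket, key = uri[len("s3://"):].split("/", 1)
        # Hypothetical output location: same bucket, under a "processed/" prefix.
        wr.s3.to_parquet(df=df, path=f"s3://{bucket}/processed/{key}")
    return {"processed": len(uris)}
```

If the trigger covers the whole bucket, writing back into it would invoke the function again, so restricting the trigger to an input prefix (or writing to a different bucket) is the safer setup.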


160

answered Oct 06 '22 10:10

Igor Tavares

I was able to write Parquet files to S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that, to put together all the dependencies, I had to use the exact same Linux that Lambda uses.

Here's how I did it:

1. Spin up an EC2 instance using the Amazon Linux image that is used with Lambda

Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2

Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:

sudo yum list | grep python3

I installed:

python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64

2. Used the instructions from here to build a zip file with all of the dependencies that my script would use, by dumping them all in a folder and then zipping them with these commands:

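mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . <any other dependencies>
# copy your Python file into this folder, for example (hypothetical file name):
cp ../my_lambda.py .
# zip the folder contents, then upload the zip to Lambda
zip -r ../parquet.zip .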

Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the package can't be larger than about 250 MB (262,144,000 bytes) unzipped. If anyone knows a better way to get dependencies into Lambda, please do share.
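
To check a package against these limits before uploading, something like the following can help. This is a minimal sketch using only the standard library; the function names are illustrative, and the limits are hard-coded from the figures Lambda documents for direct zip uploads (50 MB zipped, 250 MB unzipped), which may change.

```python
import os
import zipfile

MAX_ZIPPED = 50 * 1024 * 1024      # direct-upload limit for the zip file
MAX_UNZIPPED = 250 * 1024 * 1024   # limit for the unpacked package


def build_zip(folder, zip_path):
    """Zip the contents of `folder` and return (unzipped_bytes, zipped_bytes).

    `zip_path` should live outside `folder`, so the archive does not include itself.
    """
    unzipped = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                unzipped += os.path.getsize(full)
                # Store paths relative to the folder, as Lambda expects.
                zf.write(full, os.path.relpath(full, folder))
    return unzipped, os.path.getsize(zip_path)


def fits_lambda(unzipped, zipped):
    """True if both sizes are within the hard-coded limits above."""
    return zipped <= MAX_ZIPPED and unzipped <= MAX_UNZIPPED
```

For the folder built above, this could be called from its parent directory as `build_zip("parquet", "parquet.zip")`, with the two returned sizes passed to `fits_lambda`.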

Source: Write parquet from AWS Kinesis firehose to AWS S3

answered Oct 06 '22 09:10

phoenix
