 

Creating a parquet file on AWS Lambda function

I'm receiving a set of (1 MB) CSV/JSON files on S3 that I would like to convert to Parquet. I was expecting to be able to convert these files to Parquet easily using a Lambda function.

After searching on Google, I didn't find a solution to this that doesn't involve some sort of Hadoop.

Since this is just a file conversion, I can't believe there is no easy solution for it. Does someone have some Java/Scala sample code to do this conversion?

oleber asked Jan 06 '17

1 Answer

If your input JSON files are not large (< 64 MB, beyond which lambda is likely to hit memory caps) and either have simple data types or you are willing to flatten the structs, you might consider using pyarrow, even though the route is slightly convoluted.

It involves using pandas to load the JSON:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

# Load the JSON file into a DataFrame (data types are auto-inferred).
df = pd.read_json('file.json')

followed by converting it into a Parquet file:

# Convert the DataFrame to an Arrow Table, then write it out as Parquet.
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.parquet')

The above example auto-infers data types. You can override this with the dtype argument when loading the JSON. Its only major shortcoming is that pyarrow supports only string, bool, float, int, date, time, decimal, list, and array types.
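For instance, a small sketch of overriding the inferred type for one column while loading (the column name id here is hypothetical, not from the original data):

import pandas as pd

# Hypothetical example: keep "id" as a string instead of letting
# read_json infer it as an integer.
df = pd.read_json('file.json', dtype={'id': str})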

Update (a more generic solution):

Consider using json2parquet.

However, if the input data contains nested dictionaries, it first needs to be flattened, i.e.:

{a: {b: {c: d}}} to {a.b.c: d}
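A minimal sketch of such a flattening helper (flatten and the records list are my own names, not part of json2parquet):

def flatten(record, parent_key='', sep='.'):
    # Recursively flatten nested dicts into dot-separated keys.
    flat = {}
    for key, value in record.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat

# {'a': {'b': {'c': 'd'}}} -> {'a.b.c': 'd'}
data = [flatten(rec) for rec in records]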

Then, this data needs to be ingested as a PyArrow RecordBatch with json2parquet:

import json2parquet as j2p

# Ingest the flattened records as an Arrow RecordBatch.
pa_batch = j2p.ingest_data(data)

and now the batch can be loaded as a PyArrow Table:

table = pa.Table.from_batches([pa_batch])

and written out as a Parquet file:

pa.parquet.write_table(table, 'file.parquet')
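Putting the pieces together, here is a rough sketch of how this could run as an S3-triggered Lambda handler. The bucket/key handling, the assumption that each file is a JSON array of flat objects, and the "parquet/" output prefix are my own choices, not part of the answer:

import json
import boto3
import pyarrow as pa
import pyarrow.parquet
from json2parquet import ingest_data

s3 = boto3.client('s3')

def handler(event, context):
    # Locate the JSON file that triggered the S3 event.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

    # Assumes the file is a JSON array of flat objects; nested dicts
    # would first need the flattening step sketched above.
    records = json.loads(body)

    # JSON -> Arrow RecordBatch -> Table -> Parquet file in /tmp,
    # the only writable path in the Lambda environment.
    batch = ingest_data(records)
    table = pa.Table.from_batches([batch])
    local_path = '/tmp/output.parquet'
    pa.parquet.write_table(table, local_path)

    # Upload the result under a "parquet/" prefix (assumed output layout).
    s3.upload_file(local_path, bucket, 'parquet/' + key + '.parquet')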
siberiancrane answered Sep 18 '22