I'm receiving a set of (1 MB) CSV/JSON files on S3 that I would like to convert to Parquet. I was expecting to be able to convert these files easily to Parquet using a Lambda function.
After searching on Google I didn't find a solution to this that doesn't involve some sort of Hadoop.
Since this is just a file conversion, I can't believe there is no easy solution for this. Does someone have some Java/Scala sample code to do this conversion?
Create the file for the function you update and deploy later in this tutorial. A Lambda function can use any runtime supported by AWS Lambda. For more information, see AWS Lambda runtimes.
You have to create the file in /tmp. That's the only location you are allowed to write to in the Lambda environment.
So every time a new file is uploaded to S3, the trigger fires and invokes the Lambda function, which reads the Parquet file and writes the data to a DynamoDB table.
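Along those lines, a minimal Python handler for the JSON-to-Parquet case the question asks about could look roughly like the sketch below; the destination bucket name and key layout are assumptions, not part of the original setup:

import os
import urllib.parse

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client('s3')

# Hypothetical destination bucket; adjust to your own setup.
DEST_BUCKET = 'my-parquet-bucket'

def handler(event, context):
    # The S3 trigger passes the uploaded object's bucket and key in the event.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])

    # /tmp is the only writable path in the Lambda environment.
    local_json = os.path.join('/tmp', os.path.basename(key))
    local_parquet = local_json + '.parquet'
    s3.download_file(bucket, key, local_json)

    # Read the JSON into pandas, convert to a pyarrow Table, write Parquet.
    df = pd.read_json(local_json)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, local_parquet)

    s3.upload_file(local_parquet, DEST_BUCKET, key + '.parquet')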
If your input JSON files are not large (< 64 MB, beyond which Lambda is likely to hit memory caps) and either have simple data types or you are willing to flatten the structs, you might consider using pyarrow, even though the route is slightly convoluted.
It involves using Pandas:
df = pd.read_json('file.json')
followed by converting the DataFrame to a pyarrow Table and writing it out as a Parquet file:
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.pq')
The above example auto-infers the data types. You can override this by passing the dtype argument when loading the JSON. Its only major shortcoming is that pyarrow supports only string, bool, float, int, date, time, decimal, list, and array.
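Put together, with the imports spelled out, a minimal sketch of this route (the file names are just placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

# Read the JSON file; pass dtype= here to override the inferred column types.
df = pd.read_json('file.json')

# Convert the DataFrame to a pyarrow Table and write it out as Parquet.
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.pq')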
Update (a more generic solution):
Consider using json2parquet.
However, if the input data has nested dictionaries, it first needs to be flattened, i.e. convert:
{a: {b: {c: d}}} to {a.b.c: d}
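A small recursive helper can do this flattening; this is just an illustration, not something json2parquet provides:

def flatten(d, parent_key=''):
    # Recursively flatten nested dicts, joining keys with '.'.
    items = {}
    for key, value in d.items():
        new_key = f'{parent_key}.{key}' if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key))
        else:
            items[new_key] = value
    return items

# flatten({'a': {'b': {'c': 'd'}}}) returns {'a.b.c': 'd'}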
Then, this data needs to be ingested as a pyarrow batch with json2parquet:
pa_batch = j2p.ingest_data(data)
and now the batch can be loaded as a PyArrow Table:
df = pa.Table.from_batches([pa_batch])
and written out to a Parquet file:
pa.parquet.write_table(df, 'file.pq')
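End to end, the json2parquet route might look roughly like this, reusing the illustrative flatten helper from above; the sample records and file name are placeholders:

import json2parquet as j2p
import pyarrow as pa
import pyarrow.parquet

# Parsed JSON objects, flattened so there are no nested dictionaries.
records = [{'a': {'b': {'c': 'd'}}}]
data = [flatten(record) for record in records]

# Ingest the flattened records as a pyarrow RecordBatch via json2parquet.
pa_batch = j2p.ingest_data(data)

# Build a pyarrow Table from the batch and write it out as Parquet.
df = pa.Table.from_batches([pa_batch])
pa.parquet.write_table(df, 'file.pq')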