I'm receiving a set of (1 MB) CSV/JSON files on S3 that I would like to convert to Parquet. I was expecting to be able to convert these files easily to Parquet using a Lambda function.
After searching on Google I didn't find a solution to this that doesn't involve some sort of Hadoop.
Since this is just a file conversion, I can't believe there is no easy solution for this. Does someone have some Java/Scala sample code to do this conversion?
Create the file for the function you update and deploy later in this tutorial. A Lambda function can use any runtime supported by AWS Lambda. For more information, see AWS Lambda runtimes.
You have to create the file in /tmp. That's the only location you are allowed to write to in the Lambda environment.
So every time a new file is uploaded to S3, the trigger fires and invokes the Lambda function, which reads the Parquet file and writes the data to a DynamoDB table.
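Along those lines, a minimal Python handler for the JSON-to-Parquet case the question asks about could look roughly like the sketch below; the destination bucket name and key layout are assumptions, not part of the original setup:

import os
import urllib.parse

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client('s3')

# Hypothetical destination bucket; adjust to your own setup.
DEST_BUCKET = 'my-parquet-bucket'

def handler(event, context):
    # The S3 trigger passes the uploaded object's bucket and key in the event.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])

    # /tmp is the only writable path in the Lambda environment.
    local_json = os.path.join('/tmp', os.path.basename(key))
    local_parquet = local_json + '.parquet'
    s3.download_file(bucket, key, local_json)

    # Read the JSON into pandas, convert to a pyarrow Table, write Parquet.
    df = pd.read_json(local_json)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, local_parquet)

    s3.upload_file(local_parquet, DEST_BUCKET, key + '.parquet')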
If your input JSON files are not large (< 64 MB, beyond which Lambda is likely to hit memory caps) and either have simple data types or you are willing to flatten the structs, you might consider using pyarrow, even though the route is slightly convoluted.
It involves using Pandas:
df = pd.read_json('file.json')
followed by converting the DataFrame to a pyarrow Table and writing it out as a Parquet file:
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.pq')
The above example auto-infers the data types. You can override this by passing the dtype argument when loading the JSON. Its only major shortcoming is that pyarrow supports only string, bool, float, int, date, time, decimal, list, and array.
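Put together, with the imports spelled out, a minimal sketch of this route (the file names are just placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

# Read the JSON file; pass dtype= here to override the inferred column types.
df = pd.read_json('file.json')

# Convert the DataFrame to a pyarrow Table and write it out as Parquet.
table = pa.Table.from_pandas(df)
pa.parquet.write_table(table, 'file.pq')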
Update (a more generic solution):
Consider using json2parquet.
However, if the input data has nested dictionaries, it first needs to be flattened, i.e. convert:
{a: {b: {c: d}}} to {a.b.c: d}
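A small recursive helper can do this flattening; this is just an illustration, not something json2parquet provides:

def flatten(d, parent_key=''):
    # Recursively flatten nested dicts, joining keys with '.'.
    items = {}
    for key, value in d.items():
        new_key = f'{parent_key}.{key}' if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key))
        else:
            items[new_key] = value
    return items

# flatten({'a': {'b': {'c': 'd'}}}) returns {'a.b.c': 'd'}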
Then, this data needs to be ingested as a pyarrow batch with json2parquet:
pa_batch = j2p.ingest_data(data)
and now the batch can be loaded as a PyArrow Table:
df = pa.Table.from_batches([pa_batch])
and written out to a Parquet file:
pa.parquet.write_table(df, 'file.pq')
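End to end, the json2parquet route might look roughly like this, reusing the illustrative flatten helper from above; the sample records and file name are placeholders:

import json2parquet as j2p
import pyarrow as pa
import pyarrow.parquet

# Parsed JSON objects, flattened so there are no nested dictionaries.
records = [{'a': {'b': {'c': 'd'}}}]
data = [flatten(record) for record in records]

# Ingest the flattened records as a pyarrow RecordBatch via json2parquet.
pa_batch = j2p.ingest_data(data)

# Build a pyarrow Table from the batch and write it out as Parquet.
df = pa.Table.from_batches([pa_batch])
pa.parquet.write_table(df, 'file.pq')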