I have been using Apache Arrow with Spark for a while in Python and have been easily able to convert between dataframes and Arrow objects by using Pandas as an intermediary.
Recently, however, I’ve moved from Python to Scala for interacting with Spark, and using Arrow isn’t as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark dataframe (or RDD, since they’re easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow could read from Parquet. However, and please correct me if I’m wrong, after looking at the Arrow Java docs for a while I couldn’t find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark dataframe to an Arrow object? Perhaps converting the dataframe's columns to arrays and then converting those to Arrow objects?
Any help would be much appreciated. Thank you
EDIT: I found the following link, which converts a Parquet schema to an Arrow schema, but it doesn't seem to return an Arrow object from a Parquet file like I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java
Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these methods, set the Spark configuration spark.sql.execution.arrow.pyspark.enabled to true.
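For reference, a minimal sketch of that configuration in use (assumes PySpark 3.x with pyarrow installed; the sample data is made up):

```python
# Sketch of Arrow-backed pandas <-> Spark conversion (assumes PySpark 3.x and
# pyarrow installed; the sample data is illustrative).
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers between Spark and pandas.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pandas_df = pd.DataFrame({"id": range(100), "value": [i * 0.5 for i in range(100)]})

# pandas -> Spark: the pandas DataFrame is shipped to the JVM as Arrow batches.
spark_df = spark.createDataFrame(pandas_df)

# Spark -> pandas: results come back as Arrow batches instead of pickled rows.
result_pdf = spark_df.toPandas()
print(result_pdf.head())
```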
Looking at the source code for toPandas(), one reason it may be slow is that it first creates the pandas DataFrame and then copies each of the Series in that DataFrame over to the returned DataFrame.
The Arrow format allows serializing and shipping columnar data over the network - or any kind of streaming transport. Apache Spark uses Arrow as a data interchange format, and both PySpark and sparklyr can take advantage of Arrow for significant performance gains when transferring data.
Why use PyArrow with PySpark? Apache Arrow provides high-performance in-memory columnar data structures that accelerate converting Spark data to pandas objects. Previously, Spark exposed a row-based interface for interpreting and running user-defined functions (UDFs).
Now there's an answer: Arrow can be used to convert Spark DataFrames to pandas DataFrames and when calling pandas UDFs. Please see the PySpark SQL documentation page on pandas with Arrow.
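For example, a pandas (vectorized) UDF moves data between the JVM and the Python workers as Arrow record batches; a small sketch, assuming PySpark 3.x with pyarrow installed:

```python
# Sketch of a pandas (vectorized) UDF; data is exchanged with the JVM as Arrow
# record batches. Assumes PySpark 3.x with pyarrow installed.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    # Receives whole pandas Series (one per Arrow batch), not individual rows.
    return s + 1

df.select(plus_one(df["id"]).alias("id_plus_one")).show(5)
```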
Spark 3.3 will have a mapInArrow API call, similar to the already existing mapInPandas API call.
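A minimal sketch of what that could look like (assumes PySpark 3.3+ with pyarrow installed; the column name and the doubling logic are just for illustration):

```python
# Sketch of mapInArrow (PySpark 3.3+): the mapped function consumes and yields
# iterators of pyarrow.RecordBatch. Assumes pyarrow is installed; the column
# name and the doubling logic are illustrative.
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

def double_values(batches):
    for batch in batches:
        # Operate directly on Arrow arrays, with no pandas step in between.
        doubled = pc.multiply(batch.column(0), 2)
        yield pa.RecordBatch.from_arrays([doubled], names=["value"])

result = df.mapInArrow(double_values, schema="value long")
result.show(5)
```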
Here's the first PR that adds this to Python: https://github.com/apache/spark/pull/34505
There will be another similar Spark Scala API call too by the time 3.3 releases.
I'm not sure exactly what your use case is, but this seems like it may help.
PS. Note that this API is initially planned as a developer-level API, since working with Arrow may not be very user-friendly at first. It may be great if you're developing a library on top of Spark/Arrow, for example, where you can abstract away some of those Arrow nuances.
There is not a Parquet <-> Arrow converter available as a library in Java yet. You could have a look at the Arrow-based Parquet converter in Dremio (https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet) for inspiration. I am sure the Apache Parquet project would welcome your contribution implementing this functionality.
We have developed an Arrow reader/writer for Parquet in the C++ implementation: https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow. Nested data support is not complete yet, but it should be more complete within the next 6-12 months (sooner as contributors step up).
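That C++ reader is what pyarrow wraps, so on the Python side the Parquet-as-intermediary route from the question already works today; a minimal sketch (the file path is illustrative):

```python
# Sketch: pyarrow wraps the C++ Parquet <-> Arrow reader, so a Spark DataFrame
# written to Parquet can be read back as an Arrow Table in Python. The path
# below is illustrative.
import pyarrow.parquet as pq

table = pq.read_table("/tmp/spark_output.parquet")  # returns a pyarrow.Table
print(table.schema)
```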