 

Spark DataFrame to Arrow

I have been using Apache Arrow with Spark for a while in Python and have been easily able to convert between dataframes and Arrow objects by using Pandas as an intermediary.

Recently, however, I've moved from Python to Scala for interacting with Spark, and using Arrow isn't as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark dataframe (or RDD, since they're easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow could read from Parquet. However, and please correct me if I'm wrong, after looking at the Arrow Java docs for a while I couldn't find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark dataframe to an Arrow object? Perhaps converting the dataframe's columns to arrays and then converting those to Arrow objects?

Any help would be much appreciated. Thank you

EDIT: Found the following link, which converts a Parquet schema to an Arrow schema, but it doesn't seem to return an Arrow object from a Parquet file like I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java

asked Jul 27 '17 by supert165

People also ask

How do you turn on arrows in PySpark?

Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these methods, set the Spark configuration spark.sql.execution.arrow.enabled to true.
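
For example, a minimal sketch of enabling the optimization and round-tripping through pandas (assuming a plain local Spark session; note that on Spark 3.x the config key was renamed to spark.sql.execution.arrow.pyspark.enabled):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Enable Arrow-based columnar data transfers
    # (on Spark 3.x the key is spark.sql.execution.arrow.pyspark.enabled)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    # pandas -> Spark goes through Arrow when the option is enabled
    df = spark.createDataFrame(pdf)

    # Spark -> pandas also goes through Arrow
    result_pdf = df.toPandas()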

Why does toPandas take so long?

Looking at the source code for toPandas(), one reason it may be slow is because it first creates the pandas DataFrame and then copies each of the Series in that DataFrame over to the returned DataFrame.

Does Spark use arrow?

The Arrow format allows serializing and shipping columnar data over the network, or over any kind of streaming transport. Apache Spark uses Arrow as a data interchange format, and both PySpark and sparklyr can take advantage of Arrow for significant performance gains when transferring data.

Why is PyArrow used?

Apache Arrow provides high-performance in-memory columnar data structures that accelerate converting Spark data to pandas objects. Previously, Spark exposed only a row-based interface for interpreting and running user-defined functions (UDFs).


3 Answers

Now there's an answer: Arrow can be used when converting Spark DataFrames to Pandas DataFrames and when calling Pandas UDFs. Please see the "PySpark Usage Guide for Pandas with Apache Arrow" page in the Spark SQL documentation.
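
For instance, a scalar Pandas UDF ships column batches between the JVM and the Python worker as Arrow record batches. A minimal sketch in the Spark 3.x type-hint style (plus_one is just an illustrative name):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf

    spark = SparkSession.builder.getOrCreate()

    # Each invocation receives a pandas Series backed by an Arrow batch
    @pandas_udf("double")
    def plus_one(v: pd.Series) -> pd.Series:
        return v + 1

    df = spark.range(5).select(plus_one(col("id").cast("double")).alias("id_plus_one"))
    df.show()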

answered Oct 08 '22 by Douglas M

Spark 3.3 will have a mapInArrow API call, similar to the already existing mapInPandas API call.

Here's the first PR that adds this to Python - https://github.com/apache/spark/pull/34505

There should be a similar Spark Scala API call too by the time 3.3 releases.

Not sure exactly what your use case is, but this may help.

PS. Note that this API is initially planned as developer-level, since working with Arrow directly may not be very user-friendly at first. It can be great if you're developing a library on top of Spark/Arrow, for example, where you can abstract away some of those Arrow nuances.
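
For illustration, a minimal sketch of what the Python side looks like in Spark 3.3 (double_ids is a hypothetical function name; the mapped function consumes and yields pyarrow.RecordBatch objects):

    import pyarrow as pa
    import pyarrow.compute as pc
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)

    # The function receives an iterator of pyarrow.RecordBatch objects and
    # must yield RecordBatches matching the declared output schema.
    def double_ids(batches):
        for batch in batches:
            doubled = pc.multiply(batch.column(0), 2)  # operate on Arrow arrays directly
            yield pa.RecordBatch.from_arrays([doubled], names=["id"])

    out = df.mapInArrow(double_ids, schema="id long")
    out.show()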

answered Oct 08 '22 by Tagar

There is not a Parquet <-> Arrow converter available as a library in Java yet. You could have a look at the Arrow-based Parquet converter in Dremio (https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet) for inspiration. I am sure the Apache Parquet project would welcome your contribution implementing this functionality.

We have developed an Arrow reader/writer for Parquet in the C++ implementation: https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow. Nested data support is not complete yet, but it should be more complete within the next 6-12 months (sooner as contributors step up).
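
That C++ reader/writer is what pyarrow wraps, so from Python you can already round-trip between Parquet and Arrow today. A minimal sketch (data.parquet is a placeholder path):

    import pyarrow.parquet as pq

    # Read a Parquet file directly into an Arrow Table (no pandas involved);
    # "data.parquet" is a placeholder path
    table = pq.read_table("data.parquet")
    print(table.schema)

    # Write an Arrow Table back out as Parquet
    pq.write_table(table, "data_copy.parquet")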

answered Oct 08 '22 by Wes McKinney