I have been using Apache Arrow with Spark for a while in Python and have been easily able to convert between dataframes and Arrow objects by using Pandas as an intermediary.
Recently, however, I’ve moved from Python to Scala for interacting with Spark, and using Arrow isn’t as intuitive in Scala (Java) as it is in Python. My basic need is to convert a Spark dataframe (or RDD, since they’re easily convertible) to an Arrow object as quickly as possible. My initial thought was to convert to Parquet first and go from Parquet to Arrow, since I remembered that pyarrow could read from Parquet. However, and please correct me if I’m wrong, after looking at the Arrow Java docs for a while I couldn’t find a Parquet-to-Arrow function. Does this function not exist in the Java version? Is there another way to get a Spark dataframe to an Arrow object? Perhaps converting the dataframe's columns to arrays and then converting those to Arrow objects?
Any help would be much appreciated. Thank you
EDIT: I found the following link, which converts a Parquet schema to an Arrow schema, but it doesn't seem to return an Arrow object from a Parquet file like I need: https://github.com/apache/parquet-mr/blob/70f28810a5547219e18ffc3465f519c454fee6e5/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java
Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these methods, set the Spark configuration spark.sql.execution.arrow.pyspark.enabled to true.
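For reference, a minimal sketch of that configuration in use (assumes PySpark 3.x with pyarrow installed; the sample data is made up):

```python
# Sketch of Arrow-backed pandas <-> Spark conversion (assumes PySpark 3.x and
# pyarrow installed; the sample data is illustrative).
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers between Spark and pandas.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pandas_df = pd.DataFrame({"id": range(100), "value": [i * 0.5 for i in range(100)]})

# pandas -> Spark: the pandas DataFrame is shipped to the JVM as Arrow batches.
spark_df = spark.createDataFrame(pandas_df)

# Spark -> pandas: results come back as Arrow batches instead of pickled rows.
result_pdf = spark_df.toPandas()
print(result_pdf.head())
```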
Looking at the source code for toPandas(), one reason it may be slow is that it first creates the pandas DataFrame and then copies each of the Series in that DataFrame over to the returned DataFrame.
The Arrow format allows serializing and shipping columnar data over the network - or any kind of streaming transport. Apache Spark uses Arrow as a data interchange format, and both PySpark and sparklyr can take advantage of Arrow for significant performance gains when transferring data.
Why use PyArrow with PySpark? Apache Arrow provides high-performance in-memory columnar data structures that accelerate converting Spark data to pandas objects. Previously, Spark exposed a row-based interface for interpreting and running user-defined functions (UDFs).
Now there's an answer: Arrow can be used to convert Spark DataFrames to pandas DataFrames and when calling pandas UDFs. Please see the PySpark SQL documentation page on pandas with Arrow.
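For example, a pandas (vectorized) UDF moves data between the JVM and the Python workers as Arrow record batches; a small sketch, assuming PySpark 3.x with pyarrow installed:

```python
# Sketch of a pandas (vectorized) UDF; data is exchanged with the JVM as Arrow
# record batches. Assumes PySpark 3.x with pyarrow installed.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    # Receives whole pandas Series (one per Arrow batch), not individual rows.
    return s + 1

df.select(plus_one(df["id"]).alias("id_plus_one")).show(5)
```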
Spark 3.3 will have a mapInArrow API call, similar to the already existing mapInPandas API call.
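A minimal sketch of what that could look like (assumes PySpark 3.3+ with pyarrow installed; the column name and the doubling logic are just for illustration):

```python
# Sketch of mapInArrow (PySpark 3.3+): the mapped function consumes and yields
# iterators of pyarrow.RecordBatch. Assumes pyarrow is installed; the column
# name and the doubling logic are illustrative.
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

def double_values(batches):
    for batch in batches:
        # Operate directly on Arrow arrays, with no pandas step in between.
        doubled = pc.multiply(batch.column(0), 2)
        yield pa.RecordBatch.from_arrays([doubled], names=["value"])

result = df.mapInArrow(double_values, schema="value long")
result.show(5)
```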
Here's the first PR that adds this to Python: https://github.com/apache/spark/pull/34505
There will be another similar Spark Scala API call too by the time 3.3 releases.
I'm not sure exactly what your use case is, but this seems like it may help.
PS. Note that this API is initially planned as a developer-level API, since working with Arrow may not be very user-friendly at first. It may be great if you're developing a library on top of Spark/Arrow, for example, where you can abstract away some of those Arrow nuances.
There is not a Parquet <-> Arrow converter available as a library in Java yet. You could have a look at the Arrow-based Parquet converter in Dremio (https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet) for inspiration. I am sure the Apache Parquet project would welcome your contribution implementing this functionality.
We have developed an Arrow reader/writer for Parquet in the C++ implementation: https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow. Nested data support is not complete yet, but it should be more complete within the next 6-12 months (sooner as contributors step up).
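That C++ reader is what pyarrow wraps, so on the Python side the Parquet-as-intermediary route from the question already works today; a minimal sketch (the file path is illustrative):

```python
# Sketch: pyarrow wraps the C++ Parquet <-> Arrow reader, so a Spark DataFrame
# written to Parquet can be read back as an Arrow Table in Python. The path
# below is illustrative.
import pyarrow.parquet as pq

table = pq.read_table("/tmp/spark_output.parquet")  # returns a pyarrow.Table
print(table.schema)
```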