 

pandasUDF and pyarrow 0.15.0

I have recently started getting a bunch of errors on a number of PySpark jobs running on EMR clusters. The errors are:

java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:98)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)...

They all seem to happen in apply functions on a pandas Series. The only change I can find is that pyarrow was updated on Saturday (05/10/2019). Tests seem to pass with 0.14.1.

So my question is: does anyone know whether this is a bug in the newly released pyarrow, or is there some significant change that will make pandas UDFs hard to use in the future?
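
For reference, a minimal sketch of the kind of job that hits this code path; the column name and data are hypothetical, not from the failing jobs:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

    @pandas_udf(DoubleType(), PandasUDFType.SCALAR)
    def plus_one(s):
        # Each batch arrives as a pandas Series over Arrow IPC; with pyarrow
        # 0.15.0 on the Python side and an older Arrow Java on the JVM side,
        # reading the stream fails with the IllegalArgumentException above.
        return s.apply(lambda v: v + 1.0)

    df.select(plus_one("x")).show()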

asked Oct 07 '19 by ilijaluve

People also ask

What is PyArrow used for?

The PyArrow library provides a Python API for the functionality provided by the Arrow libraries, along with tools for Arrow integration and interoperability with pandas, NumPy, and other software in the Python ecosystem.
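
A minimal sketch of that pandas interoperability (the data is hypothetical):

    import pandas as pd
    import pyarrow as pa

    pdf = pd.DataFrame({"x": [1, 2, 3]})
    table = pa.Table.from_pandas(pdf)  # pandas -> Arrow columnar table
    roundtrip = table.to_pandas()      # Arrow table -> pandas again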

How do you use PyArrow in PySpark?

If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the SQL module with the command pip install pyspark[sql]. Otherwise, you must ensure that PyArrow is installed and available on all cluster nodes. You can install it with pip, or with conda from the conda-forge channel.
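
A minimal sketch of a session with Arrow-based transfer switched on; the config key is the Spark 2.x name (Spark 3 renames it spark.sql.execution.arrow.pyspark.enabled):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Enable Arrow-based columnar data transfer (Spark 2.x key).
        .config("spark.sql.execution.arrow.enabled", "true")
        .getOrCreate()
    )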

Does Spark use Apache arrow?

Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer data between JVM and Python processes. This is currently most beneficial to Python users who work with pandas/NumPy data.
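
For example, with the flag from the previous sketch enabled, toPandas() moves data as Arrow record batches instead of pickled rows (illustrative DataFrame):

    # Assumes the Arrow-enabled `spark` session from the previous sketch.
    sdf = spark.range(100).selectExpr("id", "id * 2 AS doubled")
    pdf = sdf.toPandas()  # Arrow-accelerated when the flag is on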


1 Answer

It's not a bug. We made an important protocol change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java -- your Spark environment seems to be using an older version.

Your options are:

  • Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 wherever your Python code runs, on both the driver and the executors (see the sketch after this list)
  • Downgrade to pyarrow < 0.15.0 for now.
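
A sketch of the first option for a PySpark-on-YARN setup such as EMR; the spark.executorEnv.* and spark.yarn.appMasterEnv.* keys are standard Spark settings, but verify them against your deployment:

    import os
    from pyspark.sql import SparkSession

    # Make pyarrow >= 0.15.0 write the pre-0.15 IPC format that the older
    # Arrow Java bundled with Spark can still read.
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # driver-side Python

    spark = (
        SparkSession.builder
        # Propagate the same variable to the executor Python workers.
        .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .getOrCreate()
    )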

Hopefully the Spark community will be able to upgrade to 0.15.0 in Java soon so this issue goes away.

This is discussed in the 0.15.0 release blog post: http://arrow.apache.org/blog/2019/10/06/0.15.0-release/

answered Oct 05 '22 by Wes McKinney