How to enable Apache Arrow in PySpark

I am trying to enable Apache Arrow for conversion to Pandas. I am using:

pyspark 2.4.4, pyarrow 0.15.0, pandas 0.25.1, numpy 1.17.2

This is the example code:

    import pandas as pd

    # Assumes an active SparkSession named `spark` (as in the pyspark shell).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    x = pd.Series([1, 2, 3])
    df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

I got this warning message:

c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile.
: java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.<init>(ArrowConverters.scala:229)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.getBatchesFromStream(ArrowConverters.scala:228)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:216)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:214)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.readArrowStreamFromFile(ArrowConverters.scala:214)
    at org.apache.spark.sql.api.python.PythonSQLUtils$.readArrowStreamFromFile(PythonSQLUtils.scala:46)
    at org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile(PythonSQLUtils.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)

asked Oct 07 '19 11:10

2 Answers

We made a change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java -- your Spark environment seems to be using an older version.

Your options are:

1. Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 in the environment from which you run Python.
2. Downgrade to pyarrow < 0.15.0 for now.
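As a rough illustration of the first option, the sketch below sets the variable for the driver process and builds Spark configuration entries that are meant to forward it to executors. The helper name `arrow_compat_settings` is hypothetical, and the use of `spark.executorEnv.*` and `spark.yarn.appMasterEnv.*` is an assumption about the cluster setup, not something this thread confirms works in every deployment.

```python
import os

ARROW_COMPAT_VAR = "ARROW_PRE_0_15_IPC_FORMAT"


def arrow_compat_settings():
    """Set the flag for the current (driver) process and return
    Spark config entries intended to pass it on to other processes.

    Hypothetical helper: whether executors honor these entries
    depends on the cluster manager and how workers are launched.
    """
    os.environ[ARROW_COMPAT_VAR] = "1"
    return {
        "spark.executorEnv." + ARROW_COMPAT_VAR: "1",
        "spark.yarn.appMasterEnv." + ARROW_COMPAT_VAR: "1",
    }


settings = arrow_compat_settings()

# With pyspark installed, the entries could be applied while building
# the session (must happen before the session is created):
#
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder
#   for key, value in settings.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

The variable has to be in place before the Python worker processes start, which is why the sketch prepares it ahead of session creation rather than afterwards.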

answered Oct 21 '22 12:10

Wes McKinney

When calling my pandas UDF on my Spark 2.4.4 cluster with pyarrow==0.15, I struggled to set the ARROW_PRE_0_15_IPC_FORMAT=1 flag mentioned above.

I set the flag (1) on the command line via export on the head node, (2) via spark-env.sh and yarn-env.sh on all nodes in the cluster, and (3) in the pyspark code itself from my script on the head node. None of these actually set the flag inside the UDF, for reasons I could not determine.

The simplest solution I found was to call this inside the udf:

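    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("integer", PandasUDFType.SCALAR)
    def foo(*args):
        import os
        # Set the flag in the Python worker process that executes the UDF.
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        #...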

Hopefully this saves someone else several hours.
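If several UDFs need the same workaround, the assignment could be factored into a small decorator. This is a plain-Python sketch only: `with_arrow_compat` and `add_one` are hypothetical names, and stacking the decorator under `@pandas_udf` is an assumption that has not been verified on a cluster.

```python
import functools
import os


def with_arrow_compat(func):
    """Set ARROW_PRE_0_15_IPC_FORMAT=1 in the current process
    each time the wrapped function is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return func(*args, **kwargs)
    return wrapper


@with_arrow_compat
def add_one(values):
    # Stand-in for a UDF body; operates on a plain list here.
    return [v + 1 for v in values]


result = add_one([1, 2, 3])
```

Calling `add_one` sets the variable in whichever process runs it, which mirrors what the inline `os.environ` assignment does inside the UDF.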

answered Oct 21 '22 12:10

K.S.

Tags: pandas, pyspark, pyarrow

R. Lamari
