I started playing around with Spark locally and ran into this weird issue.
1) pip install pyspark==2.3.1
2) pyspark>

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

df = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, 3.0]})
sp_df = spark.createDataFrame(df)

@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

sp_df.withColumn('v2', pandas_plus_one(sp_df.x)).show()
I took this example from https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Any idea why I keep getting this error?
py4j.protocol.Py4JJavaError: An error occurred while calling o108.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 8, localhost, executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:333)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:322)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:177)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:90)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:158)
    ... 27 more
A Pandas UDF behaves like a regular PySpark function API in general. Before Spark 3.0, Pandas UDFs were defined with PandasUDFType. From Spark 3.0 with Python 3.6+, you can also use Python type hints; the type-hint style is preferred, and PandasUDFType will be deprecated in a future release.
A Pandas UDF is defined by using pandas_udf as a decorator or to wrap the function, and no additional configuration is required. It is available since Spark 2.3.0. pandas_udf takes the user-defined function (a Python function, if used as a standalone function) and the return type of the user-defined function.
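For illustration, here is the same scalar UDF written in both styles. This is only a sketch: the type-hint form assumes Spark 3.0+ with Python 3.6+, and the function names plus_one_legacy and plus_one_hints are placeholders.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Spark 2.3 / 2.4 style: the UDF kind is given explicitly via PandasUDFType.
@pandas_udf('long', PandasUDFType.SCALAR)
def plus_one_legacy(v):
    return v + 1

# Spark 3.0+ style: the scalar UDF kind is inferred from the pandas type hints.
@pandas_udf('long')
def plus_one_hints(v: pd.Series) -> pd.Series:
    return v + 1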
To use a pandas UDF, we have to change the function so that it operates on a pandas Series. For example, a scalar palindrome function can be wrapped in a vectorized function (call it vector_palindrome) that uses the Series apply method to run the palindrome function element-wise; we then register this wrapper with pandas_udf and call it like any other column function, as sketched below.
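A minimal sketch of that pattern, using the Spark 2.3-style pandas UDF API; the names palindrome, vector_palindrome, and the column word are just placeholders from the description above.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

def palindrome(s):
    # Plain Python function that works on a single string.
    return s == s[::-1]

@pandas_udf('boolean', PandasUDFType.SCALAR)
def vector_palindrome(s):
    # Wrapper that applies the scalar function element-wise over a pandas Series.
    return s.apply(palindrome)

# Called like any other column expression, e.g.:
# sp_df.withColumn('is_palindrome', vector_palindrome(sp_df.word))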
I had the same problem. I found it to be a version problem between pandas and numpy.
For me the following works:
numpy==1.14.5
pandas==0.23.4
pyarrow==0.10.0
Before, I had the following non-working combination:
numpy==1.15.1
pandas==0.23.4
pyarrow==0.10.0
I found the issue to be only an incompatible version of pyarrow. Spark 2.4.0 was built with pyarrow 0.10.0 (https://issues.apache.org/jira/browse/SPARK-23874).
I reverted my pyarrow package to 0.10.0 (current version was 0.15.x) and it worked perfectly.
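If you are unsure which versions your driver is actually picking up, a quick check from the same Python interpreter that pyspark uses (just a sketch) is:

import numpy
import pandas
import pyarrow

# Print the versions Spark's Python worker will see.
print('numpy  :', numpy.__version__)
print('pandas :', pandas.__version__)
print('pyarrow:', pyarrow.__version__)

If pyarrow reports something newer than the version your Spark release was built against (0.10.0 for Spark 2.4.0, per SPARK-23874), downgrading it, e.g. with pip install pyarrow==0.10.0, is what resolved the error here.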
The config that works for me is:
numpy==1.14.3
pandas==0.23.0
pyarrow==0.10.0