How to use Pandas UDFs on macOS Mojave? (that fails due to [__NSPlaceholderDictionary initialize] may have been in progress...)

I'm trying to use Pandas UDFs (a.k.a. Vectorized UDFs) in Apache Spark 2.4.0 on macOS 10.14.3 (macOS Mojave).

I installed pandas and pyarrow using pip (and later pip3).

Whenever I execute the following sample code from the official Spark SQL documentation, I get an exception.

import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

def multiply_func(a, b):
    return a * b

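# Wrap the plain Python function as a vectorized (pandas) UDF returning longs
multiply = pandas_udf(multiply_func, returnType=LongType())

x = pd.Series([1, 2, 3])
# Plain pandas call: runs locally, without Spark
print(multiply_func(x, x))
# `spark` is the active SparkSession (created automatically in the pyspark shell)
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))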

# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()

The exception is as follows:

objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called.
objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
19/03/27 15:01:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:486)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:475)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:178)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:98)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
    at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:128)
    ...
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:159)
    ... 28 more

asked Mar 27 '19 14:03

Jacek Laskowski

1 Answer

I found a solution in the issue "Doesn't work on macOS High Sierra" (#69) and thought I'd post it on Stack Overflow.

You should make sure that Xcode's command line tools are already installed. If not, execute the following:

xcode-select --install
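
To check whether the tools are already present before running the installer, something like the following could be used. It is only a sketch: the helper name command_line_tools_path is made up for illustration, and it relies on xcode-select -p printing the active developer directory and exiting with a non-zero status when none is configured.

```python
import shutil
import subprocess


def command_line_tools_path():
    """Return the active developer directory, or None if it cannot be determined.

    Hypothetical helper for illustration; not part of Spark or Xcode.
    """
    if shutil.which("xcode-select") is None:
        # Not on macOS, or xcode-select is not on PATH
        return None
    result = subprocess.run(
        ["xcode-select", "-p"], capture_output=True, text=True
    )
    if result.returncode != 0:
        # xcode-select reports an error when no developer directory is set
        return None
    return result.stdout.strip() or None


path = command_line_tools_path()
if path is None:
    print("Command line tools not found; run: xcode-select --install")
else:
    print(f"Developer directory: {path}")
```

On a machine without xcode-select the helper simply returns None, so the script can be run anywhere.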

The crucial step turned out to be exporting the OBJC_DISABLE_INITIALIZE_FORK_SAFETY environment variable in the shell before starting PySpark:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

With both steps in place, the code worked fine:

>>> # Execute function as a Spark vectorized UDF
... df.select(multiply(col("x"), col("x"))).show()
[Stage 0:>                                                          (0 + 1) / 1]/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
+-------------------+
|multiply_func(x, x)|
+-------------------+
|                  1|
|                  4|
|                  9|
+-------------------+
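
If the code runs as a standalone script rather than from a shell where the variable was exported, the same setting can probably be applied from Python. The sketch below is an assumption rather than something I verified with Spark: it relies on child processes inheriting the parent's environment, so the variable has to be set before the SparkSession (and with it the JVM and the Python workers) is started. The helper names are hypothetical, and the Spark part is left out so the snippet runs without Spark installed.

```python
import os
import subprocess
import sys

FORK_SAFETY_VAR = "OBJC_DISABLE_INITIALIZE_FORK_SAFETY"


def disable_objc_fork_safety(env=None):
    """Set the variable in `env` (os.environ by default) and return that mapping.

    Hypothetical helper; equivalent to `export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`
    for this process and the children it starts afterwards.
    """
    target = os.environ if env is None else env
    target[FORK_SAFETY_VAR] = "YES"
    return target


def child_sees_variable():
    """Start a child Python process and return the value it inherits."""
    code = f"import os; print(os.environ.get('{FORK_SAFETY_VAR}', ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Must happen before any SparkSession is created (assumption, see above)
disable_objc_fork_safety()
print(child_sees_variable())
```

Only after the call to disable_objc_fork_safety() would one build the SparkSession and run the UDF from the question.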

answered Oct 20 '22 12:10

Jacek Laskowski

Tags:

apache-spark

pyspark

pyspark-sql

pyarrow
