I believe it is related to this one: Spark Error:expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
I have a dataframe:
id  col_1  col_2
1   [1,2]  [1,3]
2   [2,1]  [3,4]
I want to create another column that is the cosine distance between col_1 and col_2.
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from scipy.spatial.distance import cosine

def cosine_distance(a, b):
    try:
        return cosine(a, b)
    except Exception as e:
        return 0.0  # in case of division by zero
And I defined a udf:
cosine_distance_udf = udf(cosine_distance, FloatType())
And finally:
new_df = df.withColumn('cosine_distance', cosine_distance_udf('col_1', 'col_2'))
But I get the error: PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
What did I do wrong?
The reason for the error is clear when you check the return type of cosine:
type(cosine([1, 2], [1, 3]))
# numpy.float64
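Spark's serializer does not handle NumPy types, hence the PickleException. However, np.float64 is a subclass of float, so it converts cleanly to a built-in float:
import numpy as np
issubclass(np.float64, float)
# True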
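To make the distinction concrete, here is a small sketch of the two conversions. It assumes NumPy is installed, and the value 0.25 is just a stand-in for whatever cosine returns; nothing here touches Spark.

```python
import numpy as np

# Stand-in for the numpy.float64 that scipy's cosine() returns.
value = np.float64(0.25)

as_builtin = float(value)   # explicit conversion to a built-in float
via_item = value.item()     # NumPy's own conversion to a Python scalar

print(type(value).__name__)       # float64
print(type(as_builtin).__name__)  # float
print(type(via_item).__name__)    # float
```

Even though the NumPy value passes an isinstance check against float, its concrete type is still numpy.float64, and that is the type the UDF hands back unless you convert it.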
So, with a small change to your function (returning a plain Python float):
def cosine_distance(a, b):
    try:
        return float(cosine(a, b))  # or: cosine(a, b).item()
    except Exception:
        return 0.0  # in case of division by zero
Re-create cosine_distance_udf from the updated function, and this will work:
df.withColumn('cosine_distance', cosine_distance_udf('col_1', 'col_2')).show()
+------+------+---------------+
| col_1| col_2|cosine_distance|
+------+------+---------------+
|[1, 2]|[3, 4]| 0.01613009|
|[2, 1]|[3, 4]| 0.10557281|
+------+------+---------------+
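If scipy is not otherwise needed inside the UDF, a pure-Python version avoids the NumPy return type altogether. This is only a sketch: the name cosine_distance_plain is hypothetical, and it assumes both columns hold equal-length lists of numbers.

```python
import math

def cosine_distance_plain(a, b):
    """Cosine distance of two equal-length sequences, as a built-in float."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # zero vector: fall back to 0.0, like the function above
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance_plain([1, 2], [3, 4]))
```

It can be wrapped with udf(cosine_distance_plain, FloatType()) in the same way as before.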