I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts that array to another array containing only the distinct values. See the example below:

Ex: [24, 23, 27, 23] should get converted to [24, 23, 27]

Code:
def uniq_array(col_array):
    x = np.unique(col_array)
    return x

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))
In the above code, Df2.age_array is the array to which I am applying the UDF in order to get a new column "age_array_unique", which should contain only the unique values of the array. However, as soon as I run Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!
Most of the Py4JJavaError exceptions I've seen came from mismatched data types between Python and Spark, especially when the function uses a data type from a Python module like NumPy, so that's the first thing I'd look into when there's an error. For example, if the output is a numpy.ndarray, the UDF throws an exception.
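As an illustration, here is a minimal sketch of that kind of fix; the dot_product function, the DoubleType return type, and the cast are my own assumptions for illustration, not code from the original post:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# np.dot returns a numpy.float64, which Spark's pickler rejects;
# casting to a plain Python float before returning avoids the exception.
def dot_product(xs, ys):
    return float(np.dot(xs, ys))

dot_product_udf = udf(dot_product, DoubleType())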
One reason for slowness I ran into was that my data was too small in terms of file size: when the dataframe is small enough, Spark sends the entire dataframe to one and only one executor and leaves the other executors waiting. In other words, Spark doesn't distribute the Python function as desired if the dataframe is too small. To fix this, I repartitioned the dataframe before calling the UDF, as in the sketch below.
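A minimal sketch of that fix, applied to the question's dataframe; the partition count of 100 is my own illustrative choice, not a value from the answer:

# Spread the rows across executors before the UDF runs.
Df2 = Df2.repartition(100)
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))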
For a function that returns a tuple of mixed-type values, I can make a corresponding StructType(), which is a composite type in Spark, and specify what is in the struct with StructField(). For example, if I have a function that returns the position and the letter from ascii_letters, the declaration looks like the sketch below.
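The original example code was cut off here; the following is a minimal reconstruction under that description, with the function and field names being my own assumptions:

import string
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Returns a tuple of mixed types: the integer position and the letter
# at that position in string.ascii_letters.
def position_and_letter(i):
    return (i, string.ascii_letters[i])

position_and_letter_udf = udf(
    position_and_letter,
    StructType([
        StructField("position", IntegerType(), False),
        StructField("letter", StringType(), False),
    ]),
)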
The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts the numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order)
from collections import OrderedDict

udf(lambda xs: list(OrderedDict((x, None) for x in xs)), ArrayType(IntegerType()))
instead.
If you really want np.unique, you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
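Putting that last variant together with the code from the question, a complete sketch might look like this:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# .tolist() converts both the ndarray container and its NumPy scalar
# elements into a plain Python list of ints, which Spark can serialize.
uniq_array_udf = udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))
Df3.show()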
You need to convert the final value to a Python list. Implement the function as follows:
def uniq_array(col_array):
    x = np.unique(col_array)
    return list(x)
This is because Spark doesn't understand the NumPy array format. In order to feed Spark DataFrames a Python object they understand as an ArrayType, you need to convert the output to a Python list before returning it.
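For what it's worth, you can see the container-type difference in a plain Python session (a hypothetical illustration, not from the answer):

import numpy as np

arr = np.unique([24, 23, 27, 23])
print(type(arr))        # <class 'numpy.ndarray'> -- what the UDF was returning
print(type(list(arr)))  # <class 'list'> -- a plain Python list container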