What type should the return value be after calling .toArray() on a Spark vector?

Tags:

python

numpy

apache-spark

apache-spark-sql

pyspark

I want to convert my vector column to an array column, so I use

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
get_array = udf(lambda x: x.toArray(), ArrayType(DoubleType()))
result3 = result2.withColumn('list', get_array('features'))
result3.show()

where the column features has the vector dtype. But Spark raises

 net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I assume the cause is the return type I declare in the UDF, so I also tried get_array = udf(lambda x: x.toArray(),ArrayType(FloatType())), which does not work either. I know the value is a numpy.ndarray after the conversion, but how can I display it correctly?

Here is the code that produces my dataframe result2:

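from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.feature import VectorAssembler

# One row per uuid, one summed 'fre' column per name; missing values become 0
df4 = indexed.groupBy('uuid').pivot('name').sum('fre')
df4 = df4.fillna(0)
# Assemble every column except uuid into a single vector column
assembler = VectorAssembler(
    inputCols=df4.columns[1:],
    outputCol="features")
dataset = assembler.transform(df4)
bk = BisectingKMeans(k=8, seed=2, featuresCol="features")
result2 = bk.fit(dataset).transform(dataset)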

Here is what indexed looks like:

+------------------+------------+---------+-------------+------------+----------+--------+----+
|              uuid|    category|     code|   servertime|         cat|       fre|catIndex|name|
+------------------+------------+---------+-------------+------------+----------+--------+----+
|   351667085527886|         398|     null|1503084585000|         398|0.37951264|     2.0|  a2|
|   352279079643619|         403|     null|1503105476000|         403| 0.3938634|     3.0|  a3|
|   352279071621894|         398|     null|1503085396000|         398|0.38005984|     2.0|  a2|
|   357653074851887|         398|     null|1503085552000|         398| 0.3801652|     2.0|  a2|
|   354287077780760|         407|     null|1503085603000|         407|0.38019964|     5.0|  a5|
|0_8f394ebf3f67597c|         403|     null|1503084183000|         403|0.37924168|     3.0|  a3|
|   353528084062994|         403|     null|1503084234000|         403|0.37927604|     3.0|  a3|
|   356626072993852|   100000504|100000504|1503104781000|   100000504| 0.3933774|     0.0|  a0|
|   351667081062615|   100000448|      398|1503083901000|         398|0.37905172|     2.0|  a2|
|   354330089551058|1.00000444E8|     null|1503084004000|1.00000444E8|0.37912107|    34.0| a34|
+------------------+------------+---------+-------------+------------+----------+--------+----+

In result2, I have several columns of type double. I use VectorAssembler to assemble those double columns into a vector column named features, which is the column I want to convert to an array.

asked Aug 25 '17 03:08

nick_liu

1 Answer

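NumPy types are not supported as return values of a UserDefinedFunction. toArray() returns a numpy.ndarray, which Spark cannot convert to ArrayType(DoubleType()), whatever element type you declare. You have to convert the output to a standard Python list: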

udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
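
As a slightly fuller sketch of the same idea, the conversion can be written as a named, None-safe helper and wrapped in a UDF. The names vector_to_list and add_array_column are hypothetical, chosen only for illustration, and the None check is an added assumption for rows where the vector column might be null:

```python
def vector_to_list(vector):
    """Turn an ML vector into a plain list of Python floats.

    toArray() yields a numpy.ndarray; tolist() converts it to
    built-in floats, which Spark can map to ArrayType(DoubleType()).
    """
    if vector is None:  # assumption: pass null rows through unchanged
        return None
    return vector.toArray().tolist()


def add_array_column(df, vector_col="features", array_col="list"):
    """Return df with array_col holding vector_col as an array of doubles."""
    # Imported here so vector_to_list stays usable without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    to_list = udf(vector_to_list, ArrayType(DoubleType()))
    return df.withColumn(array_col, to_list(df[vector_col]))
```

With the dataframe from the question this would be called as result3 = add_array_column(result2). On Spark 3.0 or later, pyspark.ml.functions.vector_to_array performs the same conversion without a Python UDF.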

answered Sep 28 '22 17:09

zero323
