What type should it be, after using .toArray() for a Spark vector?

I want to convert my vector column to an array, so I use:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

get_array = udf(lambda x: x.toArray(), ArrayType(DoubleType()))
result3 = result2.withColumn('list', get_array('features'))
result3.show()

where the column features has the vector type. But Spark tells me that:

 net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I know the reason must be the return type I use in the UDF, so I tried get_array = udf(lambda x: x.toArray(), ArrayType(FloatType())), which does not work either. I know the result is a numpy.ndarray after the conversion, but how can I show it correctly?
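Here is a minimal, self-contained sketch that reproduces the same error (the data below is just illustrative, not my real dataset); returning the raw numpy.ndarray from toArray() is what fails:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, Vectors.dense([0.1, 0.2, 0.3]))],
    ["id", "features"])

# The UDF returns a numpy.ndarray, which Spark cannot serialize
# as ArrayType(DoubleType()); the job fails once the column is materialized.
get_array = udf(lambda x: x.toArray(), ArrayType(DoubleType()))
df.withColumn("list", get_array("features")).show()  # raises the PickleException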

Here is the code that produces my dataframe result2:

# `indexed` is the DataFrame shown below
df4 = indexed.groupBy('uuid').pivot('name').sum('fre')
df4 = df4.fillna(0)

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

assembler = VectorAssembler(
    inputCols=df4.columns[1:],
    outputCol="features")
dataset = assembler.transform(df4)

bk = BisectingKMeans(k=8, seed=2, featuresCol="features")
result2 = bk.fit(dataset).transform(dataset)

Here is what indexed looks like:

+------------------+------------+---------+-------------+------------+----------+--------+----+
|              uuid|    category|     code|   servertime|         cat|       fre|catIndex|name|
+------------------+------------+---------+-------------+------------+----------+--------+----+
|   351667085527886|         398|     null|1503084585000|         398|0.37951264|     2.0|  a2|
|   352279079643619|         403|     null|1503105476000|         403| 0.3938634|     3.0|  a3|
|   352279071621894|         398|     null|1503085396000|         398|0.38005984|     2.0|  a2|
|   357653074851887|         398|     null|1503085552000|         398| 0.3801652|     2.0|  a2|
|   354287077780760|         407|     null|1503085603000|         407|0.38019964|     5.0|  a5|
|0_8f394ebf3f67597c|         403|     null|1503084183000|         403|0.37924168|     3.0|  a3|
|   353528084062994|         403|     null|1503084234000|         403|0.37927604|     3.0|  a3|
|   356626072993852|   100000504|100000504|1503104781000|   100000504| 0.3933774|     0.0|  a0|
|   351667081062615|   100000448|      398|1503083901000|         398|0.37905172|     2.0|  a2|
|   354330089551058|1.00000444E8|     null|1503084004000|1.00000444E8|0.37912107|    34.0| a34|
+------------------+------------+---------+-------------+------------+----------+--------+----+

In result2, I have several columns of type double; I use VectorAssembler to assemble those double columns into the vector column features, which is the column I want to convert to an array.
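For reference, checking the schema confirms that features is a vector column; the relevant line of the (abbreviated) output should look something like this:

result2.printSchema()
# ...
#  |-- features: vector (nullable = true)
# ...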

asked Aug 25 '17 by nick_liu


1 Answer

NumPy types are not supported as return values for UserDefinedFunctions. You have to convert the output to a standard Python list:

udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
answered Sep 28 '22 by zero323
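As a usage note, applying the corrected UDF to result2 looks like the sketch below (reusing the names from the question). On Spark 3.0 and later, pyspark.ml.functions.vector_to_array performs the same conversion without a Python UDF.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Convert the numpy.ndarray to a plain Python list before returning it
get_array = udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
result3 = result2.withColumn('list', get_array('features'))
result3.show()

# Spark 3.0+ alternative, no Python UDF needed:
# from pyspark.ml.functions import vector_to_array
# result3 = result2.withColumn('list', vector_to_array('features'))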