I want to convert my vector column to an array, so I use
get_array = udf(lambda x: x.toArray(),ArrayType(DoubleType()))
result3 = result2.withColumn('list',get_array('features'))
result3.show()
where the column features has a vector dtype. But Spark tells me that
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
I know the cause must be the return type I use in the UDF, so I tried
get_array = udf(lambda x: x.toArray(),ArrayType(FloatType()))
which does not work either. I know the result is a numpy.ndarray after the conversion, but how can I show it correctly?
Here is the code I use to get my dataframe result2:
df4 = indexed.groupBy('uuid').pivot('name').sum('fre')
df4 = df4.fillna(0)
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans
assembler = VectorAssembler(
    inputCols=df4.columns[1:],
    outputCol="features")
dataset = assembler.transform(df4)
bk = BisectingKMeans(k=8, seed=2, featuresCol="features")
result2 = bk.fit(dataset).transform(dataset)
Here is what indexed looks like:
+------------------+------------+---------+-------------+------------+----------+--------+----+
| uuid| category| code| servertime| cat| fre|catIndex|name|
+------------------+------------+---------+-------------+------------+----------+--------+----+
| 351667085527886| 398| null|1503084585000| 398|0.37951264| 2.0| a2|
| 352279079643619| 403| null|1503105476000| 403| 0.3938634| 3.0| a3|
| 352279071621894| 398| null|1503085396000| 398|0.38005984| 2.0| a2|
| 357653074851887| 398| null|1503085552000| 398| 0.3801652| 2.0| a2|
| 354287077780760| 407| null|1503085603000| 407|0.38019964| 5.0| a5|
|0_8f394ebf3f67597c| 403| null|1503084183000| 403|0.37924168| 3.0| a3|
| 353528084062994| 403| null|1503084234000| 403|0.37927604| 3.0| a3|
| 356626072993852| 100000504|100000504|1503104781000| 100000504| 0.3933774| 0.0| a0|
| 351667081062615| 100000448| 398|1503083901000| 398|0.37905172| 2.0| a2|
| 354330089551058|1.00000444E8| null|1503084004000|1.00000444E8|0.37912107| 34.0| a34|
+------------------+------------+---------+-------------+------------+----------+--------+----+
In result2, I have some columns of type double, and I use VectorAssembler to assemble those double columns into a vector column features, which is the column that I want to convert to an array.
NumPy types are not supported as return values for UserDefinedFunctions. You have to convert the output to a standard Python list:
udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))
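To see why the .tolist() call matters, here is a minimal sketch that uses only NumPy, with no Spark session. The array arr is a hypothetical stand-in for what toArray() returns on a vector (a numpy.ndarray of float64 values); the names arr, as_numpy and as_python are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the result of vector.toArray():
# a NumPy array whose elements are numpy.float64.
arr = np.array([0.0, 0.38, 1.5])

# list(arr) only changes the container; the elements stay NumPy scalars.
as_numpy = list(arr)

# tolist() also converts every element to a built-in Python float.
as_python = arr.tolist()

print(type(as_numpy[0]).__name__)   # float64
print(type(as_python[0]).__name__)  # float
```

The values are unchanged; only the element type differs, and built-in floats are what an ArrayType(DoubleType()) return type expects from a UDF. If you are on Spark 3.0 or later, pyspark.ml.functions.vector_to_array may be a simpler option than a UDF, but I have not tested it against this dataframe.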