How to split column of vectors into two columns?

I use PySpark.

Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.

I've tried the following:

output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))

but I get the error that 'col should be Column'.

Any suggestions on how to transform a column of vectors into columns of its values?

asked May 18 '16 23:05 by Petrichor


1 Answer

I figured out the problem with the suggestion above. In PySpark, "dense vectors are simply represented as NumPy array objects", so the issue is a mismatch between Python and NumPy types: indexing the vector yields a numpy.float64 rather than a Python float. Adding .item() converts the numpy.float64 to a Python float.

The following code works:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())

output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))

Or to append these columns to the original dataframe:

randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
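
For reuse, the same idea can be wrapped in small helpers. This is only a sketch under a few assumptions: the names vector_element, add_probability_columns and add_probability_columns_builtin are made up for illustration, DoubleType is swapped in for FloatType because a Python float is double precision, and the second helper relies on pyspark.ml.functions.vector_to_array, which is only available in Spark 3.0 and later.

```python
def vector_element(index):
    """Build a function that returns element `index` of a vector as a Python float."""
    def extract(vector):
        if vector is None:
            return None
        # float() turns a numpy.float64 into a plain Python float,
        # which is the same job .item() does in the answer above.
        return float(vector[index])
    return extract


def add_probability_columns(df, vector_col="probability", names=("prob1", "prob2")):
    """Append one double column per vector element using Python UDFs."""
    # Imported here so vector_element can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    for i, name in enumerate(names):
        element_udf = udf(vector_element(i), DoubleType())
        df = df.withColumn(name, element_udf(vector_col))
    return df


def add_probability_columns_builtin(df, vector_col="probability", names=("prob1", "prob2")):
    """Same result without a Python UDF (assumes Spark 3.0 or later)."""
    from pyspark.ml.functions import vector_to_array

    # Convert the ML vector to an array column once, then index into it.
    array_col = vector_to_array(df[vector_col])
    for i, name in enumerate(names):
        df = df.withColumn(name, array_col[i])
    return df
```

With either helper, output2 = add_probability_columns(randomforestoutput) should give the original DataFrame plus the prob1 and prob2 columns asked for in the question.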
answered Nov 15 '22 09:11 by Petrichor