
How to extract a value from a Vector in a column of a Spark Dataframe [duplicate]

When using Spark ML to predict labels, the resulting DataFrame is:

scala> result.show
+-----------+--------------+
|probability|predictedLabel|
+-----------+--------------+
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.0,1.0]|           0.0|
|  [0.1,0.9]|           0.0|
|  [0.6,0.4]|           1.0|
|  [0.6,0.4]|           1.0|
|  [1.0,0.0]|           1.0|
|  [0.9,0.1]|           1.0|
|  [0.9,0.1]|           1.0|
|  [1.0,0.0]|           1.0|
|  [1.0,0.0]|           1.0|
+-----------+--------------+
only showing top 20 rows

I want to create a new DataFrame with an extra column named prob that holds the first value of the Vector in the probability column of the original DataFrame, e.g.:

+-----------+--------------+----------+
|probability|predictedLabel|   prob   |
+-----------+--------------+----------+
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.0,1.0]|           0.0|       0.0|
|  [0.1,0.9]|           0.0|       0.1|
|  [0.6,0.4]|           1.0|       0.6|
|  [0.6,0.4]|           1.0|       0.6|
|  [1.0,0.0]|           1.0|       1.0|
|  [0.9,0.1]|           1.0|       0.9|
|  [0.9,0.1]|           1.0|       0.9|
|  [1.0,0.0]|           1.0|       1.0|
|  [1.0,0.0]|           1.0|       1.0|
+-----------+--------------+----------+

How can I extract this value into a new column?

you zhenghong asked May 02 '17 06:05



1 Answer

You can use the capabilities of Dataset and the wonderful functions library to accomplish what you need:

result.withColumn("prob", $"probability".getItem(0))

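This adds a new column called prob whose value is derived from the probability column by taking the first item (index 0, since indexing is zero-based). Note that getItem is defined for array and map columns, so this works as written only if probability is an array column and not an ML Vector.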

I would mention also that UDFs should be your last resort because the Catalyst optimizer cannot currently optimize UDFs, so you should always prefer the built-in functions to get the most out of Catalyst.
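If probability is an ML Vector column, as the question title suggests, getItem cannot be applied to it directly. The following is a minimal sketch of two ways around that, not a definitive implementation. It assumes Spark 3.0 or later for the built-in vector_to_array function; the small `result` DataFrame and the `firstElement` UDF are hypothetical stand-ins introduced for illustration. The UDF variant is shown only as a fallback for older versions, in line with the advice above to prefer built-in functions.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-first-element")
  .getOrCreate()

// Hypothetical stand-in for the `result` DataFrame from the question.
val result = spark.createDataFrame(Seq(
  (Vectors.dense(0.1, 0.9), 0.0),
  (Vectors.dense(0.6, 0.4), 1.0)
)).toDF("probability", "predictedLabel")

// Built-in route (assumes Spark 3.0+): convert the Vector to array<double>,
// then take the element at index 0 with getItem.
val withProb = result.withColumn(
  "prob",
  vector_to_array(col("probability")).getItem(0)
)

// Fallback for older versions: a UDF that reads the first element.
// Catalyst treats the UDF as a black box, so prefer the built-in route when available.
val firstElement = udf((v: Vector) => v(0))
val withProbUdf = result.withColumn("prob", firstElement(col("probability")))

withProb.show()
```

Both variants add a double-typed prob column; the built-in route keeps the whole expression visible to the optimizer.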

Vidya answered Sep 30 '22 05:09