I am trying to do the following:
+-----+-------------------------+----------+-------------------------------------------+
|label|features |prediction|probability |
+-----+-------------------------+----------+-------------------------------------------+
|0.0 |(3,[],[]) |0 |[0.9999999999999979,2.093996169658831E-15] |
|1.0 |(3,[0,1,2],[0.1,0.1,0.1])|0 |[0.999999999999999,9.891337521299582E-16] |
|2.0 |(3,[0,1,2],[0.2,0.2,0.2])|0 |[0.9999999999999979,2.0939961696578572E-15]|
|3.0 |(3,[0,1,2],[9.0,9.0,9.0])|1 |[2.093996169659668E-15,0.9999999999999979] |
|4.0 |(3,[0,1,2],[9.1,9.1,9.1])|1 |[9.89133752128275E-16,0.999999999999999] |
|5.0 |(3,[0,1,2],[9.2,9.2,9.2])|1 |[2.0939961696605603E-15,0.9999999999999979]|
+-----+-------------------------+----------+-------------------------------------------+
Convert the above dataframe to have two more columns, prob1 and prob2, each holding the corresponding element of the probability column.
I found similar questions - one in PySpark and the other in Scala. I do not know how to translate the PySpark code and I am getting an error with the Scala code.
PySpark Code:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())
output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))
Or to append these columns to the original dataframe:
randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
Scala Code:
import org.apache.spark.sql.functions.udf
val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
model.transform(testDf).select(getPOne($"probability"))
I get the following error when I run the Scala code:
scala> predictions.select(getPOne(col("probability"))).show(false)
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(probability)' due to data type mismatch: argument 1 requires vector type, however, '`probability`' is of vector type.;;
'Project [UDF(probability#39) AS UDF(probability)#135]
+- Project [label#0, features#1, prediction#34, UDF(features#1) AS probability#39]
+- Project [label#0, features#1, UDF(features#1) AS prediction#34]
+- Relation[label#0,features#1] libsvm
I am currently using Scala 2.11.11 and Spark 2.1.1
What I understand from your question is that you are trying to split the probability column into two columns, prob1 and prob2. If that's the case, then simple array indexing with withColumn should solve your issue.
predictions
.withColumn("prob1", $"probability"(0))
.withColumn("prob2", $"probability"(1))
.drop("probability")
The Spark SQL functions API documents many more functions that can be applied to dataframes and may help you in the future.
Edited
I created a temp dataframe to match your column, as
val predictions = Seq(Array(1.0,2.0), Array(2.0939961696605603E-15,0.9999999999999979), Array(Double.NaN,Double.NaN)).toDF("probability")
+--------------------------------------------+
|probability |
+--------------------------------------------+
|[1.0, 2.0] |
|[2.0939961696605603E-15, 0.9999999999999979]|
|[NaN, NaN] |
+--------------------------------------------+
And applying the withColumn calls above resulted in
+----------------------+------------------+
|prob1 |prob2 |
+----------------------+------------------+
|1.0 |2.0 |
|2.0939961696605603E-15|0.9999999999999979|
|NaN |NaN |
+----------------------+------------------+
Schema mismatch Edit
Since the probability column in your case has the Vector schema (VectorUDT), not the ArrayType schema assumed above, the indexing solution shall not work in your condition. Please use the following solution instead.
You will have to create udf functions that accept the ml Vector type and return the element you need. Note the import: in Spark 2.x an ML pipeline produces a probability column of type org.apache.spark.ml.linalg.Vector, and passing it to a udf written against org.apache.spark.mllib.linalg.Vector (as in the code you found) produces exactly the confusing "requires vector type, however ... is of vector type" error you saw.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

val first = udf((v: Vector) => v.toArray(0))
val second = udf((v: Vector) => v.toArray(1))
predictions
.withColumn("prob1", first($"probability"))
.withColumn("prob2", second($"probability"))
.drop("probability")
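If you prefer a single udf call instead of one per element, a minimal alternative sketch (assuming the same predictions DataFrame with an ml Vector probability column, and a temporary column name probArr chosen here for illustration) is to convert the vector to an array once and then use plain array indexing:

```scala
import org.apache.spark.ml.linalg.Vector   // ml, not mllib
import org.apache.spark.sql.functions.udf

// Convert the ml Vector to a plain array in one pass
val vecToArray = udf((v: Vector) => v.toArray)

val withProbs = predictions
  .withColumn("probArr", vecToArray($"probability"))
  .withColumn("prob1", $"probArr"(0))   // first class probability
  .withColumn("prob2", $"probArr"(1))   // second class probability
  .drop("probability", "probArr")
```

This invokes the udf once per row rather than twice, though for two elements the difference is negligible.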
I hope you get the desired result.