Converting a vector column in a dataframe back into an array column

I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist| 
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried several variants of the following udf, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a) })

val result = df.withColumn("dist", toInt4(df("dist"))).select("dist")
asked Mar 07 '16 by ulrich


3 Answers

I struggled for a while to get the answer from @ThomasLuechtefeld working, but kept running into this very frustrating error:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features_scaled)' due to data type mismatch: argument 1 requires vector type, however, '`features_scaled`' is of vector type.

It turns out I needed to import DenseVector from the ml package instead of the mllib package.

So this worked for me:

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions._

// Extract the element at the given index from the vector column
val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }

myDataframe.withColumn("clusters_scaled", vectorToColumn(col("features_scaled"), lit(0)))

Yes, the only difference is that first import line. This should really be a comment, but I don't have the reputation. Sorry!
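
If, like the original question, you want the whole vector back as an array of integers rather than a single element, the same ml import is the key piece. A minimal sketch, assuming Spark 2.x ml vectors (vectorToIntArray is just a name I picked):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// Map the entire vector to an integer array in a single udf
val vectorToIntArray = udf { v: Vector => v.toArray.map(_.toInt) }
df.withColumn("dist", vectorToIntArray(col("dist")))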

answered Nov 14 '22 by pwb2103


I think it's easiest to do it by going to the RDD API and then back.

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import sqlContext.implicits._

// The original data.
val input: DataFrame =
  sc.parallelize(1 to 4)
    .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2)))
    .toDF("id", "dist")

// Turn it into an RDD for manipulation.
val inputRDD: RDD[(Double, DenseVector)] =
  input.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))

// Change the DenseVector into an integer array.
val outputRDD: RDD[(Double, Array[Int])] =
  inputRDD.mapValues(_.toArray.map(_.toInt))

// Go back to a DataFrame.
val output = outputRDD.toDF("id", "dist")
output.show

You get:

+---+----+
| id|dist|
+---+----+
|1.0| [2]|
|2.0| [4]|
|3.0| [6]|
|4.0| [8]|
+---+----+
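
One caveat if you run this on Spark 2.x: DataFrame.map now goes through the Dataset API, so drop to the underlying RDD explicitly. A sketch under that assumption, keeping the same names as above:

// Spark 2.x: a DataFrame is a Dataset[Row], so take .rdd before mapping
val inputRDD: RDD[(Double, DenseVector)] =
  input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))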
answered Nov 14 '22 by Daniel Darabos


In Spark 2.0 you can do something like this:

import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.functions.udf

// Pull out the first element of each vector
val vectorHead = udf { x: DenseVector => x(0) }
df.withColumn("firstValue", vectorHead(df("vectorColumn")))
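
For completeness: from Spark 3.0 onwards there is a built-in helper in org.apache.spark.ml.functions that makes the udf unnecessary, assuming the column holds ml vectors:

import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// vector_to_array produces array<double>; cast it to get integers
df.withColumn("dist", vector_to_array(col("dist")).cast("array<int>"))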
answered Nov 14 '22 by Thomas Luechtefeld