I am using Spark 1.3.1 with Hive and have a Row object that is a long series of doubles to be passed to a Vectors.dense constructor. However, when I convert a Row to an array via
SparkDataFrame.map{r => r.toSeq.toArray}
all type information is lost and I get back an Array[Any]. I am unable to convert the elements to Double using
SparkDataFrame.map{ r =>
  val array = r.toSeq.toArray
  array.map(_.toDouble)
} // Fails with: value toDouble is not a member of Any
as does
SparkDataFrame.map{ r =>
  val array = r.toSeq.toArray
  array.map(_.asInstanceOf[Double])
} // Fails with: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
I see that the Row object has an API that supports getting specific elements as a type, via:
SparkDataFrame.map{ r =>
  r.getDouble(5)
}
However, even this fails with java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double.
The only workaround I have found is the following:
SparkDataFrame.map{ r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray)
}
However, this is prohibitively tedious when indexes 5 through 1000 need to be converted to an array of doubles.
Any way around explicitly indexing the row object?
Let's look at your code blocks one by one.
SparkDataFrame.map{ r =>
  val array = r.toSeq.toArray
  val doubleArray = array.map(_.toDouble)
} // Fails with: value toDouble is not a member of Any
map returns the value of the last statement (there is an implied return in every Scala function: the last expression is the return value). Your last statement here is of type Unit (like Void), because assigning to a val does not produce a value. To fix that, take out the assignment (this has the side benefit of being less code to read); a small sketch of the difference follows the corrected snippet.
SparkDataFrame.map{ r =>
  val array = r.toSeq.toArray
  array.map(_.toDouble)
}
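To see the Unit issue in isolation, here is a minimal sketch in plain Scala (outside Spark, names made up for illustration): a block that ends in a val assignment has type Unit, while a block that ends in the expression itself has the type of that expression.

// Minimal sketch: the only difference is whether the last statement is an assignment.
val returnsUnit = (xs: Array[Int]) => { val ys = xs.map(_.toDouble) }   // type: Array[Int] => Unit
val returnsArray = (xs: Array[Int]) => xs.map(_.toDouble)               // type: Array[Int] => Array[Double]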
_.toDouble is not a cast; it is a conversion. You can call it on a String or, in your case, an Integer, and it produces a new value of the target type. Calling .toDouble on an Int is more like Java's widening (double) inputInt conversion (or Double.parseDouble on a String): it builds a new Double rather than reinterpreting the existing object.
_.asInstanceOf[Double] would be a cast: it only succeeds if the value really is a Double at runtime, and it changes the static type rather than the value. But I am not sure you need to cast here; avoid casting if you can.
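A quick sketch of the difference (plain Scala, values chosen just for illustration):

val i: Int = 5
val converted: Double = i.toDouble        // conversion: builds a new Double value, 5.0
val boxed: Any = 5                        // at runtime this is a java.lang.Integer
// boxed.asInstanceOf[Double]             // cast: throws ClassCastException, Integer is not a Double
val alreadyDouble: Any = 5.0
val d: Double = alreadyDouble.asInstanceOf[Double]   // cast succeeds only when the value really is a Double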
Update
So you changed the code to this
SparkDataFrame.map{ r =>
  val array = r.toSeq.toArray
  array.map(_.toDouble)
} // Fails with: value toDouble is not a member of Any
You are calling toDouble on an element of your row, and after toSeq.toArray the elements have the static type Any. Any has no toDouble method, i.e. as far as the compiler is concerned the element is not an Int, a String, or a Long.
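A minimal sketch of what the compiler sees (toy values, not your actual schema): once the elements are typed as Any, only the methods defined on Any are available.

val rowValues: Seq[Any] = Seq(1, 2.0, 3L)   // mixed column values, all typed as Any
val array: Array[Any] = rowValues.toArray
// array.map(_.toDouble)                    // does not compile: value toDouble is not a member of Any
array.map(_.toString)                       // compiles: toString is defined on Any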
If this works
SparkDataFrame.map{ r =>
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble)
  Vectors.dense(doubleArray)
}
but you need to do it for indexes 5 through 1000, why not do:
SparkDataFrame.map{ r =>
  // for needs a yield here; without it the loop returns Unit and there is nothing to turn into an array
  val doubleArray = (for (i <- 5 to 1000) yield r.getInt(i).toDouble).toArray
  Vectors.dense(doubleArray)
}
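The same thing can be written with map over the index range, if you prefer that style. This sketch makes the same assumption as the getInt calls above, namely that columns 5 through 1000 all hold integer values:

SparkDataFrame.map{ r =>
  // Map each column index to its value converted to Double, then collect into an Array[Double].
  val doubleArray = (5 to 1000).map(i => r.getInt(i).toDouble).toArray
  Vectors.dense(doubleArray)
}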