Convert Spark Row to typed Array of Doubles

I am using Spark 1.3.1 with Hive and have a Row object that is a long series of doubles to be passed to the Vectors.dense constructor. However, when I convert a Row to an array via

SparkDataFrame.map{r => r.toSeq.toArray} 

all type information is lost and I get back an array of type Any. I am unable to cast the elements to Double using

SparkDataFrame.map{r => 
  val array = r.toSeq.toArray 
  array.map(_.toDouble) 
} // Fails with: value toDouble is not a member of Any

as does

SparkDataFrame.map{r => 
      val array = r.toSeq.toArray 
      array.map(_.asInstanceOf[Double]) 
    } // Fails with java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double 

I see that the Row object has an API that supports getting specific elements as a type, via:

SparkDataFrame.map{r => 
  r.getDouble(5)}  

However, even this fails with java.lang.Integer cannot be cast to java.lang.Double.

The only workaround I have found is the following:

 SparkDataFrame.map{r => 
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble) 
  Vectors.dense(doubleArray) } 

However, this is prohibitively tedious when indices 5 through 1000 need to be converted to an array of doubles.

Any way around explicitly indexing the row object?

asked May 20 '15 by user2726995

1 Answer

Let's look at your code blocks one by one.

SparkDataFrame.map{r => 
  val array = r.toSeq.toArray 
  val doubleArray = array.map(_.toDouble) 
} // Fails with: value toDouble is not a member of Any

A function body returns its last statement as its value (i.e. there is a kind of implied return in any Scala function: the last expression is your return value). Here your last statement is of type Unit (like void in Java), because assigning to a val has no value of its own. To fix that, take out the assignment (this has the side benefit of being less code to read):

SparkDataFrame.map{r => 
  val array = r.toSeq.toArray 
  array.map(_.toDouble) 
} 
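If it helps, here is a minimal REPL-style sketch (plain Scala, nothing Spark-specific, names made up for illustration) of why a block that ends in an assignment has type Unit:

val a = { val x = 1.0 }       // a: Unit -- a val definition produces no value
val b = { val x = 1.0; x }    // b: Double = 1.0 -- end the block with the expression you want back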

_.toDouble is not a cast. You can call it on a String or, in your case, an Integer, and it produces a new value of the target type. Calling .toDouble on an Int is more like doing Double.parseDouble on a String: a conversion, not a reinterpretation of the existing object.

_.asInstanceOf[Double] would be a cast, which only succeeds if your data really is a Double already; it does not convert an Integer for you, it just narrows the static type. I am not sure you need to cast here at all, so avoid casting if you can.
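As a plain-Scala sketch of the difference (no Spark involved, values chosen purely for illustration): a Row element behaves like a value typed Any that is a boxed Integer at runtime.

val n: Any = 5                       // statically Any, java.lang.Integer at runtime

5.toDouble                           // 5.0 -- conversion builds a new Double
// n.toDouble                        // does not compile: value toDouble is not a member of Any
// n.asInstanceOf[Double]            // compiles, but throws java.lang.ClassCastException:
//                                   //   java.lang.Integer cannot be cast to java.lang.Double
5.0.asInstanceOf[Double]             // fine only because the value already is a Double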

Update

So you changed the code to this

SparkDataFrame.map{r => 
  val array = r.toSeq.toArray 
  array.map(_.toDouble) 
} // Fails with: value toDouble is not a member of Any

You are calling toDouble on an element of your SparkDataFrame. Once you go through r.toSeq.toArray, each element is statically typed as Any, and Any has no toDouble method, regardless of whether the underlying value is an Int, a String, or a Long.
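One way around that, if you do want to keep r.toSeq (a sketch, assuming every column you touch actually holds a numeric value), is to pattern-match each element back to a concrete type before converting:

import org.apache.spark.mllib.linalg.Vectors

SparkDataFrame.map { r =>
  val doubles = r.toSeq.toArray.map {
    case d: Double => d              // already a Double
    case i: Int    => i.toDouble     // convert the other numeric types explicitly
    case l: Long   => l.toDouble
    case f: Float  => f.toDouble
    case other     => sys.error("non-numeric column value: " + other)
  }
  Vectors.dense(doubles)
}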

If this works

SparkDataFrame.map{r => 
  val doubleArray = Array(r.getInt(5).toDouble, r.getInt(6).toDouble) 
  Vectors.dense(doubleArray) } 

but you need indices 5 through 1000, why not do:

SparkDataFrame.map{r => 
  val doubleArray = (for (i <- 5 to 1000) yield {
      r.getInt(i).toDouble
  }).toArray
  Vectors.dense(doubleArray) 
}
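Equivalently (same assumption that columns 5 through 1000 are all integer columns), you can map over the index range directly:

SparkDataFrame.map { r =>
  // same idea without the for comprehension
  Vectors.dense((5 to 1000).map(i => r.getInt(i).toDouble).toArray)
}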
answered Sep 30 '22 by bwawok