 

Access Array column in Spark

A Spark DataFrame contains a column of type Array[Double]. It throws a ClassCastException when I try to get it back in a map() function. The following Scala code generates the exception.

case class Dummy( x:Array[Double] )
val df = sqlContext.createDataFrame(Seq(Dummy(Array(1,2,3))))
val s = df.map( r => {
   val arr:Array[Double] = r.getAs[Array[Double]]("x")
   arr.sum
})
s.foreach(println)

The exception is

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:23)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Can somebody explain to me why this does not work, and what should I do instead? I am using Spark 1.5.1 and Scala 2.10.6.

Thanks

asked Oct 28 '15 by Boris


People also ask

How do I select a column in spark DataFrame?

You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
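
For example, a minimal Scala sketch (the "y" column is hypothetical and only shown to illustrate selecting multiple columns):

val single = df.select("x")       // new DataFrame containing only the "x" column
val multi = df.select("x", "y")   // multiple columns; assumes a "y" column exists
single.show()                     // display the contents of the resulting DataFrame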

What is ArrayType in spark?

Spark ArrayType is a collection data type that extends the DataType class, the superclass of all Spark SQL types. All elements of an ArrayType column must have the same element type.
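
As a sketch, this is how a schema with an ArrayType column of doubles could be declared (names are illustrative):

import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

val schema = StructType(Seq(
  StructField("x", ArrayType(DoubleType, containsNull = false))
))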

How do you select a column in PySpark DataFrame?

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, or nested columns from a DataFrame. select() is a transformation: it returns a new DataFrame with the selected columns.


1 Answer

ArrayType is represented in a Row as a scala.collection.mutable.WrappedArray. You can extract it using, for example:

val arr: Seq[Double] = r.getAs[Seq[Double]]("x")

or

val i: Int = ???
val arr = r.getSeq[Double](i)

or even:

import scala.collection.mutable.WrappedArray

val arr: WrappedArray[Double] = r.getAs[WrappedArray[Double]]("x")
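
Putting this together with the snippet from the question, a minimal corrected sketch (asking the Row for a Seq instead of an Array) would be:

case class Dummy(x: Array[Double])
val df = sqlContext.createDataFrame(Seq(Dummy(Array(1.0, 2.0, 3.0))))
val s = df.map { r =>
  val arr: Seq[Double] = r.getAs[Seq[Double]]("x")  // WrappedArray is a Seq
  arr.sum
}
s.foreach(println)  // should print 6.0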

If the DataFrame is relatively thin, then pattern matching could be a better approach:

import org.apache.spark.sql.Row

df.rdd.map{case Row(x: Seq[Double]) => (x.toArray, x.sum)}

although you have to keep in mind that the type of the sequence is unchecked.
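
If you want to make that explicit, a small sketch using an @unchecked annotation (this only silences the compiler warning about type erasure; it does not add a runtime check of the element type):

import org.apache.spark.sql.Row

df.rdd.map { case Row(x: Seq[Double @unchecked]) => (x.toArray, x.sum) }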

In Spark >= 1.6 you can also use Dataset as follows:

df.select("x").as[Seq[Double]].rdd
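
A usage sketch of that approach (assumes Spark >= 1.6 and the SQL implicits in scope, e.g. import sqlContext.implicits._):

val sums = df.select("x").as[Seq[Double]].map(_.sum)  // Dataset[Double], one sum per row
sums.collect().foreach(println)                       // should print 6.0 for the example row
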
answered Sep 28 '22 by zero323