
Get a range of columns of Spark RDD

I now have 300+ columns in my RDD, and I need to dynamically select a range of columns and put them into the LabeledPoint data type. As a newbie to Spark, I am wondering if there is any index-based way to select a range of columns in an RDD, something like temp_data = data[, 101:211] in R. Is there something like val temp_data = data.filter(_.column_index in range(101:211)...?

Any thoughts are welcome and appreciated.

asked Jul 24 '15 by Richard Liu

People also ask

How do I get a list of columns in Spark DataFrame?

You can get all the columns of a Spark DataFrame by using df.columns; it returns the column names as an Array[String].
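For example (df here stands for any DataFrame):

val names: Array[String] = df.columns  // e.g. Array("name", "age", ...)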

How do I select multiple columns in Spark?

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
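For instance, with hypothetical columns name and age:

df.select("name", "age").show()  // new DataFrame with just these two columns, printed to stdout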

How do I get column values in Spark DataFrame?

To convert a Spark DataFrame column to a List, first select() the column you want, then use the map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
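A minimal sketch, assuming a string column called name (a made-up column name):

val values: Array[String] = df.select("name").rdd.map(_.getString(0)).collect()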

Does RDD have columns?

RDD: an RDD can process structured as well as unstructured data, but the user needs to specify the schema of the ingested data; an RDD cannot infer it on its own. DataFrame: the data is organized into named columns.
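A small sketch of the difference, assuming the Spark 1.x sqlContext implicits are in scope (the Person case class is made up):

import sqlContext.implicits._

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))  // schema known only via the case class
val df = people.toDF()  // DataFrame with named columns "name" and "age"
df.select("name").show()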


3 Answers

If it is a DataFrame, then something like this should work:

import org.apache.spark.sql.functions.col

val df = rdd.toDF()
df.select(df.columns.slice(101, 211).map(col): _*)  // slice is end-exclusive: columns at indices 101..210
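As a quick illustration of the same pattern on a made-up four-column DataFrame (assumes the SQL implicits are in scope for toDF, and col from the import above):

val demo = Seq((1, "a", "b", 2)).toDF("c0", "c1", "c2", "c3")
demo.select(demo.columns.slice(1, 3).map(col): _*).show()  // keeps c1 and c2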
answered Sep 30 '22 by Justin Pihony


Assuming you have an RDD of Array or any other Scala collection (e.g., List), you can do something like this:

val data: RDD[Array[Int]] = sc.parallelize(Array(Array(1,2,3), Array(4,5,6)))
val sliced: RDD[Array[Int]] = data.map(_.slice(0, 2))  // slice is end-exclusive: keeps indices 0 and 1 of each row

sliced.collect()
> Array[Array[Int]] = Array(Array(1, 2), Array(4, 5))
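Since the original question mentions LabeledPoint, the same row-slicing pattern can feed MLlib directly. A minimal sketch, assuming an RDD[Array[Double]] named rawData with the label at index 0 (both the name and the label position are assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// slice(101, 211) is end-exclusive, so it keeps the values at indices 101..210
val labeled = rawData.map { row =>
  LabeledPoint(row(0), Vectors.dense(row.slice(101, 211)))
}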
answered Sep 30 '22 by marios


Kind of an old thread, but I recently had to do something similar and searched around. I needed to select all but the last column, in a table with 200+ columns.

Spark 1.4.1
Scala 2.10.4

val df = hiveContext.sql("SELECT * FROM foobar")
val cols = df.columns.slice(0, df.columns.length - 1)  // all column names except the last
val new_df = df.select(cols.head, cols.tail: _*)
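If memory serves, DataFrame.drop(columnName) is also available as of Spark 1.4, so the same result can likely be had as a one-liner (a sketch, not verified on that exact version):

val new_df = df.drop(df.columns.last)  // drop the last column by name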
answered Sep 30 '22 by ForRealHomie