
Get a range of columns of Spark RDD

I now have 300+ columns in my RDD, and I need to dynamically select a range of columns and put them into the LabeledPoint data type. As a newbie to Spark, I am wondering if there is any index-based way to select a range of columns in an RDD, something like temp_data = data[, 101:211] in R. Is there something like val temp_data = data.filter(_.column_index in range(101:211)...?

Any thoughts are welcome and appreciated.

asked Jul 24 '15 by Richard Liu

People also ask

How do I get a list of columns in Spark DataFrame?

You can get all the columns of a Spark DataFrame by using df.columns; it returns the column names as an Array[String].
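For example (df here stands for any DataFrame):

val names: Array[String] = df.columns  // e.g. Array("name", "age", ...)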

How do I select multiple columns in Spark?

You can select single or multiple columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
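For instance, with hypothetical columns name and age:

df.select("name", "age").show()  // new DataFrame with just these two columns, printed to stdout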

How do I get column values in Spark DataFrame?

To convert a Spark DataFrame column to a List, first select() the column you want, then use the map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
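A minimal sketch, assuming a string column called name (a made-up column name):

val values: Array[String] = df.select("name").rdd.map(_.getString(0)).collect()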

Does RDD have columns?

RDD: an RDD can process structured as well as unstructured data, but the user needs to specify the schema of the ingested data; an RDD cannot infer it on its own. DataFrame: the data is organized into named columns.
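A small sketch of the difference, assuming the Spark 1.x sqlContext implicits are in scope (the Person case class is made up):

import sqlContext.implicits._

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))  // schema known only via the case class
val df = people.toDF()  // DataFrame with named columns "name" and "age"
df.select("name").show()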


3 Answers

If it is a DataFrame, then something like this should work:

import org.apache.spark.sql.functions.col

val df = rdd.toDF()
df.select(df.columns.slice(101, 211).map(col): _*)  // slice is end-exclusive: columns at indices 101..210
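As a quick illustration of the same pattern on a made-up four-column DataFrame (assumes the SQL implicits are in scope for toDF, and col from the import above):

val demo = Seq((1, "a", "b", 2)).toDF("c0", "c1", "c2", "c3")
demo.select(demo.columns.slice(1, 3).map(col): _*).show()  // keeps c1 and c2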
answered Sep 30 '22 by Justin Pihony


Assuming you have an RDD of Array or any other Scala collection (e.g., List), you can do something like this:

val data: RDD[Array[Int]] = sc.parallelize(Array(Array(1,2,3), Array(4,5,6)))
val sliced: RDD[Array[Int]] = data.map(_.slice(0, 2))  // slice is end-exclusive: keeps indices 0 and 1 of each row

sliced.collect()
> Array[Array[Int]] = Array(Array(1, 2), Array(4, 5))
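Since the original question mentions LabeledPoint, the same row-slicing pattern can feed MLlib directly. A minimal sketch, assuming an RDD[Array[Double]] named rawData with the label at index 0 (both the name and the label position are assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// slice(101, 211) is end-exclusive, so it keeps the values at indices 101..210
val labeled = rawData.map { row =>
  LabeledPoint(row(0), Vectors.dense(row.slice(101, 211)))
}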
answered Sep 30 '22 by marios


Kind of an old thread, but I recently had to do something similar and searched around. I needed to select all but the last column, in a table with 200+ columns.

Spark 1.4.1
Scala 2.10.4

val df = hiveContext.sql("SELECT * FROM foobar")
val cols = df.columns.slice(0, df.columns.length - 1)  // all column names except the last
val new_df = df.select(cols.head, cols.tail: _*)
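If memory serves, DataFrame.drop(columnName) is also available as of Spark 1.4, so the same result can likely be had as a one-liner (a sketch, not verified on that exact version):

val new_df = df.drop(df.columns.last)  // drop the last column by name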
answered Sep 30 '22 by ForRealHomie