
How to access elements in a Row RDD in Scala

My row RDD looks like this:

Array[org.apache.spark.sql.Row] = Array([1,[example1,WrappedArray([Standford,Organisation,NNP], [is,O,VP], [good,LOCATION,ADP])]])

I got this by converting a DataFrame to an RDD. The DataFrame schema was:

root
 |-- article_id: long (nullable = true)
 |-- sentence: struct (nullable = true)
 |    |-- sentence: string (nullable = true)
 |    |-- attributes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- tokens: string (nullable = true)
 |    |    |    |-- ner: string (nullable = true)
 |    |    |    |-- pos: string (nullable = true)

Now, how do I access elements in the row RDD? With the DataFrame I can use df.select("sentence"). I would like to access values such as Standford and the other nested elements.

Aayush Rampal asked Aug 18 '16 05:08


1 Answer

As @SarveshKumarSingh wrote in a comment, you can access the rows in an RDD[Row] just as you would access the elements of any other RDD. The fields within a row can be read in a few ways. You can call get with the zero-based field index and cast the result (the index and MyType below are placeholders):

rowRDD.map(row => row.get(2).asInstanceOf[MyType])

or, if the field has a built-in type, you can use a typed getter and avoid the cast:

rowRDD.map(row => row.getList(4))

or you can use pattern matching (a row that does not match the pattern throws a MatchError):

rowRDD.map{case Row(field1: Long, field2: MyType) => field2}
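
Applied to the schema in the question, the getters might look something like the sketch below. The object name RowAccess, the helper tokensOf and the hand-built sample row are assumptions made for illustration (in practice the rows would come from df.rdd); getStruct, getSeq and getString are standard methods on Row.

```scala
import org.apache.spark.sql.Row

object RowAccess {
  // Hand-built row mirroring the question's schema, so no SparkSession is needed here.
  val sample: Row = Row(
    1L,
    Row("example1", Seq(
      Row("Standford", "Organisation", "NNP"),
      Row("is", "O", "VP"),
      Row("good", "LOCATION", "ADP"))))

  // Hypothetical helper: returns the `tokens` field of every attribute struct.
  def tokensOf(row: Row): Seq[String] = {
    val sentence = row.getStruct(1)           // top-level field 1: struct -> Row
    val attributes = sentence.getSeq[Row](1)  // struct field 1: array<struct> -> Seq[Row]
    attributes.map(_.getString(0))            // element field 0: tokens
  }

  def main(args: Array[String]): Unit =
    println(tokensOf(sample).mkString(", "))
}
```

With the RDD from the question, this could be used as rowRDD.map(RowAccess.tokensOf). Swapping getString(0) for getString(1) or getString(2) would give the ner or pos values instead.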

I hope this helps :)

Glennie Helles Sindholt answered Nov 15 '22 03:11
