My row RDD looks like this:
Array[org.apache.spark.sql.Row] = Array([1,[example1,WrappedArray([Standford,Organisation,NNP], [is,O,VP], [good,LOCATION,ADP])]])
I got this from converting a dataframe to an RDD; the dataframe schema was:
root
|-- article_id: long (nullable = true)
|-- sentence: struct (nullable = true)
| |-- sentence: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tokens: string (nullable = true)
| | | |-- ner: string (nullable = true)
| | | |-- pos: string (nullable = true)
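(The conversion was presumably something along these lines; df here is an assumed name for the dataframe with that schema:)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// .rdd turns a DataFrame into an RDD of generic Row objects
val rowRDD: RDD[Row] = df.rdd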
Now how do I access elements in the row RDD? In the dataframe I could use df.select("sentence"). I want to access the nested elements, e.g. "Standford" and the other nested fields.
Some background. collect returns an array containing all of the elements in the RDD; it should only be used when the result is expected to be small, as all the data is loaded into the driver's memory. A row in Spark is an ordered collection of fields that can be accessed starting at index 0. The row is a generic object of type Row, and the columns making up the row can be of the same or different types.
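As a small, self-contained illustration of that (the session and values here are hypothetical, not taken from the question's data):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(Row(1L, "example1")))

// collect() brings every element into the driver - fine for small results
val rows: Array[Row] = rdd.collect()

// fields are accessed by position, starting at index 0, and can differ in type
val id: Long = rows(0).getLong(0)
val text: String = rows(0).getString(1)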
As @SarveshKumarSingh wrote in a comment, you can access the rows in an RDD[Row] just as you would access any other element in an RDD. Accessing the elements within a row can then be done in a couple of ways. Either simply call get, like this:
rowRDD.map(row => row.get(1).asInstanceOf[Row]) // index 1 is the "sentence" struct in your schema
or, if it is a built-in type, you can avoid the type cast:
rowRDD.map(row => row.getLong(0)) // index 0 is the "article_id" column
or you might want to simply use pattern matching, like:
rowRDD.map { case Row(articleId: Long, sentence: Row) => sentence }
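Putting those together for your schema, getting at the nested pieces (e.g. "Standford") could look like this. This is a sketch: rowRDD is assumed to be the RDD[Row] from your conversion, and the field positions follow the schema you printed.

import org.apache.spark.sql.Row

val tokens = rowRDD.map { row =>
  val sentenceStruct = row.getStruct(1)           // the "sentence" struct (index 1)
  val attributes = sentenceStruct.getSeq[Row](1)  // its "attributes" array (index 1)
  attributes.map(attr => attr.getString(0))       // the "tokens" field of each struct
}
// For the row in your example, tokens.first() would be Seq("Standford", "is", "good").

You can also fetch fields by name rather than position, e.g. row.getAs[Row]("sentence").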
I hope this helps :)