Apache Spark: get elements of Row by name

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract values by name? I can see how to do some really awkward stuff:

import org.apache.spark.sql.Row

def foo(r: Row) = {
  // Build a name -> index map from the schema, then fetch each field by index
  val ix = (0 until r.schema.length).map(i => r.schema(i).name -> i).toMap
  val field1 = r.getString(ix("field1"))
  val field2 = r.getLong(ix("field2"))
  ...
}
dataframe.map(foo)

I figure there must be a better way. This is pretty verbose, it requires building this extra structure, and it requires knowing the types explicitly; if they're wrong, you get a runtime exception rather than a compile-time error.

asked Jun 05 '15 by Ken Williams

People also ask

What is org.apache.spark.sql.Row?

A Row in Spark is an ordered collection of fields that can be accessed starting at index 0. The row is a generic object of type Row. The columns making up a row can be of the same or different types.
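As a small illustration (the values here are made up, not from the question), fields are positional and can mix types:

import org.apache.spark.sql.Row

val row = Row("alice", 42L)   // an ordered collection of fields
val name = row.getString(0)   // index-based access starts at 0
val count = row.getLong(1)    // fields in one row can have different types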


2 Answers

You can use getAs from org.apache.spark.sql.Row. It takes a type parameter, so the lookup is by name and the cast is explicit:

r.getAs[String]("field1")
r.getAs[Long]("field2")

See the API docs for getAs(java.lang.String fieldName). Note that the cast still happens at runtime: if the type parameter doesn't match the schema, you get a runtime exception rather than a compile-time error.
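For example, the whole helper from the question collapses to this (a minimal sketch; the dataframe variable and the "field1"/"field2" names and types are taken from the question):

import org.apache.spark.sql.Row

def foo(r: Row): (String, Long) = {
  // Look up each field by name; the type parameter drives the cast
  (r.getAs[String]("field1"), r.getAs[Long]("field2"))
}
dataframe.map(foo)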

answered Oct 09 '22 by Kexin Nie


Extracting values by name with compile-time type checking is not supported in the Scala API at this time. The closest thing available is the open JIRA ticket titled "Support converting DataFrames to typed RDDs".
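For a sense of what that typed access looks like, here is a sketch using the Dataset API that later Spark versions (2.x) added; the Record case class and the SparkSession are assumptions for illustration, not part of the original answer:

import org.apache.spark.sql.SparkSession

// Hypothetical case class matching the DataFrame's schema
case class Record(field1: String, field2: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// .as[Record] checks field names against the schema and gives
// typed access to each field, checked at compile time
val typed = dataframe.as[Record]
typed.map(rec => (rec.field1, rec.field2))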

answered Oct 09 '22 by Justin Pihony