Extract information from a `org.apache.spark.sql.Row`

I have an Array[org.apache.spark.sql.Row] returned by sqc.sql(sqlcmd).collect():

Array([10479,6,10], [8975,149,640], ...)

I can get the individual values:

scala> pixels(0)(0)
res34: Any = 10479

but they are Any, not Int.

How do I extract them as Int?

The most obvious solution did not work:

scala> pixels(0).getInt(0)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

PS. I can do pixels(0)(0).toString.toInt or pixels(0).getString(0).toInt, but they feel wrong...

asked Jan 20 '15 by sds

2 Answers

Using getInt should work. Here is a contrived example as a proof of concept:

import org.apache.spark.sql._
sc.parallelize(Array(1,2,3)).map(Row(_)).collect()(0).getInt(0)

This returns 1.

However,

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getInt(0)

fails. So it looks like your data is coming in as a string, and you will have to convert it to an int manually:

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getString(0).toInt

The documentation for getInt states:

Returns the value of column i as an int. This function will throw an exception if the value at i is not an integer, or if it is null.

So it seems it will not try to cast for you.
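
If you cannot be sure which type a column arrived as, one defensive option is a pattern match on the untyped value. This is a minimal sketch, assuming pixels is the Array[Row] from the question:

// Handle both possibilities: the value is already an Int, or a numeric String.
val first: Int = pixels(0)(0) match {
  case i: Int    => i          // column really held an integer
  case s: String => s.toInt    // column arrived as a string
  case other     => sys.error(s"unexpected value: $other")
}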

answered Sep 25 '22 by Justin Pihony


The Row class (also see https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package) has methods getInt(i: Int), getDouble(i: Int) etc.

Also note that a SchemaRDD is an RDD[Row] plus a schema that tells you which column has which data type. If you call .collect(), you only get an Array[Row], which does not carry that information. So unless you know for sure what your data looks like, get the schema from the SchemaRDD, then collect the rows, and access each field using the correct type information, as sketched below.
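
A minimal sketch of that workflow against the Spark 1.1-era API, assuming sqc is a SQLContext and sqlcmd is the query from the question:

import org.apache.spark.sql._

val schemaRdd = sqc.sql(sqlcmd)  // a SchemaRDD: RDD[Row] plus a schema
val schema = schemaRdd.schema    // StructType describing each column

// Inspect which column holds which type before collecting.
schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))

val rows = schemaRdd.collect()   // Array[Row]; the schema is not carried along

// Pick the accessor that matches the reported type of column 0.
val first: Int = schema.fields(0).dataType match {
  case IntegerType => rows(0).getInt(0)
  case StringType  => rows(0).getString(0).toInt
  case other       => sys.error(s"unexpected column type: $other")
}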

answered Sep 25 '22 by tgpfeiffer