Extract information from a `org.apache.spark.sql.Row`

Tags: scala, apache-spark, apache-spark-sql

I have Array[org.apache.spark.sql.Row] returned by sqc.sql(sqlcmd).collect():

Array([10479,6,10], [8975,149,640], ...)

I can get the individual values:

scala> pixels(0)(0)
res34: Any = 10479

but they are Any, not Int.

How do I extract them as Int?

The most obvious solution did not work:

scala> pixels(0).getInt(0)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Int

PS. I can do pixels(0)(0).toString.toInt or pixels(0).getString(0).toInt, but they feel wrong...

385

asked Jan 20 '15 00:01

sds

2 Answers

Using getInt should work. Here is a contrived example as a proof of concept:

import org.apache.spark.sql._
sc.parallelize(Array(1,2,3)).map(Row(_)).collect()(0).getInt(0)

This returns 1.

However,

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getInt(0)

fails. So it looks like your value is coming in as a string, and you will have to convert it to an int manually:

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getString(0).toInt

The documentation states that getInt:

Returns the value of column i as an int. This function will throw an exception if the value is at i is not an integer, or if it is null.

So it seems that getInt will not try to cast the value for you.
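If you are not sure whether a column arrives as an Int or as a String, one option is to pattern match on the runtime type of the value returned by row(i). The following is only a minimal sketch in plain Scala; anyToInt is a hypothetical helper name, not part of Spark, and the set of handled types is an assumption about what your data may contain.

```scala
import scala.util.Try

// Hypothetical helper (not a Spark API): convert a value of static type Any,
// such as the result of row(i), to an Int when that is safely possible.
def anyToInt(value: Any): Option[Int] = value match {
  case i: Int    => Some(i)                      // already an Int (boxed at runtime)
  case l: Long if l >= Int.MinValue && l <= Int.MaxValue =>
    Some(l.toInt)                                // a Long that fits into an Int
  case s: String => Try(s.trim.toInt).toOption   // a numeric string; None if it does not parse
  case _         => None                         // null or any other type
}
```

With the array from the question, anyToInt(pixels(0)(0)) would then give an Option[Int] instead of throwing, which avoids the toString round trip for values that are already Ints.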

148

answered Sep 25 '22 07:09

Justin Pihony

The Row class (also see https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package) has methods getInt(i: Int), getDouble(i: Int) etc.

Also note that a SchemaRDD is an RDD[Row] plus a schema that tells you which column has which data type. If you do .collect() you will only get an Array[Row] which does not have that information. So unless you know for sure what your data looks like, get the schema from the SchemaRDD, then collect the rows and then access each field using the correct type information.
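The schema-first approach described above could look roughly like the sketch below. It assumes Spark 1.3 or later, where sqc.sql(...) returns a DataFrame (the successor of SchemaRDD) and the type objects live in org.apache.spark.sql.types; on Spark 1.1/1.2 the imports differ. sqc and sqlcmd are the names from the question, and readInt is a hypothetical helper, not a Spark API.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructType}

// Hypothetical helper: read column i as an Int, picking the getter
// that matches the column's declared type in the schema.
// Note: the typed getters throw if the value at i is null.
def readInt(row: Row, schema: StructType, i: Int): Int =
  schema.fields(i).dataType match {
    case IntegerType => row.getInt(i)
    case LongType    => row.getLong(i).toInt    // may overflow for values outside the Int range
    case StringType  => row.getString(i).toInt  // throws NumberFormatException on non-numeric text
    case other       => sys.error(s"Column $i has unsupported type $other")
  }

val result = sqc.sql(sqlcmd)   // sqc and sqlcmd as in the question
val schema = result.schema     // capture the schema before collecting
val pixels = result.collect()  // Array[Row]
val first  = readInt(pixels(0), schema, 0)
```

Printing the schema (for example with result.printSchema()) before collecting is also a quick way to see why getInt failed: a column declared as a string type has to be read with getString and converted.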

answered Sep 25 '22 07:09

tgpfeiffer
