I have the following DataFrame:
val transactions_with_counts = sqlContext.sql(
  """SELECT user_id AS user_id, category_id AS category_id, COUNT(category_id)
     FROM transactions
     GROUP BY user_id, category_id""")
I'm trying to convert the rows to Rating objects, but since x(0) returns Any this fails:
val ratings = transactions_with_counts
  .map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))
error: value toInt is not a member of Any
Let's start with some dummy data:
val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

transactions_with_counts.printSchema
// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)
There are a few ways to access Row values while keeping the expected types:
1. Pattern matching:
import org.apache.spark.sql.Row

transactions_with_counts.map {
  case Row(user_id: Int, category_id: Int, rating: Long) =>
    Rating(user_id, category_id, rating)
}
2. Typed get* methods like getInt, getLong:
transactions_with_counts.map(
  r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
)
3. The getAs method, which can use both names and indices:
transactions_with_counts.map(r => Rating(
  r.getAs[Int]("user_id"),
  r.getAs[Int]("category_id"),
  r.getAs[Long](2)
))
It can also be used to properly extract user-defined types, including mllib.linalg.Vector. Obviously, accessing by name requires a schema.
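For example, here is a minimal sketch of pulling a vector-typed column back out of a Row with getAs (the points data and the features column name are made up for illustration; the toDF implicits are assumed to be imported, as above):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical DataFrame with a vector-typed "features" column
val points = Seq(
  (1, Vectors.dense(1.0, 2.0)),
  (2, Vectors.dense(3.0, 4.0))
).toDF("id", "features")

// getAs[Vector] recovers the UDT instead of a plain Any
points.rdd.map(r => r.getAs[Vector]("features")).collect()
// Array([1.0,2.0], [3.0,4.0])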
4. Converting to a statically typed Dataset (Spark 1.6+ / 2.0+):
transactions_with_counts.as[(Int, Int, Long)]
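From there, the tuples can be mapped straight into Rating objects with compile-time type checking. A minimal sketch, assuming spark.implicits._ is imported and a suitable Rating case class is in scope:

// Each element is a typed (Int, Int, Long) tuple,
// so no casting from Any is needed
val ratings = transactions_with_counts
  .as[(Int, Int, Long)]
  .map { case (user, category, count) => Rating(user, category, count) }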
Using Datasets, you can define Rating as follows:
case class Rating(user_id: Int, category_id: Int, count: Long)
Note that the Rating class here uses the field name count instead of the rating that zero323 suggested, so it matches the column name that Spark generates. The rating value is then assigned as follows:
val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count

val rating = transactions_with_counts.as[Rating]
This way you will not run into runtime errors in Spark, because your Rating class field name is identical to the count column name that Spark generates at run time.
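To see why the field name matters, here is a sketch of the failure you would hit with a mismatched name (BadRating is a hypothetical class for illustration, and the exact error message varies across Spark versions):

// If the case class used 'rating' instead of 'count',
// as[...] could not resolve the column by name:
case class BadRating(user_id: Int, category_id: Int, rating: Long)
transactions_with_counts.as[BadRating]
// org.apache.spark.sql.AnalysisException:
// cannot resolve '`rating`' given input columns: [user_id, category_id, count]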