A scala.MatchError (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) is thrown when I try to access the elements of a DataFrame row. The following code counts book pairs, where the count of a pair equals the number of readers who read both books of the pair.
Interestingly, the exception happens only when trainPairs is created as the result of trainDf.join(...). If the same data structure is created inline as:
case class BookPair(book1: Int, book2: Int, cnt: Int, name1: String, name2: String)
val recs = Array(
  BookPair(1, 2, 3, "book1", "book2"),
  BookPair(2, 3, 1, "book2", "book3"),
  BookPair(1, 3, 2, "book1", "book3"),
  BookPair(1, 4, 5, "book1", "book4"),
  BookPair(2, 4, 7, "book2", "book4")
)
then the exception does not happen at all!
The complete code that produces the exception:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, DataFrame}
import org.apache.spark.sql.functions._
object Scratch {
  case class Book(book: Int, reader: Int, name: String)
  val recs = Array(
    Book(book = 1, reader = 30, name = "book1"),
    Book(book = 2, reader = 10, name = "book2"),
    Book(book = 3, reader = 20, name = "book3"),
    Book(book = 1, reader = 20, name = "book1"),
    Book(book = 1, reader = 10, name = "book1"),
    Book(book = 1, reader = 40, name = "book1"),
    Book(book = 2, reader = 40, name = "book2"),
    Book(book = 1, reader = 100, name = "book1"),
    Book(book = 2, reader = 100, name = "book2"),
    Book(book = 3, reader = 100, name = "book3"),
    Book(book = 4, reader = 100, name = "book4"),
    Book(book = 5, reader = 100, name = "book5"),
    Book(book = 4, reader = 500, name = "book4"),
    Book(book = 1, reader = 510, name = "book1"),
    Book(book = 2, reader = 30, name = "book2"))
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("Scratch")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(recs)
    /**
     * Remove readers with many books:
     * count books per reader
     * and select the readers whose book count is > maxBookCnt
     */
    val maxBookCnt = 4
    val readersWithLotsOfBooksRDD = data.map(r => (r.reader, 1)).reduceByKey((x, y) => x + y).filter{ case (_, x) => x > maxBookCnt }
    readersWithLotsOfBooksRDD.collect()
    val readersWithBooksRDD = data.map( r => (r.reader, (r.book, r.name) ))
    readersWithBooksRDD.collect()
    println("*** Records left after removing readers with maxBookCnt > "+maxBookCnt)
    val data2 = readersWithBooksRDD.subtractByKey(readersWithLotsOfBooksRDD)
    data2.foreach(println)
    // *** Prepare train data
    val trainData = data2.map(tuple => tuple match {
      case (reader, v) => Book(reader = reader, book = v._1, name = v._2)
    })
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val trainDf = trainData.toDF()
    println("*** Creating pairs...")
    val trainPairs = trainDf.join(
        trainDf.select($"book" as "r_book", $"reader" as "r_reader", $"name" as "r_name"),
        $"reader" === $"r_reader" and $"book" < $"r_book")
      .groupBy($"book", $"r_book", $"name", $"r_name")
      .agg($"book", $"r_book", count($"reader") as "cnt", $"name", $"r_name")
    trainPairs.registerTempTable("trainPairs")
    println("*** Pairs Schema:")
    trainPairs.printSchema()
    // Order pairs by count
    val pairsSorted = sqlContext.sql("SELECT * FROM trainPairs ORDER BY cnt DESC")
    println("*** Pairs Sorted by Count")
    pairsSorted.show
    // Key pairs by book
    val keyedPairs = trainPairs.rdd.map({case Row(book1: Int, book2: Int, count: Int, name1: String, name2: String)
      => (book1, (book2, count, name1, name2))})
    println("*** keyedPairs:")
    keyedPairs.foreach(println)
  }
}
Any ideas?
Update
zero323 writes that it throws an exception because the schema of trainPairs doesn't match the pattern I've provided, and that the schema looks like this:
root
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- cnt: long (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
OK, but how can I find the complete schema of trainPairs? And why, when I print the trainPairs schema with the command:
trainPairs.printSchema()
do I get only part of this schema:
root
|-- book: integer (nullable = false)
|-- r_book: integer (nullable = false)
|-- cnt: long (nullable = false)
|-- name: string (nullable = true)
|-- r_name: string (nullable = true)
How can I print / find the complete schema of trainPairs?
Besides, a pattern of the form
Row(Int, Int, String, String, Int, Int, Long, String, String)
results in the same scala.MatchError!
As I found out, the exception was caused by the wrong type of the count row field. The printed schema reports cnt as long, so the pattern must declare it as Long and not Int. So instead of:
// Key pairs by book
val keyedPairs = trainPairs.rdd.map({case Row(book1: Int, book2: Int, count: Int, name1: String, name2:String)
=> (book1,(book2, count, name1, name2))})
The correct code should be:
val keyedPairs = trainPairs.rdd.map({case Row(book1: Int, book2: Int, count: Long, name1: String, name2:String)
=> (book1,(book2, count, name1, name2))})
With this change, everything works as expected.
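To see why the Int pattern fails, here is a minimal sketch in plain Scala, without Spark. It assumes that the values of one trainPairs row behave like a Seq[Any] whose third element is a boxed Long, which is what the cnt: long line of the printed schema suggests. The names RowMatchSketch, rowValues, keyWithInt, keyWithLong and runtimeTypes are made up for this illustration and are not part of the original code or of the Spark API.

```scala
import scala.util.Try

object RowMatchSketch {
  // Hypothetical stand-in for the values of one trainPairs row:
  // the third value is a Long, as the schema line "cnt: long" indicates.
  val rowValues: Seq[Any] = Seq(1, 2, 3L, "book1", "book2")

  // Mirrors the failing pattern: a type pattern `count: Int` does not
  // match a boxed Long, so no case applies and a MatchError is thrown.
  def keyWithInt(values: Seq[Any]): (Int, (Int, Int, String, String)) = values match {
    case Seq(book1: Int, book2: Int, count: Int, name1: String, name2: String) =>
      (book1, (book2, count, name1, name2))
  }

  // Mirrors the corrected pattern: `count: Long` matches the boxed Long.
  def keyWithLong(values: Seq[Any]): (Int, (Int, Long, String, String)) = values match {
    case Seq(book1: Int, book2: Int, count: Long, name1: String, name2: String) =>
      (book1, (book2, count, name1, name2))
  }

  // Lists the runtime class of every value, which is one way to check
  // what a type pattern will actually be compared against.
  def runtimeTypes(values: Seq[Any]): Seq[String] =
    values.map(_.getClass.getSimpleName)

  def main(args: Array[String]): Unit = {
    println(runtimeTypes(rowValues))               // runtime classes of the values
    println(Try(keyWithInt(rowValues)).isFailure)  // whether the Int pattern fails
    println(keyWithLong(rowValues))                // result with the Long pattern
  }
}
```

The same reasoning should carry over to the Row(...) pattern in the question, since the type patterns are checked against the runtime class of each field. If the Spark version in use provides it, reading fields by name with row.getAs[Long]("cnt") may be a less fragile alternative to positional matching, although I have not tried it against this exact DataFrame.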