I'm trying to get a feel for what is going on inside Spark, and here's my current confusion. I'm trying to read the first 200 rows from an Oracle table into Spark:
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> "schema.table",
    "fetchSize" -> "5000",
    "partitionColumn" -> "my_row_id",
    "numPartitions" -> "16",
    "lowerBound" -> "0",
    "upperBound" -> "9999999"
  )).load()
jdbcDF.limit(200).count()
I would expect this to be fairly quick. A similar action on a table with 500K rows completes in a reasonable time. In this particular case the table is much bigger (hundreds of millions of rows), but I'd think limit(200) would make it fast? How do I go about figuring out where it is spending its time?
As a matter of fact, Spark isn't yet capable of pushing the limit down to the JDBC source.
So what's actually happening in this scenario is that it pulls all the data into Spark and only then applies the limit and the count. What you need is to apply the limit in a subquery passed as the table argument. Note that Oracle doesn't support the LIMIT keyword, so use ROWNUM instead (or FETCH FIRST on Oracle 12c and later).
e.g.:
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> "(select * from schema.table where rownum <= 200) t",
    "fetchSize" -> "5000",
    "partitionColumn" -> "my_row_id",
    "numPartitions" -> "16",
    "lowerBound" -> "0",
    "upperBound" -> "9999999"
  )).load()
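If your database is Oracle 12c or later, the ANSI row-limiting clause is an alternative to ROWNUM. A minimal sketch under that assumption (the jdbcDF12c name is just for illustration):

// FETCH FIRST requires Oracle 12c+; on older versions use the ROWNUM form above.
val jdbcDF12c = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> "(select * from schema.table fetch first 200 rows only) t",
    "fetchSize" -> "5000"
  )).load()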
So, in your original snippet, the time is mostly spent pulling all the data into Spark.
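You can check this yourself on the original DataFrame: explain() prints the physical plan, and the query sent to Oracle should contain no limit, so the database ends up scanning the whole table. A minimal check, assuming the jdbcDF from the question:

// Inspect the physical plan of the slow version: the JDBC scan carries no limit,
// so the limit is only applied after the rows reach Spark.
jdbcDF.limit(200).explain()
// The Spark UI (default http://<driver-host>:4040) is the other place to look:
// the stage running the JDBC reads is where the time is spent.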
You can also pass the limit dynamically in the subquery:
val n: Int = ???
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> s"(select * from schema.table where rownum <= $n) t",
    "fetchSize" -> "5000",
    "partitionColumn" -> "my_row_id",
    "numPartitions" -> "16",
    "lowerBound" -> "0",
    "upperBound" -> "9999999"
  )).load()
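One caveat as a judgement call on my part: when you only want a few hundred rows, the partitioning options don't buy you much, and each of the 16 partitions will re-run the ROWNUM-limited subquery with its own bounds on my_row_id. A single-connection sketch, assuming the same connection details:

// Single-connection read of just n rows; no partitionColumn/numPartitions needed
// when the result set is this small.
val sample = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
    "dbtable" -> s"(select * from schema.table where rownum <= $n) t",
    "fetchSize" -> "5000"
  )).load()
sample.count()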
There is a JIRA ticket (SPARK-10899) in progress to solve this issue, but it has been hanging for almost a year.
EDIT: That JIRA ticket was flagged as a duplicate, so you can continue tracking the issue here: SPARK-12126. I hope this answers your question.