spark jdbc df limit... what is it doing?

I'm trying to get a feel for what is going on inside Spark, and here's my current confusion. I'm trying to read the first 200 rows from an Oracle table into Spark:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "schema.table",
  "fetchSize" -> "5000",
  "partitionColumn" -> "my_row_id",
  "numPartitions" -> "16",
  "lowerBound" -> "0",
  "upperBound" -> "9999999"
  )).load()

jdbcDF.limit(200).count()

This, I would expect, would be fairly quick. A similar action on a table with 500K rows completes in a reasonable time. In this particular case the table is much bigger (hundreds of millions of rows), but wouldn't limit(200) make it fast? How do I go about figuring out where it's spending its time?
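One way to see where the time goes, before triggering the action, is to ask Spark for the query plan: if the limit shows up as a separate `CollectLimit`/`GlobalLimit` node sitting *above* the JDBC scan rather than inside the query sent to the database, Spark is reading the whole table first. A minimal sketch, using a local `range` DataFrame as a stand-in for the JDBC one (the shape of the plan is what matters, not the source):

```scala
import org.apache.spark.sql.SparkSession

object ExplainLimit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explain-limit")
      .getOrCreate()

    // Stand-in for the JDBC DataFrame; the plan structure is analogous.
    val df = spark.range(0, 1000000)

    // explain(true) prints the parsed, analyzed, optimized, and physical plans.
    // For the JDBC case, check whether the limit appears inside the scan's
    // pushed query, or as a separate limit node above the full table scan.
    df.limit(200).explain(true)

    spark.stop()
  }
}
```

You can also watch the job in the Spark UI: a limit that isn't pushed down shows up as long-running JDBC scan tasks (one per partition) feeding a tiny local limit at the end.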

MK. asked Oct 03 '16 16:10

1 Answer

As a matter of fact, Spark isn't yet capable of pushing the limit down to the JDBC source.

So what's actually happening in this scenario is that Spark pulls all the data into the cluster and only then applies the limit and count. What you need is to put the limit in a subquery and pass that as the table argument.

e.g.:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "(select * from schema.table limit 200) as t",
  "fetchSize" -> "5000",
  "partitionColumn" -> "my_row_id",
  "numPartitions" -> "16",
  "lowerBound" -> "0",
  "upperBound" -> "9999999"
  )).load()

So the time is mainly spent pulling all of the data into Spark. With the limit inside the subquery, only those 200 rows ever leave the database (and the partitioning options above aren't really needed for such a small read).

You can also pass the limit dynamically in the subquery:

val n : Int = ???

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:...",
  "dbtable" -> s"(select * from schema.table limit $n) as t",
  "fetchSize" -> "5000",
  "partitionColumn" -> "my_row_id",
  "numPartitions" -> "16",
  "lowerBound" -> "0",
  "upperBound" -> "9999999"
  )).load()

There is a JIRA ticket (SPARK-10899) in progress to solve this issue, but it has been hanging for almost a year.

EDIT: The issue in the JIRA above was flagged as a duplicate; you can continue tracking it at SPARK-12126. I hope this answers your question.

eliasah answered Nov 15 '22 12:11