Does anyone know how to do pagination in a Spark SQL query?
I need to use Spark SQL, but I don't know how to do pagination.
Tried:
select * from person limit 10, 10
It has been 6 years; I don't know if this was possible back then.
I would add a sequential id to the rows and filter for records between offset and offset + limit.
In pure Spark SQL it would look something like this, for offset 10 and limit 10 (the generated ids start at 0):
WITH count_person AS (
    SELECT *, monotonically_increasing_id() AS row_id FROM person)
SELECT * FROM count_person WHERE row_id >= 10 AND row_id < 20
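One caveat: monotonically_increasing_id() guarantees increasing ids but not consecutive ones; the ids jump between partitions, so an id-range filter like the one above can return fewer than limit rows. If exact page boundaries matter, row_number() over a window gives gap-free numbering (starting at 1) at the cost of a global sort. A sketch, where some_column is a placeholder for whatever deterministic ordering your data has:
WITH numbered_person AS (
    SELECT *, row_number() OVER (ORDER BY some_column) AS rn FROM person)
SELECT * FROM numbered_person WHERE rn > 10 AND rn <= 20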
In PySpark the same id-based approach looks very similar:
import pyspark.sql.functions as F

offset = 10
limit = 10
# Tag each row with an increasing id, then keep the requested window.
df = df.withColumn('_id', F.monotonically_increasing_id())
# between() is inclusive on both ends, so the upper bound is offset + limit - 1.
df = df.where(F.col('_id').between(offset, offset + limit - 1))
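The gap-free variant in PySpark uses a window function. This is a sketch: 'id' stands in for a real sort column, and note that an un-partitioned window pulls all rows into a single partition, so it only suits moderate data sizes:
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.orderBy('id')  # replace 'id' with a real, deterministic sort key
# row_number() starts at 1, so the page spans offset + 1 .. offset + limit.
df = df.withColumn('_rn', F.row_number().over(w))
page = df.where(F.col('_rn').between(offset + 1, offset + limit)).drop('_rn')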
It's flexible and fast enough even for a large volume of data.
karthik's answer will fail if there are duplicate rows in the dataframe: 'except' removes all rows in df1 which are in df2, not just one occurrence of each.
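For illustration, a rough sketch of that failure mode in PySpark (subtract() is the PySpark counterpart of Scala's except; the page sizes here are hypothetical):
page1 = df.limit(10)
page2 = df.subtract(page1).limit(10)  # drops every copy of any row that appears in page1
A positional index sidesteps this, since duplicate rows still get distinct indices: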
// indices from zipWithIndex are consecutive, so the range filter is exact
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= 10 && i < 20 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
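The same indexing trick in PySpark would look roughly like this (a sketch, assuming a SparkSession named spark; zipWithIndex yields (row, index) pairs):
rdd = df.rdd.zipWithIndex().filter(lambda pair: 10 <= pair[1] < 20).map(lambda pair: pair[0])
newDf = spark.createDataFrame(rdd, df.schema)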