
How to implement a Spark SQL pagination query

Does anyone know how to do pagination in a Spark SQL query?

I need to use Spark SQL but don't know how to do pagination.

I tried:

select * from person limit 10, 10
asked Mar 24 '15 by simafengyun

People also ask

Can we use SQL queries directly in Spark?

Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. It is usable from Java, Scala, Python, and R, and you can apply functions to the results of SQL queries.
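For instance, here is a minimal sketch of running SQL against a DataFrame registered as a view (the data and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
df.createOrReplaceTempView('person')   # expose the DataFrame to SQL as a view
spark.sql('SELECT name FROM person WHERE id > 1').show()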

How is pagination implemented in database?

Pagination is a strategy employed when querying any dataset that holds more than just a few hundred records. Thanks to pagination, we can split our large dataset into chunks (or pages) that we can gradually fetch and display to the user, thus reducing the load on the database.
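As a toy illustration of the offset arithmetic behind paging (plain Python, hypothetical names):

page_size = 100

def page_bounds(page_number):
    # 0-based page number -> half-open [offset, offset + page_size) row range
    offset = page_number * page_size
    return offset, offset + page_size

# e.g. page_bounds(3) == (300, 400): fetch only rows 300-399 for page 3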


2 Answers

It has been six years, so I don't know if this was possible back then.

I would add a sequential id to the rows and filter for the records between offset and offset + limit.

In pure Spark SQL, it would be something like this for offset 10 and limit 10:

WITH count_person AS (
    SELECT *, monotonically_increasing_id() AS count FROM person)
SELECT * FROM count_person WHERE count >= 10 AND count < 20
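One caveat: monotonically_increasing_id() is only guaranteed to be increasing, not consecutive, so the generated ids can have gaps and the range above may not return exactly ten rows. If exact page boundaries matter, row_number() over an explicit ordering gives gap-free 1-based ids. A minimal sketch, assuming the spark session and person view from above, with a hypothetical id ordering column:

page = spark.sql("""
    WITH numbered AS (
        SELECT *, row_number() OVER (ORDER BY id) AS rn FROM person)
    SELECT * FROM numbered WHERE rn > 10 AND rn <= 20  -- rn is 1-based
""")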

The monotonically_increasing_id() version in PySpark is very similar:

import pyspark.sql.functions as F

offset = 10
limit = 10
df = df.withColumn('_id', F.monotonically_increasing_id())
df = df.where(F.col('_id').between(offset, offset + limit - 1))  # between() is inclusive on both ends

It's flexible and fast enough even for a large volume of data.
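If you need deterministic pages in the DataFrame API, the row_number() idea from above translates directly via a window function. A rough sketch, where the ordering column id is a placeholder for whatever defines row order in your data:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.orderBy('id')   # note: a global window collapses the data to a single partition
df_page = (df.withColumn('_rn', F.row_number().over(w))
             .where(F.col('_rn').between(offset + 1, offset + limit))  # _rn is 1-based
             .drop('_rn'))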

answered Sep 18 '22 by Khanis Rok

karthik's answer will fail if there are duplicate rows in the DataFrame, because except will remove all rows in df1 which are also in df2.

val start = 10
val end = 20
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= start && i < end => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
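For reference, a rough PySpark equivalent of the same zipWithIndex approach (assuming an active SparkSession named spark):

start, end = 10, 20
indexed = df.rdd.zipWithIndex()                           # (Row, index) pairs
rows = indexed.filter(lambda p: start <= p[1] < end).map(lambda p: p[0])
new_df = spark.createDataFrame(rows, df.schema)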
answered Sep 22 '22 by Himaprasoon