
How to implement a Spark SQL pagination query

Does anyone know how to do pagination in a Spark SQL query?

I need to use Spark SQL but don't know how to do pagination.

I tried:

select * from person limit 10, 10
asked Mar 24 '15 by simafengyun

People also ask

Can we use SQL queries directly in Spark?

Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. It is usable from Java, Scala, Python, and R, and you can apply functions to the results of SQL queries.
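For instance, here is a minimal sketch of running SQL against a DataFrame registered as a view (the data and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], ['id', 'name'])
df.createOrReplaceTempView('person')   # expose the DataFrame to SQL as a view
spark.sql('SELECT name FROM person WHERE id > 1').show()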

How is pagination implemented in database?

Pagination is a strategy employed when querying any dataset that holds more than just a few hundred records. Thanks to pagination, we can split our large dataset into chunks (or pages) that we can gradually fetch and display to the user, thus reducing the load on the database.
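As a toy illustration of the offset arithmetic behind paging (plain Python, hypothetical names):

page_size = 100

def page_bounds(page_number):
    # 0-based page number -> half-open [offset, offset + page_size) row range
    offset = page_number * page_size
    return offset, offset + page_size

# e.g. page_bounds(3) == (300, 400): fetch only rows 300-399 for page 3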


2 Answers

It has been six years, so I don't know if this was possible back then.

I would add a sequential id to the rows and filter for the records between offset and offset + limit.

In pure Spark SQL, it would be something like this for offset 10 and limit 10:

WITH count_person AS (
    SELECT *, monotonically_increasing_id() AS count FROM person)
SELECT * FROM count_person WHERE count >= 10 AND count < 20
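One caveat: monotonically_increasing_id() is only guaranteed to be increasing, not consecutive, so the generated ids can have gaps and the range above may not return exactly ten rows. If exact page boundaries matter, row_number() over an explicit ordering gives gap-free 1-based ids. A minimal sketch, assuming the spark session and person view from above, with a hypothetical id ordering column:

page = spark.sql("""
    WITH numbered AS (
        SELECT *, row_number() OVER (ORDER BY id) AS rn FROM person)
    SELECT * FROM numbered WHERE rn > 10 AND rn <= 20  -- rn is 1-based
""")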

The monotonically_increasing_id() version in PySpark is very similar:

import pyspark.sql.functions as F

offset = 10
limit = 10
df = df.withColumn('_id', F.monotonically_increasing_id())
df = df.where(F.col('_id').between(offset, offset + limit - 1))  # between() is inclusive on both ends

It's flexible and fast enough even for a large volume of data.
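If you need deterministic pages in the DataFrame API, the row_number() idea from above translates directly via a window function. A rough sketch, where the ordering column id is a placeholder for whatever defines row order in your data:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.orderBy('id')   # note: a global window collapses the data to a single partition
df_page = (df.withColumn('_rn', F.row_number().over(w))
             .where(F.col('_rn').between(offset + 1, offset + limit))  # _rn is 1-based
             .drop('_rn'))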

answered Sep 18 '22 by Khanis Rok

karthik's answer will fail if there are duplicate rows in the DataFrame, because except will remove all rows in df1 which are also in df2.

val start = 10
val end = 20
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= start && i < end => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
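For reference, a rough PySpark equivalent of the same zipWithIndex approach (assuming an active SparkSession named spark):

start, end = 10, 20
indexed = df.rdd.zipWithIndex()                           # (Row, index) pairs
rows = indexed.filter(lambda p: start <= p[1] < end).map(lambda p: p[0])
new_df = spark.createDataFrame(rows, df.schema)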
answered Sep 22 '22 by Himaprasoon