My Spark SQL limit is very slow

I use Spark to read from Elasticsearch, with a query like:

select col from index limit 10;

The problem is that the index is very large: it contains 100 billion rows, and Spark generates thousands of tasks to finish the job.
All I need is 10 rows. Even a single task that returns 10 rows would be enough to finish the job, so I don't need that many tasks.
The limit is very slow, even with limit 1.
Code:

sql = "select col from index limit 10"
sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable)

asked Nov 30 '17 02:11

no123ff

1 Answer

The source code of limit shows that it takes the first limit rows from every partition, so the query still has to scan all partitions.
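
To picture why this produces so many tasks, here is a toy model in plain Python (not Spark itself). It assumes the plan is a per-partition limit followed by a global limit; the function names `local_limit` and `global_limit` and the partition sizes are made up for illustration.

```python
from itertools import islice

def local_limit(partition, n):
    """Take at most n rows from one partition (one task per partition)."""
    return list(islice(partition, n))

def global_limit(partitions, n):
    """Apply the local limit to every partition, then keep n rows overall.

    Returns the rows and the number of tasks (partitions) that ran.
    """
    tasks = 0
    collected = []
    for part in partitions:
        tasks += 1
        collected.extend(local_limit(part, n))
    return collected[:n], tasks

# Hypothetical data: 1000 partitions with 50 rows each.
partitions = [range(p * 50, (p + 1) * 50) for p in range(1000)]
rows, tasks = global_limit(partitions, 10)
print(len(rows), tasks)  # 10 rows, but 1000 tasks
```

In this model the number of tasks depends only on the number of partitions, not on the limit, which matches the symptom that even limit 1 is slow.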

To speed up the query, you can filter on a single value of the partition key. Suppose you are using day as the partition key; then the following query will be much faster:

select col from index where day = '2018-07-10' limit 10;
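
The effect of the filter can be sketched with a small plain-Python model (again, not Spark itself). It assumes the data is partitioned by a `day` key and that a filter on that key lets the engine skip every other partition; `limited_scan` and the sample data are hypothetical.

```python
from datetime import date, timedelta
from itertools import islice

def limited_scan(partitions, n, day=None):
    """Return up to n rows and the number of partitions that had to be read.

    partitions maps a day string (the assumed partition key) to its rows.
    When day is given, only that partition is read.
    """
    selected = partitions if day is None else {day: partitions.get(day, [])}
    rows = []
    for part_rows in selected.values():
        rows.extend(islice(part_rows, n))  # per-partition limit
    return rows[:n], len(selected)

# Hypothetical index: one partition per day of 2018, 100 rows each.
start = date(2018, 1, 1)
partitions = {
    (start + timedelta(days=d)).isoformat(): [f"{d}-{i}" for i in range(100)]
    for d in range(365)
}

_, all_parts = limited_scan(partitions, 10)
rows, one_part = limited_scan(partitions, 10, day="2018-07-10")
print(all_parts, one_part, len(rows))
```

Without the filter every partition is touched; with it only one is, while the number of rows returned stays the same.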

answered Oct 20 '22 01:10

secfree

Tags: elasticsearch, apache-spark, apache-spark-sql, spark-submit
