 

Spark DataFrame limit function takes too much time to show

import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
import findspark
from pyspark.sql.functions import countDistinct
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("users mobile related information analysis") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.memory", "3g") \
    .config("spark.driver.maxResultSize", "1g") \
    .config("spark.executor.pyspark.memory", "3g") \
    .enableHiveSupport() \
    .getOrCreate()

handset_info = ora_tmp.select('some_value','some_value','some_value','some_value','some_value','some_value','some_value')

I configured Spark with 3 GB of executor memory and 3 GB of executor PySpark memory. My database has more than 70 million rows. When I call the

 handset_info.show()

method, it shows the top 20 rows in 2-5 seconds. But when I try to run the following code

mobile_info_df = handset_info.limit(30)
mobile_info_df.show()

to show the top 30 rows, it takes too much time (3-4 hours). Is it reasonable for it to take that long? Is there a problem with my configuration? My laptop's configuration is:

  • Core i7 (4 cores) laptop with 8 GB RAM
Taimur Islam asked Feb 10 '19 09:02


3 Answers

Spark copies the parameter you pass to limit() down to each partition, so in your case it tries to read up to 30 rows per partition. My guess is that you happen to have a huge number of partitions (which is not good in any case). Try df.coalesce(1).limit(30).show() and it should run as fast as df.show().
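
A minimal sketch of that suggestion, using the handset_info DataFrame from the question (that it has a huge partition count is an assumption you can check yourself):

# if this prints a large number, the per-partition limit explains the slowdown
print(handset_info.rdd.getNumPartitions())

# collapsing to a single partition first means limit(30) scans only that partition
handset_info.coalesce(1).limit(30).show()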

minhle_r7 answered Nov 14 '22 12:11


Your configuration is fine. The huge difference in duration is caused by the underlying implementation: limit() reads all of the 70 million rows before it creates a DataFrame with 30 rows, whereas show() just takes the first 20 rows of the existing DataFrame and therefore only has to read those 20 rows. If you are only interested in showing 30 rows instead of 20, you can call the show() method with 30 as the parameter:

df.show(30, truncate=False)
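
If you want to see what limit() compiles to, explain() prints the physical plan (a quick diagnostic sketch, not part of the original answer):

# the plan for limit() typically shows a LocalLimit applied on every partition
# feeding a GlobalLimit, on top of the full table scan
handset_info.limit(30).explain()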
cronoik answered Nov 14 '22 10:11


As you've already experienced, limit() on large data has terrible performance. I wanted to share a workaround for anyone else with this problem: if the limit count doesn't have to be exact, use sort() or orderBy() to sort a column, and use filter() to grab the top k% of the rows, as in the sketch below.
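
One way to implement that idea (swapping in approxQuantile for the sort, since it finds a cutoff without fully ordering the data); both df and the "score" column are hypothetical names:

from pyspark.sql import functions as F

# approximate score at the 90th percentile (column name is hypothetical)
cutoff = df.approxQuantile("score", [0.9], 0.01)[0]

# keeps roughly the top 10% of rows; the count is approximate, not exact
top_rows = df.filter(F.col("score") >= cutoff)
top_rows.show()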

piritocle answered Nov 14 '22 12:11