Caching ordered Spark DataFrame creates unwanted job

I want to convert an RDD to a DataFrame and cache the results of the RDD:

# Assumes `spark` (SparkSession) and `sc` (SparkContext) already exist, as in the pyspark shell.
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

df = spark.createDataFrame(
    sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4), #.cache(),
    schema=schema,
    verifySchema=False
).orderBy("t") #.cache()

If neither cache() call is used, no job is generated.
If cache() is called only after the orderBy, one job is generated for the cache.
If cache() is called only after the parallelize, no job is generated.

Why does cache() generate a job in this one case? How can I avoid the job that cache() triggers (while caching the DataFrame and not the RDD)?

Edit: I investigated further and found that without the orderBy("t") no job is generated. Why?

asked Mar 22 '17 12:03

R1tschY

1 Answer

I submitted a bug ticket, and it was closed with the following reason:

Caching requires the backing RDD. That requires we also know the backing partitions, and this is somewhat special for a global order: it triggers a job (scan) because we need to determine the partition bounds.
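
To make the "partition bounds" point concrete, here is a toy sketch in plain Python. It is not Spark's implementation, and the helper names range_bounds and partition_for are made up for illustration. It only shows why a global sort has to read some of the data before the partitions can be defined: the split points between range partitions depend on the keys themselves.

```python
import bisect
import random


def range_bounds(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, bounds):
    """Index of the range partition that a key falls into."""
    return bisect.bisect_right(bounds, key)


# Same keys as in the question: t = i / 10 for i in range(1000).
keys = [i / 10 for i in range(1000)]

# The bounds depend on the data, so part of it must be read first.
# This sampling pass is the toy counterpart of the extra job.
rng = random.Random(0)
sample = rng.sample(keys, 100)
bounds = range_bounds(sample, 4)

# Only once the bounds are known can each row be routed to a partition.
partitions = [[] for _ in range(4)]
for k in keys:
    partitions[partition_for(k, bounds)].append(k)
```

Without the orderBy there are no bounds to compute, which is consistent with no job being generated in that case.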

answered Sep 21 '22 18:09

R1tschY

Tags:

python

apache-spark

apache-spark-sql

pyspark

pyspark-sql