Why doesn't the Spark DataFrame cache work here?

I just wrote a toy class to test a Spark DataFrame (actually a Dataset, since I'm using Java).

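import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.lit;

// spark is assumed to be an existing SparkSession with Hive support
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));  // retarget the rows to a new partition
ds.cache();                                   // lazy: only marks ds for caching
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");  // first action
System.out.println(ds.count());                              // second action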

As I understand it, there are two actions here: insertInto and count.

I stepped through the code in a debugger. When insertInto runs, I see several log lines like this:

19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]

When count runs, I still see similar log lines:

19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]

I have two questions:

1) When there are two actions on the same DataFrame, as above, and I don't call ds.cache or ds.persist explicitly, will the second action always re-execute the SQL query?

2) If I read the log correctly, both actions trigger an HDFS file read. Does that mean ds.cache() has no effect here? If so, why not?

Many thanks.

asked Jan 21 '19 13:01

gfytd

1 Answer

This happens because you append to the same table that ds was created from. Since the underlying data has changed, Spark invalidates the cache and ds has to be recomputed. See, for example, this Jira (https://issues.apache.org/jira/browse/SPARK-24596):

When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.

Try running ds.count() before inserting into the table. Since cache() is lazy, this materializes the cache before the table is modified.
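
To make the ordering effect concrete, here is a minimal sketch in plain Java. It is a toy model, not Spark's implementation: the names CacheInvalidationSketch, Table, and CachedQuery are hypothetical, and the model only counts how often the source is scanned. It ignores the dt filter and the columns, so its row counts do not match what Spark would return. It assumes a cache that is filled lazily by the first action and dropped whenever the table it was built from is written to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (NOT Spark code) of a lazy cache that is dropped when its source table changes. */
public class CacheInvalidationSketch {

    /** Hypothetical stand-in for a table; it counts how often it is scanned. */
    static class Table {
        final List<String> rows = new ArrayList<>();
        final List<CachedQuery> dependents = new ArrayList<>();
        int scans = 0;

        List<String> scan() {
            scans++;                          // corresponds to a "Reading File path" log line
            return new ArrayList<>(rows);     // copy, so later appends do not affect readers
        }

        void append(List<String> newRows) {
            rows.addAll(newRows);
            // Writing to the table invalidates every cache that was built on it.
            for (CachedQuery q : dependents) {
                q.invalidate();
            }
        }
    }

    /** Hypothetical stand-in for a Dataset that has been marked with cache(). */
    static class CachedQuery {
        private final Table source;
        private List<String> materialized;    // null until an action fills the cache

        CachedQuery(Table source) {
            this.source = source;
            source.dependents.add(this);
        }

        private List<String> compute() {
            if (materialized == null) {
                materialized = source.scan(); // cache miss: read the source
            }
            return materialized;
        }

        long count() {
            return compute().size();
        }

        void insertInto(Table target) {
            target.append(compute());
        }

        void invalidate() {
            materialized = null;
        }
    }

    private static Table sampleTable() {
        Table t = new Table();
        t.rows.addAll(Arrays.asList("row-1", "row-2"));
        return t;
    }

    /** Order used in the question: insert first, then count. */
    public static int scansInQuestionOrder() {
        Table t = sampleTable();
        CachedQuery ds = new CachedQuery(t);
        ds.insertInto(t);   // fills the cache, then the write drops it again
        ds.count();         // cache is empty, so the source is scanned a second time
        return t.scans;
    }

    /** Suggested order: count first, then insert. */
    public static int scansWithCountFirst() {
        Table t = sampleTable();
        CachedQuery ds = new CachedQuery(t);
        ds.count();         // fills the cache
        ds.insertInto(t);   // served from the cache; no second scan
        return t.scans;
    }

    public static void main(String[] args) {
        System.out.println("insert then count: " + scansInQuestionOrder() + " scans");
        System.out.println("count then insert: " + scansWithCountFirst() + " scans");
    }
}
```

In this model the question's order scans the source twice, while counting first scans it once. Note that the cache is still dropped after the insert in both cases, so any later action on ds would read the source again.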

answered Sep 27 '22 22:09

Raphael Roth

Tags: java, dataframe, caching, apache-spark
