What is the real difference between Append mode and Update mode in Spark Streaming?


According to the documentation:

Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink. This is supported for only those queries where rows added to the Result Table is never going to change. Hence, this mode guarantees that each row will be output only once (assuming fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. will support Append mode.

and

Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be outputted to the sink. More information to be added in future releases.

My confusion with Append mode: it says that only the new rows added to the Result Table since the last trigger will be output to the sink. So, for example, say we have three rows

r1, r2, r3 arriving at times t1, t2, t3 where t1 < t2 < t3

Now say at t4 the row r2 gets overwritten. If so, will we never see that change in the sink while operating in Append mode? Isn't that like losing a write?

My confusion with Update mode: it says that only the rows in the Result Table that were updated since the last trigger will be output to the sink. Does this imply that rows must already exist, and only when those existing rows are updated will they be output to the sink? What happens if there are no existing rows and a new row comes in while we are in Update mode?

asked Feb 22 '18 by user1870400


1 Answer

Looking carefully at the description of Append mode in the newest version of the docs, we see that it says:

Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.

In other words, there should never be any overwrites. In a scenario where you know there can be updates, use Update mode instead.
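To make this concrete, here is a minimal sketch of a query that satisfies Append mode's constraint. The rate source, local master, and console sink are illustrative choices of mine, not anything from the question:

```scala
// Minimal sketch, assuming a local SparkSession; the rate source and
// console sink are illustrative choices, not part of the question.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("AppendModeSketch")
  .getOrCreate()

// A stateless filter never modifies rows already in the Result Table,
// so the query satisfies Append mode's "rows never change" constraint.
val evens = spark.readStream
  .format("rate")                    // built-in test source: (timestamp, value)
  .option("rowsPerSecond", "5")
  .load()
  .filter("value % 2 = 0")

val query = evens.writeStream
  .outputMode("append")              // emit only rows new since the last trigger
  .format("console")
  .start()

query.awaitTermination()
```

Running this, each micro-batch prints only the rows that arrived since the previous trigger; rows already written are never revisited.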


For the second question about Update mode, the full quote from the docs is:

Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.

The last sentence is important here. Update mode is equivalent to Append mode when the query contains no aggregations (aggregations are what cause existing rows in the Result Table to change). New rows are therefore output as normal in this mode, not skipped.
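A minimal sketch of Update mode with an aggregation, again with illustrative source and sink choices (the bucketing column is made up for the example):

```scala
// Sketch of Update mode with an aggregation; source/sink choices are
// illustrative, and the bucketing column is made up for the example.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("UpdateModeSketch")
  .getOrCreate()

// The aggregation means existing rows (bucket counts) change over time.
val counts = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()
  .groupBy((col("value") % 10).as("bucket"))
  .count()

// Each trigger writes only the buckets whose counts changed since the
// last trigger; a bucket seen for the first time is output, not skipped.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()
```

Note that the same query in Append mode would be rejected at start time, since a streaming aggregation without a watermark means existing rows can change.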


For completeness, here is the third mode currently available:

Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
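For contrast, a sketch using the same illustrative aggregation in Complete mode, where every trigger rewrites the whole Result Table:

```scala
// Sketch of Complete mode with the same illustrative aggregation:
// every trigger rewrites the entire Result Table, changed or not.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("CompleteModeSketch")
  .getOrCreate()

val counts = spark.readStream
  .format("rate")
  .load()
  .groupBy((col("value") % 10).as("bucket"))
  .count()

counts.writeStream
  .outputMode("complete")  // the full table is written on every trigger
  .format("console")
  .start()
  .awaitTermination()
```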

The documentation contains a list of different query types and the supported modes as well as some helpful notes.

answered Sep 20 '22 by Shaido