
Fine-grained transformations vs coarse-grained transformations

Could anyone please explain the difference between fine-grained and coarse-grained transformations in the context of Spark? I was reading the paper on RDDs (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf) and it is not very clear to me how coarse-grained transformations provide fault tolerance in an effective way.

Amar asked Oct 04 '14 17:10




1 Answer

A fine-grained update is an update to a single record, as in a database. A coarse-grained transformation is a functional operator applied to the whole dataset, like the ones used in Spark: map, reduce, flatMap, join. Spark's model takes advantage of this: it only has to save the DAG of operations, which is small compared to the data you are processing, and it can use that DAG to recompute lost data as long as the original data is still there.

With fine-grained updates this kind of recomputation is not practical, because saving the updates could potentially cost as much as saving the data itself. If you update each record out of billions separately, you have to save the information needed to redo each update, whereas with a coarse-grained transformation you save one function that updates a billion records. Clearly, though, this comes at the cost of not being as flexible as a fine-grained model.
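To make the size difference concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's API: `LineageDataset`, `fine_grained_update`, and the record count `n` are hypothetical names chosen for illustration, and the source data is assumed to survive the failure.

```python
# Toy model (not Spark): compares the recovery metadata each approach has to keep.

class LineageDataset:
    """Hypothetical stand-in for an RDD: source data plus a list of whole-dataset functions."""

    def __init__(self, source, lineage=()):
        self.source = source            # assumed to survive failures (e.g. stable storage)
        self.lineage = tuple(lineage)   # one entry per coarse-grained transformation

    def map(self, fn):
        # Only record the function; no record is touched yet.
        return LineageDataset(self.source, self.lineage + (fn,))

    def compute(self):
        # Build (or rebuild) the result by replaying the lineage over the source.
        data = list(self.source)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data


def fine_grained_update(store, log, key, value):
    """Update one record and log it, as a mutable store would need to for recovery."""
    store[key] = value
    log.append((key, value))


n = 100_000
source = range(n)

# Coarse-grained: two functions describe the change to every record.
ds = LineageDataset(source).map(lambda x: x + 1).map(lambda x: x * 2)
result = ds.compute()
recomputed = ds.compute()   # simulate losing `result` and rebuilding it from lineage

# Fine-grained: the same change expressed one record at a time.
store = dict(enumerate(source))
update_log = []
for k in range(n):
    fine_grained_update(store, update_log, k, (store[k] + 1) * 2)

coarse_log_size = len(ds.lineage)   # grows with the number of transformations
fine_log_size = len(update_log)     # grows with the number of records updated
print(coarse_log_size, fine_log_size)
```

In this sketch the lineage holds two entries no matter how large `n` is, while the per-record log holds one entry for every record, which is roughly the trade-off described above.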

aaronman answered Sep 22 '22 05:09