 

Source Vs PTransform

I am new to the project, and I am trying to create a connector between Dataflow and a database.

The documentation clearly states that I should use a Source and a Sink, but I see a lot of people directly using a PTransform associated with a PInput or a PDone.

The Source/Sink API is experimental (which explains why all the examples use PTransform), but it seems easier to integrate with a custom runner (e.g. Spark).

Looking at the code, both approaches are used, and I cannot see any use case where the PTransform API would be the more attractive choice.

Is the Source/Sink API supposed to replace the PTransform API?

Did I miss something that clearly differentiates the two approaches?

Is the Source/Sink API stable enough to be considered the right way to code inputs and outputs?

Thanks for your advice!

asked Jan 11 '16 by pibafe


People also ask

What is PCollection and PTransform in dataflow?

A PCollection can contain either a bounded or unbounded number of elements. Bounded and unbounded PCollections are produced as the output of PTransforms (including root PTransforms like Read and Create ), and can be passed as the inputs of other PTransforms.

What is PTransform?

A PTransform represents a data processing operation, or a step, in your pipeline. It takes one or more input PCollections and transforms them into zero or more output PCollections.

What is ParDo and DoFn?

ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question. The DoFn , here I called it fn , is the logic that is applied to each element.

What is CoGroupByKey used for?

Aggregates all input elements by their key and allows downstream processing to consume all values associated with the key. While GroupByKey performs this operation over a single input collection and thus a single type of input values, CoGroupByKey operates over multiple input collections.


1 Answer

The philosophy of Dataflow is that PTransform is the main unit of abstraction and composability, i.e., any self-contained data processing task should be encapsulated as a PTransform. This includes the task of connecting to a third-party storage system: ingesting data from somewhere or exporting it to somewhere.

Take, for example, Google Cloud Datastore. In the code snippet:

    PCollection<Entity> entities =
        p.apply(DatastoreIO.readFrom(dataset, query));
    ...
    entities.apply(/* some processing */)
            .apply(DatastoreIO.writeTo(dataset));

the return type of DatastoreIO.readFrom(dataset, query) is a subclass of PTransform<PBegin, PCollection<Entity>>, and the type of DatastoreIO.writeTo(dataset) is a subclass of PTransform<PCollection<Entity>, PDone>.

It is true that these functions are implemented under the hood using the Source and Sink classes, but to a user who just wants to read or write something to Datastore, that's an implementation detail that usually should not matter (however, see the note at the end of this answer about exposing the Source or Sink class). Any connector, or for that matter, any other data processing task, is a PTransform.

Note: Currently connectors that read from somewhere tend to be PTransform<PBegin, PCollection<T>>, and connectors that write to somewhere tend to be PTransform<PCollection<T>, PDone>, but we are considering options to make it easier to use connectors in more flexible ways (for example, reading from a PCollection of filenames).

However, of course, this detail matters to somebody who wants to implement a new connector. In particular, you may ask:

Q: Why do I need the Source and Sink classes at all, if I could just implement my connector as a PTransform?

A: If you can implement your connector by just using the built-in transforms (such as ParDo, GroupByKey etc.), that's a perfectly valid way to develop a connector. However, the Source and Sink classes provide some low-level capabilities that, in case you need them, would be cumbersome or impossible to develop yourself.
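
For illustration only (this sketch is not from the original answer): a tiny connector assembled purely from built-in transforms, written against the Dataflow SDK 1.x style where a composite transform overrides apply(). Record, DbClient and their methods are hypothetical stand-ins for your element type and database client, and a real connector would also set a Coder on the output PCollection.

    import com.google.cloud.dataflow.sdk.transforms.Create;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.PBegin;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    // A connector built only from Create + ParDo: valid, but the whole query
    // is executed inside a single DoFn call, so the framework cannot split it.
    public class SimpleDbRead extends PTransform<PBegin, PCollection<Record>> {
      private final String query;

      public SimpleDbRead(String query) {
        this.query = query;
      }

      @Override
      public PCollection<Record> apply(PBegin input) {
        return input
            .apply(Create.of(query))          // one element: the query string
            .apply(ParDo.of(new QueryFn()));  // run it and emit the rows
      }

      static class QueryFn extends DoFn<String, Record> {
        @Override
        public void processElement(ProcessContext c) throws Exception {
          try (DbClient client = DbClient.connect()) {   // hypothetical client
            for (Record r : client.run(c.element())) {   // hypothetical query call
              c.output(r);
            }
          }
        }
      }
    }

A pipeline would then just call p.apply(new SimpleDbRead("SELECT ...")), exactly like the Datastore example above.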

For example, BoundedSource and UnboundedSource provide hooks for controlling how parallelization happens (both initial and dynamic work rebalancing - BoundedSource.splitIntoBundles, BoundedReader.splitAtFraction), while these hooks are not currently exposed for arbitrary DoFns.
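
As a rough, hedged sketch of where those hooks live (method names as in the SDK 1.x API referenced above; MyDbSource and its row range are hypothetical, the remaining abstract methods are omitted, and exact signatures may differ between SDK versions):

    import java.util.ArrayList;
    import java.util.List;

    import com.google.cloud.dataflow.sdk.io.BoundedSource;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;

    // Abstract on purpose: only the initial-splitting hook is shown here.
    public abstract class MyDbSource extends BoundedSource<Record> {
      private final long startRow;
      private final long endRow;   // hypothetical key range covered by this source

      MyDbSource(long startRow, long endRow) {
        this.startRow = startRow;
        this.endRow = endRow;
      }

      // Initial parallelization: before running, the service asks the source to
      // split itself into bundles, each of which becomes an independent read.
      @Override
      public List<? extends BoundedSource<Record>> splitIntoBundles(
          long desiredBundleSizeBytes, PipelineOptions options) {
        List<MyDbSource> bundles = new ArrayList<>();
        long step = Math.max(1, (endRow - startRow) / 10);   // naive fixed fan-out
        for (long start = startRow; start < endRow; start += step) {
          bundles.add(newRange(start, Math.min(start + step, endRow)));
        }
        return bundles;
      }

      // Hypothetical factory for a sub-range copy of this source.
      protected abstract MyDbSource newRange(long start, long end);

      // getEstimatedSizeBytes, producesSortedKeys, createReader, validate and
      // getDefaultOutputCoder are omitted. Dynamic work rebalancing is the
      // reader-side hook: the BoundedReader returned by createReader() can
      // implement splitAtFraction(double) to hand part of its remaining range
      // to an idle worker while the bundle is still running.
    }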

You could technically implement a parser for a file format by writing a DoFn<FilePath, SomeRecord> that takes the filename as input, reads the file and emits SomeRecord, but this DoFn would not be able to dynamically parallelize reading parts of the file onto multiple workers in case the file turned out to be very large at runtime. On the other hand, FileBasedSource has this capability built-in, as well as handling of glob filepatterns and such.
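
Concretely, that "technically possible" approach might look like the following sketch (the element is the file path as a String; SomeRecord and its parsing are hypothetical). Each file arrives as a single opaque element, which is exactly why the framework cannot split the work within one large file:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import com.google.cloud.dataflow.sdk.transforms.DoFn;

    // Parses a whole file inside one DoFn call: works, but one file is one
    // bundle's worth of work that cannot be rebalanced onto other workers.
    public class NaiveFileParseFn extends DoFn<String, SomeRecord> {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        for (String line : Files.readAllLines(Paths.get(c.element()), StandardCharsets.UTF_8)) {
          c.output(parseLine(line));
        }
      }

      // Hypothetical per-line parser.
      private SomeRecord parseLine(String line) {
        return new SomeRecord(line);
      }
    }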

Likewise, you could try implementing a connector to a streaming system by implementing a DoFn that takes a dummy element as input, establishes a connection and streams all elements into ProcessContext.output(), but DoFns currently don't support writing unbounded amounts of output from a single bundle, nor do they explicitly support the checkpointing and deduplication machinery needed for the strong consistency guarantees Dataflow gives to streaming pipelines. UnboundedSource, on the other hand, supports all this.

Sink (more precisely, the Write.to() PTransform) is also interesting: it is just a composite transform that you could write yourself if you wanted to (i.e. it has no hard-coded support in the Dataflow runner or backend), but it was developed with consideration for typical distributed fault tolerance issues that arise when writing data to a storage system in parallel, and it provides hooks that force you to keep those issues in mind: e.g., because bundles of data are written in parallel, and some bundles may be retried or duplicated for fault tolerance, there is a hook for "committing" just the results of the successfully completed bundles (WriteOperation.finalize).
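
A rough outline of that life cycle, with class and method names as in the SDK 1.x-era Sink API the answer describes (abbreviated; MyDbSink and the temporary-table scheme are hypothetical, and exact signatures may vary by SDK version):

    import com.google.cloud.dataflow.sdk.io.Sink;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;

    // Each parallel bundle gets its own Writer whose close() returns a result;
    // only results from bundles that actually succeeded reach finalize().
    public abstract class MyDbSink extends Sink<Record> {

      public abstract static class MyDbWriteOperation
          extends Sink.WriteOperation<Record, String> {   // String = temp table name

        @Override
        public void initialize(PipelineOptions options) {
          // Runs once before any bundle is written, e.g. create a staging area.
        }

        @Override
        public void finalize(Iterable<String> writerResults, PipelineOptions options) {
          // Runs once after the parallel writes. Retried or duplicated bundle
          // attempts never show up here; this is the "commit" hook
          // (WriteOperation.finalize) mentioned above.
          for (String tempTable : writerResults) {
            // e.g. atomically merge tempTable into the final destination.
          }
        }
      }

      public abstract static class MyDbWriter extends Sink.Writer<Record, String> {
        // open(uId) creates a uniquely named temporary table for this bundle,
        // write(record) appends to it, and close() returns its name as the
        // bundle result that finalize() above will commit.
      }
    }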

To summarize: using Source or Sink APIs to develop a connector helps you structure your code in a way that will work well in a distributed processing setting, and the source APIs give you access to advanced capabilities of the framework. But if your connector is a very simple one that needs neither, then you are free to just assemble your connector from other built-in transforms.

Q: Suppose I decide to make use of Source and Sink. Then how do I package my connector as a library: should I just provide the Source or Sink class, or should I wrap it into a PTransform?

A: Your connector should ultimately be packaged as a PTransform, so that the user can just p.apply() it in their pipeline. However, under the hood your transform can use Source and Sink classes.

A common pattern is to expose the Source and Sink classes as well, making use of the Fluent Builder pattern, and letting the user wrap them into a Read.from() or Write.to() transform themselves, but this is not a strict requirement.
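
For example, a common shape for such a package (a sketch only, assuming the 1.x-era Read.from() transform and a hypothetical MyDbBoundedSource, i.e. a concrete BoundedSource<Record> configured by a query):

    import com.google.cloud.dataflow.sdk.io.Read;
    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.values.PBegin;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public final class MyDbIO {
      private MyDbIO() {}

      // Entry point: MyDbIO.read().withQuery("...") reads like the built-in IOs.
      public static ReadTransform read() {
        return new ReadTransform(null);
      }

      public static class ReadTransform extends PTransform<PBegin, PCollection<Record>> {
        private final String query;

        private ReadTransform(String query) {
          this.query = query;
        }

        // Fluent builder step: returns a new, immutable transform.
        public ReadTransform withQuery(String query) {
          return new ReadTransform(query);
        }

        @Override
        public PCollection<Record> apply(PBegin input) {
          // Under the hood the transform just wraps the Source into Read.from().
          return input.apply(Read.from(new MyDbBoundedSource(query)));   // hypothetical source
        }
      }
    }

A pipeline then only sees p.apply(MyDbIO.read().withQuery("SELECT ...")), and whether the transform is backed by a Source or by plain ParDos stays an implementation detail.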

answered Nov 15 '22 by jkff