
Apache Beam - Integration test with unbounded PCollection

Tags: java, integration-testing, google-cloud-pubsub, apache-beam, google-cloud-dataflow

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context...

Details about our pipeline:

We use PubsubIO as our data source (unbounded PCollection)
Intermediate transforms include a custom CombineFn and a very simple windowing/triggering strategy
Our final transform is JdbcIO, using org.neo4j.jdbc.Driver to write to Neo4j

Current testing approach:

Run Google Cloud's Pub/Sub emulator on the machine that the tests are running on
Build an in-memory Neo4j database and pass its URI into our pipeline options
Run the pipeline by calling OurPipeline.main(TestPipeline.convertToArgs(options))
Use Google Cloud's Java Pub/Sub client library to publish messages to a test topic (using Pub/Sub emulator), which PubsubIO will read from
Data should flow through the pipeline and eventually hit our in-memory instance of Neo4j
Make simple assertions regarding the presence of this data in Neo4j

This is intended to be a simple integration test which will verify that our pipeline as a whole is behaving as expected.

The issue we're currently having is that when we run our pipeline it is blocking. We are using DirectRunner and pipeline.run() (not pipeline.run().waitUntilFinish()), but the test seems to hang after running the pipeline. Because this is an unbounded PCollection (running in streaming mode), the pipeline does not terminate, and thus any code after it is not reached.

So, I have a few questions:

1) Is there a way to run a pipeline and then stop it manually later?

2) Is there a way to run a pipeline asynchronously? Ideally it would just kick off the pipeline (which would then continuously poll Pub/Sub for data) and then move on to the code responsible for publishing to Pub/Sub.

3) Is this method of integration testing a pipeline reasonable, or are there better methods that might be more straightforward? Any info/guidance here would be appreciated.

Let me know if I can provide any additional code/context - thanks!

asked Jun 23 '17 18:06

Chris Staikos

1 Answer

You can run the pipeline asynchronously with the DirectRunner by setting the blockOnRun pipeline option to false (isBlockOnRun is the getter on DirectOptions; on the command line the flag is --blockOnRun=false). As long as you keep a reference to the returned PipelineResult, calling cancel() on that result should stop the pipeline.
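As a rough illustration of the non-blocking run and later cancel, here is a minimal sketch. It assumes beam-sdks-java-core and beam-runners-direct-java are on the classpath. The class name AsyncDirectRunnerSketch and the startAsync helper are hypothetical, and GenerateSequence is only a stand-in for PubsubIO so that the example needs no emulator.

```java
import java.io.IOException;
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.joda.time.Duration;

public class AsyncDirectRunnerSketch {

  /** Starts an unbounded pipeline on the DirectRunner without blocking the caller. */
  static PipelineResult startAsync() {
    DirectOptions options = PipelineOptionsFactory.as(DirectOptions.class);
    options.setRunner(DirectRunner.class);
    // With blockOnRun=false, run() returns instead of waiting for the pipeline to finish.
    options.setBlockOnRun(false);

    Pipeline pipeline = Pipeline.create(options);
    // Stand-in for PubsubIO (an assumption of this sketch):
    // an unbounded source that emits one element per second.
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)));
    return pipeline.run();
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    PipelineResult result = startAsync();

    // In the real test, this is where messages would be published to the
    // Pub/Sub emulator and Neo4j would be polled for the expected data.
    Thread.sleep(2000);

    // Stop the still-running streaming pipeline.
    PipelineResult.State finalState = result.cancel();
    System.out.println("Pipeline state after cancel: " + finalState);
  }
}
```

Note that this only works if the test can get hold of the PipelineResult. If the pipeline is started through a main method that discards the result, it may help to factor the construction and run() call into a method that returns it.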

For your third question, your setup seems reasonable. However, if you want a smaller-scale test of your pipeline (requiring fewer components), you can encapsulate all of your processing logic within a custom PTransform. This PTransform should take inputs that have already been fully parsed from the input source, and produce outputs that have not yet been converted into the form the output sink expects.

Once that is done, you can use either Create (which will generally not exercise triggering) or TestStream (which may exercise triggering, depending on how you construct it) with the DirectRunner to generate a finite amount of input data, apply the processing PTransform to that PCollection, and use PAssert on the output PCollection to verify that the pipeline produced the outputs you expect.
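To make that concrete, here is a sketch of such a test. It assumes JUnit 4 and the DirectRunner are on the test classpath. CountPerWindow is a hypothetical stand-in for your own processing PTransform (fixed one-minute windows plus a per-element count instead of your CombineFn), and the element values and timestamps are made up for illustration.

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;
import org.junit.Rule;
import org.junit.Test;

public class ProcessingTransformTest {

  /** Hypothetical processing step: parsed inputs in, not-yet-formatted outputs out. */
  static class CountPerWindow
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> input) {
      return input
          .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
          .apply(Count.<String>perElement());
    }
  }

  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void countsElementsInFirstWindow() {
    Instant start = new Instant(0L);

    // A finite, timestamped input; the watermark is then advanced to infinity
    // so that every window closes and fires.
    TestStream<String> events =
        TestStream.create(StringUtf8Coder.of())
            .addElements(
                TimestampedValue.of("a", start),
                TimestampedValue.of("b", start),
                TimestampedValue.of("a", start.plus(Duration.standardSeconds(30))))
            .advanceWatermarkToInfinity();

    PCollection<KV<String, Long>> counts =
        pipeline.apply(events).apply(new CountPerWindow());

    // All three elements fall into the window [0, 1 min).
    PAssert.that(counts)
        .inWindow(new IntervalWindow(start, Duration.standardMinutes(1)))
        .containsInAnyOrder(KV.of("a", 2L), KV.of("b", 1L));

    pipeline.run().waitUntilFinish();
  }
}
```

Because the TestStream is finite, the pipeline terminates on its own, so this style of test does not need the asynchronous run and cancel dance at all.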

For more information about testing, the Beam website has information about these styles of tests in the Programming Guide and a blog post about testing pipelines with TestStream.

answered Sep 30 '22 04:09

Thomas Groh
