
Dataflow Pipeline - "Processing stuck in step <STEP_NAME> for at least <TIME> without outputting or completing in state finish..."

The Dataflow pipelines developed by my team suddenly got stuck and stopped processing our events. Their worker logs filled up with warning messages saying that one specific step was stuck. The peculiar thing is that the failing steps are different: one is a BigQuery output and the other a Cloud Storage output.

The following are the log messages that we are receiving:

For BigQuery output:

Processing stuck in step <STEP_NAME>/StreamingInserts/StreamingWriteTables/StreamingWrite for at least <TIME> without outputting or completing in state finish
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
  at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:765)
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:829)
  at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:131)
  at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
  at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)

For Cloud Storage output:

Processing stuck in step <STEP_NAME>/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least <TIME> without outputting or completing in state process
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
  at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:421)
  at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
  at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1007)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:726)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)

All of the pipelines were drained and redeployed, but the same thing happened again after a while (within 3 to 4 hours). Some of them had been running for more than 40 days and suddenly ran into this without any changes to the code.

I would like to ask for help in finding the cause of this problem. These are the IDs of some of the Dataflow jobs affected by it:

Stuck in BigQuery output: 2019-03-04_04_46_31-3901977107649726570

Stuck in Cloud Storage output: 2019-03-04_07_50_00-10623118563101608836

asked Mar 04 '19 by Caio Riva


3 Answers

I'm having the same issue. I found out that the most common cause is that one of the jobs failed to insert into the BigQuery table or (much less commonly) failed to save the file into the GCS bucket. The thread in charge does not catch the exception and keeps waiting for the job. This is a bug in Apache Beam, and I already created a ticket for it:

https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-7693

Let's see if the folks from Apache Beam fix this issue (it's literally an unhandled exception).

So far my recommendation is to validate the constraints of your data before the insertion. Keep in mind things like:

1) Max row size (as of 2019 it is 1MB for streaming inserts and 100MB for batch loads)

2) Missing REQUIRED values should be routed to a dead letter beforehand, so they never reach the job

3) If you have unknown fields, don't forget to enable the option ignoreUnknownFields (otherwise they will make your job die)
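To illustrate points 2) and 3), here is a minimal sketch (not from the original answer) of a dead-letter pattern around Beam's streaming BigQuery sink in Java. The destination table, the event_id field, and the 1 MB size check are assumptions for the example; ignoreUnknownValues() is the Beam Java method corresponding to the ignoreUnknownFields option mentioned above, and getFailedInserts() exposes rows that BigQuery itself rejects.

import java.nio.charset.StandardCharsets;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class ValidatedBigQueryWrite {

  private static final TupleTag<TableRow> VALID = new TupleTag<TableRow>() {};
  private static final TupleTag<TableRow> DEAD_LETTER = new TupleTag<TableRow>() {};

  /** Validates rows before streaming them into BigQuery and returns the dead-letter rows. */
  public static PCollection<TableRow> write(PCollection<TableRow> rows) {
    PCollectionTuple validated = rows.apply("ValidateRows",
        ParDo.of(new DoFn<TableRow, TableRow>() {
          @ProcessElement
          public void process(ProcessContext c) {
            TableRow row = c.element();
            // Assumed checks: a REQUIRED field is present and the row stays under the
            // ~1 MB streaming-insert limit mentioned above.
            boolean tooBig =
                row.toString().getBytes(StandardCharsets.UTF_8).length > 1_000_000;
            if (row.get("event_id") == null || tooBig) {
              c.output(DEAD_LETTER, row); // bad rows never reach the BigQuery write
            } else {
              c.output(row);
            }
          }
        }).withOutputTags(VALID, TupleTagList.of(DEAD_LETTER)));

    WriteResult result = validated.get(VALID).apply("WriteToBQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.events") // assumed destination table
            .ignoreUnknownValues()              // drop unknown fields instead of failing
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // Rows rejected by BigQuery are surfaced here instead of blocking the worker,
    // merged with the rows we filtered out ourselves.
    return PCollectionList.of(validated.get(DEAD_LETTER))
        .and(result.getFailedInserts())
        .apply("MergeDeadLetters", Flatten.pCollections());
  }
}

The caller can then write the returned dead-letter PCollection to GCS or Pub/Sub for later inspection instead of letting bad rows stall the write step.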

I presume that you are only having issues during peak hours, because that is when more "unsatisfied" events arrive.

Hopefully this helps a little bit.

answered Oct 13 '22 by Juan Urrego


I was running into the same error, and the reason was that I had created an empty BigQuery table without specifying a schema. Make sure to create the BQ table with a schema before loading data into it via Dataflow.
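For illustration only (not part of the original answer), one way to create the table with an explicit schema up front is the google-cloud-bigquery Java client; the dataset, table, and field names below are assumptions:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateTableWithSchema {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Assumed fields; replace with the schema your pipeline actually writes.
    Schema schema = Schema.of(
        Field.of("event_id", StandardSQLTypeName.STRING),
        Field.of("payload", StandardSQLTypeName.STRING),
        Field.of("event_ts", StandardSQLTypeName.TIMESTAMP));

    TableId tableId = TableId.of("my_dataset", "events");
    bigquery.create(TableInfo.newBuilder(tableId, StandardTableDefinition.of(schema)).build());

    System.out.println("Created " + tableId + " with an explicit schema");
  }
}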

answered Oct 13 '22 by Zeeshan


The Processing stuck messages do not necessarily imply that your pipeline is actually stuck. These messages are logged by a worker that has been performing the same operation for over 5 minutes.

Often, this simply indicates a slow operation: an external RPC, or waiting for an external process (very common when performing Load or Query jobs to BigQuery).

If you see this kind of message often in your pipeline, or with steadily increasing durations (5m, 10m, 50m, 1h, etc.), then it probably does indicate that something is stuck; but if you only see it occasionally, it's nothing to worry about.


It is worth considering that older versions of Beam (from 2.5.0 to 2.8.0) had a deadlock issue with the Conscrypt library, which was used as the default security provider. As of Beam 2.9.0, Conscrypt is no longer the default security provider.

Another option is to downgrade to Beam 2.4.0, where Conscrypt was also not the default provider.
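As a side note (an assumption on my part, not from the original answer), you can check which JCE security provider your workers actually default to by logging the provider list with plain JDK APIs at startup:

import java.security.Provider;
import java.security.Security;

public class ListSecurityProviders {
  public static void main(String[] args) {
    // The first provider in this list is the default. On an affected Beam version
    // (2.5.0 to 2.8.0) it may be Conscrypt, the library tied to the deadlock above.
    for (Provider p : Security.getProviders()) {
      System.out.println(p.getName() + " " + p.getVersion() + " - " + p.getInfo());
    }
  }
}

If Conscrypt shows up first on an affected version, upgrading to Beam 2.9.0+ (or downgrading to 2.4.0, as suggested above) changes the default.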

answered Oct 13 '22 by Pablo