 

Problem in specifying the network in cloud dataflow

I didn't configure the project, and I get this error whenever I run my job:

'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.'

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions)

p_options = PipelineOptions()
google_cloud_options = p_options.view_as(GoogleCloudOptions)
google_cloud_options.region = 'europe-west1'
google_cloud_options.project = 'my-project'
google_cloud_options.job_name = 'rim'
google_cloud_options.staging_location = 'gs://my-bucket/binaries'
google_cloud_options.temp_location = 'gs://my-bucket/temp'
p_options.view_as(StandardOptions).runner = 'DataflowRunner'
p_options.view_as(SetupOptions).save_main_session = True
p_options.view_as(StandardOptions).streaming = True
p_options.view_as(WorkerOptions).subnetwork = 'regions/europe-west1/subnetworks/test'
p = beam.Pipeline(options=p_options)

I tried to specify --network 'test' on the command line, since it is not the default configuration.
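For reference, a minimal sketch of setting that network programmatically through WorkerOptions rather than on the command line (the name 'test' is the non-default network from the attempt above and is assumed to exist in the project):

from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

p_options = PipelineOptions()
worker_options = p_options.view_as(WorkerOptions)
# Either network or subnetwork may be set; if both are given they must
# refer to the same VPC. 'test' is assumed to be the non-default network.
worker_options.network = 'test'
worker_options.subnetwork = 'regions/europe-west1/subnetworks/test'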

Rim asked Jul 30 '19


People also ask

How do you rerun a failed job in Dataflow?

Hi @sɐunıɔןɐqɐp, there is a Clone option at the top of the Dataflow jobs dashboard. It simply clones the current job and lets you edit and run it again.

What does Cloud Dataflow use to support fast and simplified pipeline development?

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service.
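As a rough illustration (not part of the original question), a minimal Beam pipeline that runs locally on the DirectRunner and becomes a Dataflow job simply by changing the runner option:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs on the local DirectRunner by default; passing --runner=DataflowRunner
# (plus project, region and temp_location) executes the same code on Dataflow.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Create' >> beam.Create(['hello', 'dataflow'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))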

What kind of pipeline can be built using Dataflow?

Dataflow has two data pipeline types: streaming and batch. Both types of pipelines run jobs that are defined in Dataflow templates. A streaming data pipeline runs a Dataflow streaming job immediately after it is created. A batch data pipeline runs a Dataflow batch job on a user-defined schedule.
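A small sketch of how that distinction appears in the Python SDK (assuming an options object like p_options in the question): the streaming flag on StandardOptions selects the job type.

from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
# False (the default) launches a batch Dataflow job;
# True launches a streaming Dataflow job, as in the question's code.
options.view_as(StandardOptions).streaming = True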

What is staging location in Dataflow?

staging_location: a Cloud Storage path where Dataflow stages the local code packages needed by the workers executing the job (temp_location, by contrast, holds temporary job files created during execution of the pipeline).


2 Answers

It looks like your default firewall rules were modified, and Dataflow detected this and prevented your job from launching. Could you verify that the firewall rules in your project were not modified? Please take a look at the Dataflow routes and firewall documentation; there you will also find a command to restore the firewall rules:

gcloud compute firewall-rules create [FIREWALL_RULE_NAME] \
    --network [NETWORK] \
    --action allow \
    --direction ingress \
    --target-tags dataflow \
    --source-tags dataflow \
    --priority 0 \
    --rules tcp:1-65535

Pick a name for the firewall rule and provide a network name, then pass that network name with --network when you launch the Dataflow job. If you have a network named 'default', Dataflow will try to use it automatically, so you won't need to pass --network. If you've deleted that network, you may wish to recreate it.
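As a sketch of what passing the network at launch time looks like from the Python SDK's side (the script name my_pipeline.py is hypothetical; the --network flag is parsed into WorkerOptions.network):

# Illustrative launch command:
#   python my_pipeline.py --runner=DataflowRunner --project=my-project \
#       --region=europe-west1 --temp_location=gs://my-bucket/temp --network=test
import sys
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

p_options = PipelineOptions(sys.argv[1:])
print(p_options.view_as(WorkerOptions).network)  # 'test' when --network=test is given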

Alex Amato answered Oct 11 '22


As of now (up to Apache Beam 2.19.0), Dataflow provides no way to set a network tag on its worker VMs. Instead, when creating the firewall rule, we should add 'dataflow' as the target tag:

gcloud compute firewall-rules create FIREWALL_RULE_NAME \
    --network NETWORK \
    --action allow \
    --direction DIRECTION \
    --target-tags dataflow \
    --source-tags dataflow \
    --priority 0 \
    --rules tcp:12345-12346

See this page for more details: https://cloud.google.com/dataflow/docs/guides/routes-firewall

Aditya answered Oct 11 '22