Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

worker_machine_type tag not working in Google Cloud Dataflow with python

I am using Apache Beam in Python with Google Cloud Dataflow (2.3.0). When specifying the worker_machine_type parameter as e.g. n1-highmem-2 or custom-1-6656, Dataflow runs the job but always uses the standard machine type n1-standard-1 for every worker.

Does anyone have an idea if I am doing something wrong?

Other topics (here and here) show that this should be possible, so this might be a version issue.

My code for specifying PipelineOptions (note that all other options do work fine, so it should recognize the worker_machine_type parameter):

def get_cloud_pipeline_options(project):

  options = {
    'runner': 'DataflowRunner',
    'job_name': ('converter-ml6-{}'.format(
        datetime.now().strftime('%Y%m%d%H%M%S'))),
    'staging_location': os.path.join(BUCKET, 'staging'),
    'temp_location': os.path.join(BUCKET, 'tmp'),
    'project': project,
    'region': 'europe-west1',
    'zone': 'europe-west1-d',
    'autoscaling_algorithm': 'THROUGHPUT_BASED',
    'save_main_session': True,
    'setup_file': './setup.py',
    'worker_machine_type': 'custom-1-6656',
    'max_num_workers': 3,
  }

  return beam.pipeline.PipelineOptions(flags=[], **options)

def main(argv=None):
  args = parse_arguments(sys.argv if argv is None else argv)

  pipeline_options = get_cloud_pipeline_options(args.project_id

  pipeline = beam.Pipeline(options=pipeline_options)
like image 957
dumkar Avatar asked Apr 12 '18 07:04

dumkar


2 Answers

PipelineOptions uses argparse behind the scenes to parse its argument. In the case of machine type, the name of the argument is machine_type however the flag name is worker_machine_type. This works fine in the following two cases, where argparse does its parsing and is aware of this aliasing:

  1. Passing arguments on the commandline. e.g. my_pipeline.py --worker_machine_type custom-1-6656
  2. Passing arguments as a command line flags e.g. flags['--worker_machine_type', 'worker_machine_type custom-1-6656', ...]

However it does not work well with **kwargs. Any additional args passed in that way are used to substitute for known argument names (but not flag names).

In short, using machine_type would work everywhere. I filed https://issues.apache.org/jira/browse/BEAM-4112 for this to be fixed in Beam in the future.

like image 116
Szere Dyeri Avatar answered Oct 01 '22 19:10

Szere Dyeri


This can be solved by using the flag machine_type instead of worker_machine_type. The rest of the code works fine.

The documentation is thus mentioning the wrong field name.

like image 28
dumkar Avatar answered Oct 01 '22 19:10

dumkar