
RuntimeValueProviderError when creating a google cloud dataflow template with Apache Beam python

I can't stage a Cloud Dataflow template with Python 3.7. It fails on the one parameterized argument with apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: input, type: str, default_value: 'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible

Staging the template with Python 2.7 works fine.

I have tried running Dataflow jobs with 3.7 and they work fine; only the template staging is broken. Is Python 3.7 still not supported in Dataflow templates, or did the syntax for staging change in Python 3?

Here is the relevant piece of the pipeline:

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


class WordcountOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    # Registers --input as a ValueProvider so it can be supplied at
    # template run time rather than at staging time.
    parser.add_value_provider_argument(
      '--input',
      default='gs://dataflow-samples/shakespeare/kinglear.txt',
      help='Path of the file to read from',
      dest='input')


def main(argv=None):
  options = PipelineOptions(flags=argv)
  setup_options = options.view_as(SetupOptions)
  wordcount_options = options.view_as(WordcountOptions)

  with beam.Pipeline(options=setup_options) as p:
    lines = p | 'read' >> ReadFromText(wordcount_options.input)


if __name__ == '__main__':
  main()
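For context, an option registered with add_value_provider_argument is deliberately deferred: at template-staging time its value is not yet available, and anything that calls .get() on it during pipeline construction (as ReadFromText's size estimation does in the trace below) raises exactly this error. The deferral semantics can be sketched with a small self-contained toy model (this is an illustration of the concept, not Beam's actual implementation):

```python
class RuntimeValueProviderSketch:
    """Toy model of a deferred option: the value is only readable
    once runtime options have been supplied to the job."""

    runtime_options = None  # populated only when the job actually runs

    def __init__(self, option_name, default_value):
        self.option_name = option_name
        self.default_value = default_value

    def is_accessible(self):
        return RuntimeValueProviderSketch.runtime_options is not None

    def get(self):
        if not self.is_accessible():
            # Mirrors the "not accessible" error seen at staging time.
            raise RuntimeError('%s not accessible' % self.option_name)
        return RuntimeValueProviderSketch.runtime_options.get(
            self.option_name, self.default_value)


vp = RuntimeValueProviderSketch('input', 'gs://bucket/default.txt')

# At template-staging time: no runtime options yet, so .get() must fail.
try:
    vp.get()
except RuntimeError as e:
    print(e)  # input not accessible

# At job-run time: options are injected and .get() succeeds.
RuntimeValueProviderSketch.runtime_options = {'input': 'gs://bucket/real.txt'}
print(vp.get())
```

The point is that staging should never evaluate the provider; the bug in the trace is that the SDK's source-size estimation does so during graph construction.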

Here is the full repo with the staging scripts https://github.com/firemuzzy/dataflow-templates-bug-python3

There was a previous, similar issue, but I am not sure how related it is, since that one occurred on Python 2.7, whereas my template stages fine in 2.7 and fails only in 3.7:

How to create Google Cloud Dataflow Wordcount custom template in Python?

**** Stack Trace ****

Traceback (most recent call last):
  File "run_pipeline.py", line 44, in <module>
    main()
  File "run_pipeline.py", line 41, in main
    lines = p | 'read' >> ReadFromText(wordcount_options.input)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 906, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 515, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 490, in apply
    return self.apply(transform, pvalueish)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 525, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 183, in apply
    return m(transform, input, options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 189, in apply_PTransform
    return transform.expand(input)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/textio.py", line 542, in expand
    return pvalue.pipeline | Read(self._source)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 515, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 525, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 183, in apply
    return m(transform, input, options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1020, in apply_Read
    return self.apply_PTransform(transform, pbegin, options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 189, in apply_PTransform
    return transform.expand(input)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 863, in expand
    return pbegin | _SDFBoundedSourceWrapper(self.source)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/pvalue.py", line 113, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 525, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 183, in apply
    return m(transform, input, options)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 189, in apply_PTransform
    return transform.expand(input)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1543, in expand
    | core.ParDo(self._create_sdf_bounded_source_dofn()))
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1517, in _create_sdf_bounded_source_dofn
    estimated_size = source.estimate_size()
  File "/usr/local/lib/python3.7/site-packages/apache_beam/options/value_provider.py", line 136, in _f
    raise error.RuntimeValueProviderError('%s not accessible' % obj)
apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: input, type: str, default_value: 'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible
mlablablab asked Jan 27 '20 22:01



1 Answer

Unfortunately, it looks like template staging is broken in Apache Beam's Python SDK 2.18.0.

For now, the workaround is to avoid Beam 2.18.0: in your requirements/dependencies, pin apache-beam[gcp]<2.18.0 or apache-beam[gcp]>2.18.0
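For example, the pin can be applied directly with pip, or placed in a requirements.txt (the choice of which side of 2.18.0 to pin to is up to you):

```shell
# Stay on the last release before the regression:
pip install 'apache-beam[gcp]<2.18.0'

# ...or move past it once a fixed release is available:
pip install 'apache-beam[gcp]>2.18.0'
```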

Pablo answered Oct 21 '22 09:10