
Dataflow/Apache Beam: manage custom module dependencies

I have a .py pipeline using Apache Beam that imports another module (.py), which is my own custom module. I have a structure like this:

├── mymain.py
└── myothermodule.py

I import myothermodule.py in mymain.py like this:

import myothermodule

When I run it locally with DirectRunner, I have no problem. But when I run it on Dataflow with DataflowRunner, I get this error:

ImportError: No module named myothermodule

So what should I do so that this module is found when the job runs on Dataflow?

mee asked Aug 09 '18 09:08



1 Answer

When you run your pipeline remotely, its dependencies must also be available on the remote workers. To do that, put your module file in a Python package: move it into a directory that contains an __init__.py file, and create a setup.py next to your main script. The layout would look like this:

├── mymain.py
├── setup.py
└── othermodules
    ├── __init__.py
    └── myothermodule.py

And import it like this:

from othermodules import myothermodule

Then you can run your pipeline with the command-line option --setup_file ./setup.py
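
As a rough sketch of how that option might be passed when launching from Python: the helper below assembles the arguments, and run() hands them to Beam. The project ID, region, and bucket are placeholder values, and build_pipeline_args and myothermodule.transform are hypothetical names used only for illustration. Beam is imported inside run() so the helper can be inspected without it installed.

```python
def build_pipeline_args(project, region, temp_location, setup_file="./setup.py"):
    """Return the command-line style options for a Dataflow run."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Tells Beam to build the local package and install it on the workers.
        f"--setup_file={setup_file}",
    ]


def run(argv):
    """Launch the pipeline; requires apache_beam and the othermodules package."""
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    from othermodules import myothermodule

    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as pipeline:
        _ = (
            pipeline
            | beam.Create([1, 2, 3])
            # transform is a hypothetical function defined in myothermodule.py.
            | beam.Map(myothermodule.transform)
        )


# Placeholder values; replace them with your own project settings.
args = build_pipeline_args("my-project", "us-central1", "gs://my-bucket/tmp")
print(args)
```

Calling run(args) from mymain.py would then submit the job with the package attached.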

A minimal setup.py file would look like this:

import setuptools

setuptools.setup(packages=setuptools.find_packages())
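
find_packages() only picks up directories that contain an __init__.py file, so a missing one can bring back the same ImportError on the workers. The following is a minimal pre-flight check using only the standard library; discover_packages is a hypothetical helper that approximates find_packages() and is not part of setuptools. It builds the layout from above in a temporary directory and lists the packages that would be found.

```python
import os
import tempfile


def discover_packages(root):
    """List directories under root that contain __init__.py, as dotted names.

    This is a simplified approximation of setuptools.find_packages().
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue
        if "__init__.py" not in filenames:
            # Not a package: skip it and do not descend further.
            dirnames[:] = []
            continue
        relative = os.path.relpath(dirpath, root)
        found.append(relative.replace(os.sep, "."))
    return sorted(found)


def make_layout(root):
    """Create the example project layout with empty files."""
    package_dir = os.path.join(root, "othermodules")
    os.makedirs(package_dir)
    paths = [
        os.path.join(root, "mymain.py"),
        os.path.join(root, "setup.py"),
        os.path.join(package_dir, "__init__.py"),
        os.path.join(package_dir, "myothermodule.py"),
    ]
    for path in paths:
        open(path, "w").close()


with tempfile.TemporaryDirectory() as tmp:
    make_layout(tmp)
    packages = discover_packages(tmp)

print(packages)  # ['othermodules']
```

If the list comes back empty for your own project, the package directory is most likely missing its __init__.py.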

The whole setup is documented here.

And a whole example using this can be found here.

rilla answered Oct 23 '22 14:10