 

Python apache beam ImportError: No module named *** on dataflow worker

Summary: Some local packages work and some don't.

My beam application's structure:

-setup.py

-app/__init__.py
-app/main.py

-package1/__init__.py
-package1/one.py

-package2/__init__.py
-package2/two.py

-package3/__init__.py
-package3/three.py

In main.py:

from package1 import one
from package2 import two
from package3 import three

In setup.py:

import setuptools

setuptools.setup(
    name='beam',
    version='1.0',
    install_requires=['apache-beam[gcp]',
                      'google-cloud==0.34.0',
                      'google-cloud-bigquery==0.25.0',
                      'requests==2.19.1',
                      'google-cloud-storage==1.12.0'
                      ],
    packages=setuptools.find_packages(),
)

When running with python -m app.main:

With DirectRunner (run locally), no problem.

With DataflowRunner (submitted to Google Dataflow), I get this error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 344, in apache_beam.runners.worker.operations.DoOperation.start
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 345, in apache_beam.runners.worker.operations.DoOperation.start
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 350, in apache_beam.runners.worker.operations.DoOperation.start
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 244, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
    return load(file, ignore)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 465, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named three

This is "a bit" frustrating because I have double- and triple-checked for differences between those packages, and they are the same: the same __init__.py file (empty, with no weird or hidden characters) and the same kind of structure in the *.py files. But for some reason, package3 just doesn't want to cooperate.

Does anyone have a solution for this problem?

Thank you.

Xitrum asked Oct 22 '25 06:10


1 Answer

It's been almost a year, but I had a very similar issue and was able to resolve it, so I'm posting this for others who stumble onto this page.

In my case, there is nothing special about package3.three; it just happens to be the first module that the worker tries to import. In fact, removing package3.three (e.g. by temporarily including its contents directly in main.py) leads to the same error with one of the other modules.
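The traceback in the question ends inside pickle's find_class, which imports a module by name. Here is a minimal sketch of that mechanism, using only the standard library on Python 3 (not the Python 2.7 and dill stack on the Dataflow worker): a function is pickled as a reference to its module and name, so loading it fails in any process that cannot import that module. The package name demo_package3 and the function transform are hypothetical stand-ins, not names from the original project.

```python
import importlib
import os
import pickle
import sys
import tempfile


def unpickle_without_module():
    """Pickle a function by reference, hide its module, then try to load it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for package3/three.py.
        pkg_dir = os.path.join(tmp, "demo_package3")
        os.makedirs(pkg_dir)
        open(os.path.join(pkg_dir, "__init__.py"), "w").close()
        with open(os.path.join(pkg_dir, "three.py"), "w") as f:
            f.write("def transform(x):\n    return x * 3\n")

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            three = importlib.import_module("demo_package3.three")
            # The payload stores the module and function names, not the code.
            payload = pickle.dumps(three.transform)
        finally:
            # Simulate a worker where the package was never installed.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.startswith("demo_package3")]:
                del sys.modules[name]

    importlib.invalidate_caches()
    try:
        pickle.loads(payload)
    except ImportError as exc:
        return payload, str(exc)
    return payload, None


if __name__ == "__main__":
    payload, error = unpickle_without_module()
    print(b"demo_package3.three" in payload)
    print(error)
```

Whichever pickled reference is loaded first is the one that raises, which is consistent with the error moving to another module once package3.three is taken out of the picture.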

While I don't fully understand the root cause, running with a file invocation (python app/main.py) rather than the module invocation (python -m app.main) resolved the issue. My guess is that there is some conflict between the packaging in setup.py and the implicit packaging in module invocation.
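To see what actually differs between the two invocations, here is a small sketch that runs the same file both ways and reports __package__ and the first sys.path entry. It assumes Python 3 and a throwaway app/main.py created in a temporary directory. It does not touch Beam or Dataflow, so it only shows the interpreter-level difference and does not confirm the root cause.

```python
import os
import subprocess
import sys
import tempfile

# Probe script written to a temporary app/main.py (hypothetical stand-in).
PROBE = (
    "import os, sys\n"
    "print(repr(__package__))\n"
    "print(os.path.basename(sys.path[0]))\n"
)


def compare_invocations():
    """Run app/main.py as a file and as a module; return what each one sees."""
    with tempfile.TemporaryDirectory() as tmp:
        app_dir = os.path.join(tmp, "app")
        os.makedirs(app_dir)
        open(os.path.join(app_dir, "__init__.py"), "w").close()
        with open(os.path.join(app_dir, "main.py"), "w") as f:
            f.write(PROBE)

        def run(args):
            done = subprocess.run(
                [sys.executable] + args,
                cwd=tmp,
                capture_output=True,
                text=True,
                check=True,
            )
            package, path0 = done.stdout.splitlines()[:2]
            return package, path0

        return {
            # File invocation: no package context; sys.path[0] is typically app/.
            "script": run([os.path.join("app", "main.py")]),
            # Module invocation: package is 'app'; sys.path[0] is typically the
            # working directory.
            "module": run(["-m", "app.main"]),
        }


if __name__ == "__main__":
    for mode, (package, path0) in compare_invocations().items():
        print(mode, package, path0)
```

With the file invocation the script runs outside any package, while with -m it runs as part of the app package with the project root on the path, so the two runs can resolve imports differently.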

Aritra Biswas answered Oct 24 '25 21:10