
AWS Glue: passing additional Python modules to the job - ModuleNotFoundError

I'm trying to run a Glue job (version 4) to do some simple batch data processing. I'm using additional Python libraries that the Glue environment doesn't provide: translate and langdetect. Additionally, even though the Glue environment provides the 'nltk' package, importing it keeps failing with errors that dependencies are not found (e.g. regex._regex, _sqlite3).

I tried a few solutions to achieve my goal:

  1. using --extra-py-files, pointing to an S3 path where I uploaded either:
  • a .zip file containing the translate and langdetect Python packages
  • a directory with the already unzipped packages
  • the packages themselves in .whl format (along with their dependencies)
  2. using --additional-python-modules, either:
  • pointing to an S3 path where I uploaded the packages in .whl format (along with their dependencies)
  • or just naming the packages to be installed inside the Glue environment via pip3
  3. using Docker

Additionally, I followed a few useful sources to overcome the issue of ModuleNotFoundError:

a) https://aws.amazon.com/premiumsupport/knowledge-center/glue-import-error-no-module-named/.

b) https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/

c) https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html

I also tried Glue versions 4 and 3, with no luck. It seems like a bug. The Glue role has all the permissions needed to read the S3 bucket. The script and the libraries I'm trying to install both target Python 3. In case it helps, I manage the Glue resources via Terraform.

What did I do wrong?

Kangoor asked Oct 18 '25 15:10



1 Answer

The way I have been able to achieve this in AWS Glue 4.0 is with the following steps. Under the Job Details tab, scroll down to Advanced Properties and expand that section. Locate the Job parameters region and add a New Parameter. For the key, enter:

--additional-python-modules

For the value, enter your package names as listed on pypi.org, for example:

PyMySQL==1.0.3,SQLAlchemy==2.0.19

or, in your case:

langdetect==1.0.9,translate==3.6.1

Separate multiple packages with commas. This is a lot easier than zipping packages and uploading them to S3.
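
Since the question mentions that the job is managed outside the console, here is a minimal sketch of building the same key/value pair in code. The helper name additional_modules_argument is hypothetical, and the boto3 call in the comments uses placeholder job, role, and script names; it is only meant to show where the parameter would go, not a tested deployment.

```python
def additional_modules_argument(packages):
    """Build the Glue job parameter for pip-installable modules.

    `packages` maps a PyPI package name to a pinned version string.
    (Hypothetical helper, not part of any AWS library.)
    """
    # The value is a single comma-separated string, e.g. "a==1.0,b==2.0".
    value = ",".join(f"{name}=={version}" for name, version in packages.items())
    return {"--additional-python-modules": value}


job_arguments = additional_modules_argument(
    {"langdetect": "1.0.9", "translate": "3.6.1"}
)
print(job_arguments)

# Assumed usage with boto3 (all names below are placeholders):
# import boto3
# glue = boto3.client("glue")
# glue.create_job(
#     Name="my-job",
#     Role="my-glue-role",
#     Command={
#         "Name": "glueetl",
#         "ScriptLocation": "s3://my-bucket/script.py",
#         "PythonVersion": "3",
#     },
#     GlueVersion="4.0",
#     DefaultArguments=job_arguments,
# )
```

With Terraform, the same pair would presumably go into the default_arguments map of the aws_glue_job resource.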

James Leighton answered Oct 20 '25 04:10



