
How to install python packages within Amazon Sagemaker Processing Job?

I am trying to create an SKLearn processing job in Amazon SageMaker to perform some data transformation on my input data before model training.

I wrote a custom Python script, preprocessing.py, which does the necessary transformations. The script imports a few Python packages. Here is the SageMaker example I followed.

When I try to submit the Processing Job, I get this error:

............................Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocessing.py", line 6, in <module>
    import snowflake.connector
ModuleNotFoundError: No module named 'snowflake.connector'

I understand that my processing job is unable to find this package and that I need to install it. My question is: how can I accomplish this using the SageMaker Processing Job API? Ideally there would be a way to pass a requirements.txt in the API call, but I don't see such functionality in the docs.

I know I can create a custom image with the relevant packages and use that image in the Processing Job, but that seems like a lot of work for something that should be built in.

Is there an easier/more elegant way to install the packages needed in a SageMaker Processing Job?

asked Apr 07 '26 by iCHAIT


2 Answers

One way is to call pip from inside your script, before the import that needs the package:

import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", package])
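Put together as a minimal sketch (the package name snowflake-connector-python is the PyPI distribution that provides snowflake.connector; the usage at the bottom is illustrative):

```python
import subprocess
import sys

def pip_install(package: str) -> None:
    """Install *package* into the environment of the interpreter running this script."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# In preprocessing.py you would call this before the failing import, e.g.:
#   pip_install("snowflake-connector-python")
#   import snowflake.connector
```

Calling sys.executable rather than a bare "pip" ensures the package lands in the same environment the processing script is running in.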

Another way is to use an SKLearn Estimator (a training job) instead, to do the same thing. You can provide a source_dir containing a requirements.txt, and those requirements will be installed for you:

estimator = SKLearn(
    entry_point="foo.py",
    source_dir="./foo",  # no trailing slash! put requirements.txt here
    framework_version="0.23-1",
    role=...,
    instance_count=1,
    instance_type="ml.m5.large",
)
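The expected layout of source_dir (file names here are illustrative) would be:

```text
foo/
├── foo.py             # the entry_point script
└── requirements.txt   # e.g. snowflake-connector-python
```

SageMaker copies the whole directory into the container and runs pip install -r requirements.txt before invoking the entry point.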
answered Apr 09 '26 by Neil McGuigan


Another option is to use a bash script instead of a Python file as the entrypoint.

I defined the entrypoint in a Step Functions state machine (the snippet below is Amazon States Language in YAML) like this:

...
MyProcessingJob:
  Type: Task
  Resource: arn:aws:states:::sagemaker:createProcessingJob.sync
  Parameters:
    AppSpecification:
      ContainerEntrypoint: ["bash", "/opt/ml/processing/input/code/start_process.sh"]
      ImageUri: "492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3"
...

The bash script could look something like this:

#!/bin/bash
set -e

# cd into the code folder
cd /opt/ml/processing/input/code

# install requirements
pip install -r requirements.txt

# start preprocessing
/miniconda3/bin/python -m entryscript --parameter value
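For this question, the requirements.txt placed alongside the script (contents hypothetical) would just need the missing connector:

```text
snowflake-connector-python
```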

This is what I used with the standard SKLearn Docker image, and it works great.

answered Apr 09 '26 by Lukas Hestermeyer


