
Google AI Platform training - wait for the job to finish

I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:

gcloud ai-platform jobs submit training ... 

Each process then has to wait for its job to finish before moving on to the next step. To do this, I tried adding the --stream-logs flag to the command above, which streams all the logs until the job is done.

The problem is that with so many parallel processes, I exhaust the quota for log-read requests:

Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' 
of service 'logging.googleapis.com'

But I do not actually need to stream the logs; I just need a way to make the process wait until the training job is done. Is there a smarter, simpler way of doing this?

asked Oct 19 '25 by Matteo Felici
1 Answer

I've just found that I can use the Google API Python client (googleapiclient) to launch and monitor the job:

import time

from googleapiclient import discovery

# Training job configuration
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # ... other training inputs
}

job_name = 'your_job_name'
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}

project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

# Build a client for the AI Platform Training API
cloudml = discovery.build('ml', 'v1')

# Submit the training job
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
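As a side note, request.execute() raises googleapiclient.errors.HttpError if the submission fails (for instance, on a duplicate jobId or a quota error), so it may be worth catching that. A minimal sketch, assuming the request object from above:

from googleapiclient.errors import HttpError

try:
    response = request.execute()
except HttpError as err:
    # err.resp.status carries the HTTP status code of the failed call
    print(f'Job submission failed with status {err.resp.status}: {err}')
    raise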

Now I can set up a loop that polls the job state every 60 seconds. The job passes through QUEUED and PREPARING before RUNNING, so I check for the terminal states rather than just RUNNING:

# Poll until the job reaches a terminal state
state = response['state']
while state not in ('SUCCEEDED', 'FAILED', 'CANCELLED'):

    time.sleep(60)
    status_req = cloudml.projects().jobs().get(
        name=f'{project_id}/jobs/{job_name}')

    state = status_req.execute()['state']

    print(state)
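Since several parallel processes need this, the polling could be wrapped in a small blocking helper that also raises if the job did not succeed. A minimal sketch (wait_for_job is a name of my own, not part of the API; errorMessage is the field the Job resource uses for failure details):

def wait_for_job(cloudml, project_id, job_name, poll_seconds=60):
    """Block until the given AI Platform job reaches a terminal state."""
    job_path = f'{project_id}/jobs/{job_name}'
    while True:
        # jobs().get returns the full Job resource, including 'state'
        job = cloudml.projects().jobs().get(name=job_path).execute()
        state = job['state']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(poll_seconds)
    if state != 'SUCCEEDED':
        raise RuntimeError(
            f"Job {job_name} ended in state {state}: "
            f"{job.get('errorMessage', 'no error message')}")
    return job

Each parallel process would then just call wait_for_job(cloudml, project_id, job_name) after submitting its job.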
answered Oct 21 '25 by Matteo Felici

