
Google AI Platform training - wait for the job to finish

I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:

gcloud ai-platform jobs submit training ... 

Each process then has to wait for its job to finish before moving on to the next step. To do this, I tried adding the --stream-logs flag to the command above, which streams all the logs until the job is done.

The problem is that with so many parallel processes, I exhaust the quota for log-read requests:

Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' 
of service 'logging.googleapis.com'

But I do not actually need to stream the logs; I just need a way to make the process wait until the training job is done. Is there a smarter, simpler way of doing this?

asked Oct 19 '25 by Matteo Felici
1 Answer

I've just found that I can use the Google API Python client (googleapiclient) to launch and monitor the job:

import time

from googleapiclient import discovery

# Training job configuration
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # ... other training inputs
}

job_name = 'your_job_name'
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}

project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

# Build a client for the AI Platform Training API
cloudml = discovery.build('ml', 'v1')

# Submit the training job
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
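As a side note, request.execute() raises googleapiclient.errors.HttpError if the submission fails (for instance, on a duplicate jobId or a quota error), so it may be worth catching that. A minimal sketch, assuming the request object from above:

from googleapiclient.errors import HttpError

try:
    response = request.execute()
except HttpError as err:
    # err.resp.status carries the HTTP status code of the failed call
    print(f'Job submission failed with status {err.resp.status}: {err}')
    raise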

Now I can set up a loop that polls the job state every 60 seconds. The job passes through QUEUED and PREPARING before RUNNING, so I check for the terminal states rather than just RUNNING:

# Poll until the job reaches a terminal state
state = response['state']
while state not in ('SUCCEEDED', 'FAILED', 'CANCELLED'):

    time.sleep(60)
    status_req = cloudml.projects().jobs().get(
        name=f'{project_id}/jobs/{job_name}')

    state = status_req.execute()['state']

    print(state)
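Since several parallel processes need this, the polling could be wrapped in a small blocking helper that also raises if the job did not succeed. A minimal sketch (wait_for_job is a name of my own, not part of the API; errorMessage is the field the Job resource uses for failure details):

def wait_for_job(cloudml, project_id, job_name, poll_seconds=60):
    """Block until the given AI Platform job reaches a terminal state."""
    job_path = f'{project_id}/jobs/{job_name}'
    while True:
        # jobs().get returns the full Job resource, including 'state'
        job = cloudml.projects().jobs().get(name=job_path).execute()
        state = job['state']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(poll_seconds)
    if state != 'SUCCEEDED':
        raise RuntimeError(
            f"Job {job_name} ended in state {state}: "
            f"{job.get('errorMessage', 'no error message')}")
    return job

Each parallel process would then just call wait_for_job(cloudml, project_id, job_name) after submitting its job.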
answered Oct 21 '25 by Matteo Felici

