I've built an AI Platform pipeline with a lot of parallel processes. Each process launches a training job on the AI Platform, like this:
gcloud ai-platform jobs submit training ...
Each process then has to wait for the job to finish before moving on to the next step. To do this, I tried adding the --stream-logs flag to the command above, which streams all the logs until the job is done.
The problem is that, with so many parallel processes, I run out of quota for log read requests:
Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute'
of service 'logging.googleapis.com'
But I do not need to actually stream the logs, I just need a way to tell the process to "wait" until the training job is done. Is there a smarter and simpler way of doing this?
I've just found that I can use the Python API to launch and monitor the job:
from googleapiclient import discovery

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'n1-standard-8',
    # ...
}
job_spec = {'jobId': 'your_job_name', 'trainingInput': training_inputs}
project_name = 'your-project'
project_id = 'projects/{}'.format(project_name)

# Build a client for the AI Platform Training API and submit the job
cloudml = discovery.build('ml', 'v1')
request = cloudml.projects().jobs().create(
    body=job_spec,
    parent=project_id
)
response = request.execute()
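Now I can set up a loop that checks the job state every 60 seconds. A new job starts out as QUEUED or PREPARING rather than RUNNING, so the loop waits for a terminal state instead of waiting for RUNNING to end:
import time

# States after which the job will not change any more
terminal_states = ('SUCCEEDED', 'FAILED', 'CANCELLED')
job_path = '{}/jobs/{}'.format(project_id, job_spec['jobId'])
state = None
while state not in terminal_states:
    time.sleep(60)
    status_req = cloudml.projects().jobs().get(name=job_path)
    state = status_req.execute()['state']
    print(state)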
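If many parallel processes need the same waiting logic, the polling can be wrapped in a small helper. The following is only a sketch under a few assumptions: `wait_for_job`, `get_state`, `poll_seconds` and `timeout_seconds` are names made up for illustration, not part of the AI Platform client, and the terminal states are assumed to be SUCCEEDED, FAILED and CANCELLED. The state lookup is passed in as a callable so the helper can be tried out without touching the API.

```python
import time

# Assumed terminal job states; the job no longer changes after these.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=60, timeout_seconds=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Call get_state() repeatedly until it returns a terminal state.

    get_state is a hypothetical zero-argument callable that returns the
    current job state as a string. Raises TimeoutError if timeout_seconds
    is set and elapses before the job reaches a terminal state.
    """
    deadline = None if timeout_seconds is None else clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if deadline is not None and clock() >= deadline:
            raise TimeoutError("job still in state {!r}".format(state))
        sleep(poll_seconds)


# Possible usage with the client from the snippet above (not executed here):
# final_state = wait_for_job(
#     lambda: cloudml.projects().jobs().get(
#         name="{}/jobs/{}".format(project_id, job_spec["jobId"])
#     ).execute()["state"],
#     timeout_seconds=6 * 3600,
# )
```

Each call to `get_state` is one request to the jobs API and none to the logging API, so this kind of polling should not count against the 'Read requests' quota of logging.googleapis.com. A longer `poll_seconds` lowers the request rate further when many jobs run at once.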