I'm currently running jobs on Vertex AI and I ran into the following error:
"error": {
  "code": 429,
  "message": "The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_nvidia_p4_gpus",
  "status": "RESOURCE_EXHAUSTED"
}
I first got this error last Friday, and on Monday everything worked again. Since then I have run 8 jobs, and the error has come back.
I read the Google documentation on quotas and checked the Quotas page under IAM & Admin, but I didn't really understand it. As far as I could tell, I hadn't exceeded anything. Could someone explain to me how quotas work?
The quota aiplatform.googleapis.com/custom_model_training_nvidia_p4_gpus appears to be the same as "Number of concurrent P4 GPUs for training, per region" listed in the Vertex AI quotas doc. As I understand it, this is a concurrency quota: at any given time, the training jobs you have running in a region cannot together use more P4 GPUs than the limit. So, for example, if you're training in us-central1, which has a default quota limit of 6 for P4 GPUs, all your training jobs currently running there cannot use more than 6 P4 GPUs in total.
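To make the arithmetic concrete, here is a minimal sketch of how I understand a concurrent quota to be counted. This is not Vertex AI code: TrainingJob, p4_gpus_in_use and can_start are hypothetical names made up for illustration, and the limit of 6 is just the us-central1 figure from the example above.

```python
from dataclasses import dataclass

# Assumed per-region limit, taken from the us-central1 example above.
P4_QUOTA_LIMIT = 6


@dataclass
class TrainingJob:
    """Hypothetical stand-in for a training job; not a Vertex AI class."""
    name: str
    p4_gpus: int
    running: bool


def p4_gpus_in_use(jobs):
    """Total P4 GPUs held by jobs that are still running."""
    return sum(job.p4_gpus for job in jobs if job.running)


def can_start(jobs, requested_gpus, limit=P4_QUOTA_LIMIT):
    """True if a new job asking for requested_gpus would stay within the limit."""
    return p4_gpus_in_use(jobs) + requested_gpus <= limit


jobs = [
    TrainingJob("job-a", p4_gpus=2, running=True),
    TrainingJob("job-b", p4_gpus=4, running=True),
    TrainingJob("job-c", p4_gpus=2, running=False),  # finished, no longer counted
]

print(p4_gpus_in_use(jobs))           # GPUs held by the two running jobs
print(can_start(jobs, 1))             # one more GPU would exceed the limit
print(can_start(jobs[2:], 6))         # with nothing running, 6 GPUs fit
```

If the quota works this way, only jobs that are still running count against it, which would be consistent with the error going away once earlier jobs had finished and coming back after several jobs were launched close together.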
Some options to address this: