
AWS Batch analog in GCP?

I was using AWS and am new to GCP. One feature I used heavily was AWS Batch, which automatically creates a VM when the job is submitted and deletes the VM when the job is done. Is there a GCP counterpart? Based on my research, the closest is GCP Dataflow. The GCP Dataflow documentation led me to Apache Beam. But when I walk through the examples here (link), it feels totally different from AWS Batch.

Any suggestions on submitting jobs for batch processing in GCP? My requirement is simply to retrieve data from Google Cloud Storage, analyze the data using a Python script, and then write the result back to Google Cloud Storage. The process can run overnight, and I don't want the VM sitting idle after the job has finished while I'm asleep.

Hung-Yi Wu asked Jul 06 '18 18:07



5 Answers

I recommend checking out dsub. It's an open-source tool initially developed by the Google Genomics teams for doing batch processing on Google Cloud.

Paul Billing-Ross answered Oct 13 '22 12:10


The product that best suits your use case in GCP is Cloud Tasks. We are using it for a similar use case, where we retrieve files from another HTTP server and, after some processing, store them in Google Cloud Storage.

This GCP documentation describes in full detail the steps to create and use tasks.

You schedule your tasks programmatically in Cloud Tasks, and you have to create task handlers (worker services) in App Engine. There are some limitations for worker services running in App Engine:

  • the standard environment:

    • Automatic scaling: task processing must finish in 10 minutes.
    • Manual and basic scaling: requests can run up to 24 hours.
  • the flex environment: all scaling types have a 60-minute timeout.
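
To make those limits concrete, here is a small Python sketch that checks which App Engine worker configurations could accommodate a job of a given length. The deadline values are simply copied from the list above and may have changed since, so treat them as assumptions; the dictionary and function names are made up for illustration.

```python
# Request deadlines for App Engine task handlers, in minutes,
# as quoted in the list above (verify against the current docs).
HANDLER_DEADLINE_MINUTES = {
    ("standard", "automatic"): 10,
    ("standard", "manual"): 24 * 60,
    ("standard", "basic"): 24 * 60,
    ("flex", "any"): 60,
}


def workers_that_fit(job_minutes):
    """Return the (environment, scaling) pairs whose deadline covers the job."""
    return sorted(
        key
        for key, limit in HANDLER_DEADLINE_MINUTES.items()
        if job_minutes <= limit
    )


if __name__ == "__main__":
    # An overnight job of roughly 8 hours only fits manual or basic scaling.
    for environment, scaling in workers_that_fit(8 * 60):
        print(environment, scaling)
```

With these numbers, an overnight job rules out automatic scaling and the flex environment, which is worth knowing before committing to Cloud Tasks for long-running work.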

Ilyas answered Oct 13 '22 13:10


You can do this using AI Platform Jobs, which can now run arbitrary Docker images:

gcloud ai-platform jobs submit training $JOB_NAME \
       --scale-tier BASIC \
       --region $REGION \
       --master-image-uri gcr.io/$PROJECT_ID/some-image

You can define the master instance type and even additional worker instances if desired. Google should consider creating a sibling product without the AI buzzword so people can find this functionality more easily.
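
If you would rather kick the job off from a Python script (say, from a scheduler) than type the command by hand, something along these lines might work. It only wraps the gcloud command shown above and uses no flags beyond those; the job name, region, and image in the example are placeholders, and the function names are hypothetical.

```python
import subprocess


def build_submit_command(job_name, region, image_uri, scale_tier="BASIC"):
    """Assemble the gcloud invocation shown above as an argument list."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--scale-tier", scale_tier,
        "--region", region,
        "--master-image-uri", image_uri,
    ]


def submit(job_name, region, image_uri):
    """Run the command; raises CalledProcessError if gcloud exits non-zero."""
    subprocess.run(build_submit_command(job_name, region, image_uri), check=True)


if __name__ == "__main__":
    # Placeholder values; substitute your own job name, region, and image.
    command = build_submit_command(
        "nightly_job_001", "us-central1", "gcr.io/my-project/some-image"
    )
    print(" ".join(command))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if a job name or image URI ever contains unusual characters.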

Cristian Garcia answered Oct 13 '22 12:10


UPDATE: I have now used this service and I think it's awesome.

As of July 13, 2022, GCP has its own new fully managed batch processing service (GCP Batch), which seems very akin to AWS Batch.

See the GCP Blog post announcing it at: https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud (with links to docs as well)
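
For the use case in the question (run a Python script in a container, then let the VM go away), a job definition could look roughly like the sketch below. It generates a JSON config with Python. The field names follow the Batch job schema as I understand it from the docs, so double-check them there; the image URI, script name, bucket paths, and machine type are all hypothetical.

```python
import json


def make_batch_job_config(image_uri, command, machine_type="e2-standard-4",
                          max_run_seconds=12 * 3600):
    """Build a one-task Batch job that runs `command` inside a container."""
    return {
        "taskGroups": [
            {
                "taskCount": 1,
                "taskSpec": {
                    "runnables": [
                        {
                            "container": {
                                "imageUri": image_uri,
                                "entrypoint": "/bin/sh",
                                "commands": ["-c", command],
                            }
                        }
                    ],
                    # Upper bound on task run time, as a duration string.
                    "maxRunDuration": f"{max_run_seconds}s",
                },
            }
        ],
        "allocationPolicy": {
            "instances": [{"policy": {"machineType": machine_type}}]
        },
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    config = make_batch_job_config(
        image_uri="gcr.io/my-project/analysis:latest",  # hypothetical image
        command="python analyze.py gs://my-bucket/in gs://my-bucket/out",  # hypothetical script
    )
    # Redirect this output to job.json to use it with gcloud.
    print(json.dumps(config, indent=2))
```

The resulting file would then be submitted with something like `gcloud batch jobs submit JOB_NAME --location REGION --config job.json`. As far as I can tell, Batch provisions the VM for the job and cleans it up afterwards, which is the behaviour the question asks for.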

Max Power answered Oct 13 '22 12:10


Officially, according to the "Map AWS services to Google Cloud Platform products" page, there is no direct equivalent, but you can put a few things together that might get you close.

I wasn't sure whether you are running, or have the option to run, your Python code in Docker. If so, the Kubernetes controls might do the trick. From the GCP docs:

Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).

So, if you are running other managed instances anyway, you can scale the node pool down to 0 and back up, but at least one Kubernetes node stays active to run the system Pods.

I'm guessing you are already using something like "Creating API Requests and Handling Responses" to get an ID with which you can verify that the process has started, the instance has been created, and the payload is being processed. You can use that same mechanism to report that the process has completed as well. That takes care of creating the instance and launching the Python script.

You could use Cloud Pub/Sub to keep track of the job's state: can you modify your Python script to publish a notification when the task completes? Just as you report when the task is created and the instance is launched, you can report that the Python job is complete and then kick off an instance teardown process.
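
A minimal sketch of that completion notification, assuming the google-cloud-pubsub client library is installed and credentials are available on the VM. The payload layout, project ID, topic name, and bucket path are all invented for illustration; whatever subscribes to the topic would be responsible for the actual teardown.

```python
import json
import time


def build_completion_message(job_id, status, output_uri):
    """Encode a small JSON status payload; Pub/Sub message data must be bytes."""
    payload = {
        "job_id": job_id,
        "status": status,
        "output_uri": output_uri,
        "finished_at": int(time.time()),
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish_completion(project_id, topic_id, data):
    """Publish the payload; needs google-cloud-pubsub and valid credentials."""
    # Imported lazily so the rest of the script runs without the package.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    # publish() returns a future; result() blocks until the message ID is known.
    return publisher.publish(topic_path, data=data).result()


if __name__ == "__main__":
    message = build_completion_message(
        "nightly-001", "SUCCEEDED", "gs://my-bucket/results/nightly-001.csv"
    )
    print(message.decode("utf-8"))
    # Hypothetical project and topic names; uncomment once they exist:
    # publish_completion("my-project", "job-status", message)
```

Calling `publish_completion` as the last step of the analysis script (including in an error handler, with a different status) gives the teardown process a single signal to wait for.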

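Another way to cut costs is to use Preemptible VM instances: they run at a much lower price (roughly half or less, depending on the machine type) and for a maximum of 24 hours anyway. Bear in mind that they can also be stopped early if Compute Engine needs the capacity.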

Hope that helps.

Roy Tokeshi answered Oct 13 '22 14:10