
AWS Glue vs EMR Serverless

AWS recently announced Amazon EMR Serverless (Preview), a new and very promising service: https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/

From my understanding, AWS Glue is a managed service built on top of Apache Spark (for the transformation layer). AWS EMR is also mostly used to run Apache Spark. So EMR Serverless (for Apache Spark) looks pretty similar to AWS Glue.

Right now I have one question: what is the core difference between EMR Serverless and AWS Glue, and when should I choose EMR Serverless over Glue?

Could EMR Serverless even become part of the AWS Glue ecosystem as its transformation layer? Maybe AWS is going to replace the transformation layer in AWS Glue with EMR Serverless, which would make sense: AWS Glue would act as the ETL overlay and metastore, with EMR Serverless as the processing layer.

alexanoid asked Nov 15 '22 17:11



1 Answer

I'll give you my two cents about this because I've been wondering the same thing.

Glue

According to the AWS documentation, AWS Glue is "Simple, scalable, and serverless data integration". Glue can be used for a variety of things: as a metadata repository, for automatic schema discovery, for code generation, and for running ETL pipelines to prepare data. Glue is serverless: it provisions and manages the compute resources needed to run your data pipelines, so you don't need to create or manage any infrastructure yourself.

If we focus only on the processing feature and set aside the Glue-specific features (schema discovery, code generation, etc.), then EMR Serverless and Glue look almost identical. The key advantage of both services is the ability to run Spark applications without managing servers; EMR Serverless can also run Hive applications this way.

What advantage will EMR Serverless have over Glue Spark jobs?

To run a Glue job, you must specify either MaxCapacity (for Glue version 1.0 or earlier jobs) or the Worker type and the Number of workers (for Glue version 2.0 jobs). Both options assume, first, that you already understand the data and the workload well enough to size the job, and second, that the workload will be uniform during job execution, i.e., that the provisioned resources will be neither over- nor under-utilized.
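To make the two sizing options concrete, here is a minimal sketch in Python of how the capacity arguments differ by Glue version. The helper names `glue_capacity_args` and `create_glue_job` are made up for illustration, and the version split simply follows the rule described above; `MaxCapacity`, `WorkerType`, `NumberOfWorkers` and `GlueVersion` are parameters of boto3's `glue.create_job`. The boto3 call itself is only reached if you call `create_glue_job` with real AWS credentials.

```python
def glue_capacity_args(glue_version, max_capacity=None,
                       worker_type=None, number_of_workers=None):
    """Build the capacity-related kwargs for glue.create_job.

    Hypothetical helper: applies the version split described in the answer.
    """
    major = int(glue_version.split(".")[0])  # "0.9" -> 0, "1.0" -> 1, "2.0" -> 2
    if major < 2:
        # Glue 1.0 or earlier: capacity is a single number of DPUs.
        if max_capacity is None:
            raise ValueError("Glue 1.0 or earlier: set max_capacity (DPUs)")
        return {"GlueVersion": glue_version, "MaxCapacity": float(max_capacity)}
    # Glue 2.0: capacity is a worker type plus a fixed number of workers.
    if worker_type is None or number_of_workers is None:
        raise ValueError("Glue 2.0: set worker_type and number_of_workers")
    return {
        "GlueVersion": glue_version,
        "WorkerType": worker_type,
        "NumberOfWorkers": int(number_of_workers),
    }


def create_glue_job(name, role_arn, script_location, capacity_args):
    """Create the job; needs boto3 and AWS credentials (not run by default)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    return glue.create_job(
        Name=name,
        Role=role_arn,
        Command={"Name": "glueetl", "ScriptLocation": script_location},
        **capacity_args,
    )


# Either way, the size is fixed up front for the whole job run.
old_style = glue_capacity_args("1.0", max_capacity=10)
new_style = glue_capacity_args("2.0", worker_type="G.1X", number_of_workers=10)
```

In both cases you have to guess the right size before the job starts, which is exactly the assumption discussed above.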

EMR Serverless

EMR Serverless is a new deployment option for AWS EMR. With EMR Serverless, you don't need to configure, optimize, secure, or manage clusters to run Spark or Hive applications. EMR Serverless helps you avoid over- or under-allocating resources by sizing them at the level of individual job stages.

EMR Serverless automatically identifies the resources needed by jobs, provisions those resources to run the jobs, and releases them when the jobs are completed. In cases where applications require a response within seconds, such as interactive data analysis, you can pre-initialize the necessary resources when creating the application. Together, these features give you easy initialization, fast job startup, automatic capacity management, and simple cost control.
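As a rough illustration of pre-initialized capacity, the sketch below builds the arguments for boto3's `emr-serverless` `create_application` call. The helper names, worker sizes, counts and idle timeout are assumptions chosen for the example, not recommendations, and the `Driver`/`Executor` keys follow my reading of the API reference for Spark applications. Nothing is sent to AWS unless you call `create_spark_app` yourself.

```python
def spark_app_args(name, release_label, drivers=1, executors=2, idle_minutes=15):
    """Build kwargs for emr-serverless create_application (hypothetical helper)."""
    # Assumed worker size, used for both drivers and executors in this sketch.
    worker = {"cpu": "4 vCPU", "memory": "16 GB"}
    return {
        "name": name,
        "releaseLabel": release_label,
        "type": "SPARK",
        # Workers kept warm so that jobs can start within seconds.
        "initialCapacity": {
            "Driver": {"workerCount": drivers, "workerConfiguration": worker},
            "Executor": {"workerCount": executors, "workerConfiguration": worker},
        },
        # Stop the application after it has been idle, to limit cost.
        "autoStopConfiguration": {"enabled": True, "idleTimeoutMinutes": idle_minutes},
    }


def create_spark_app(args):
    """Create the application; needs boto3 and AWS credentials (not run by default)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("emr-serverless")
    return client.create_application(**args)


# Example values only; pick a release label that is available in your region.
app_args = spark_app_args("interactive-analysis", "emr-6.6.0", executors=4)
```

If you leave out `initialCapacity`, the application has no warm workers and EMR Serverless provisions them when a job is submitted, which is cheaper when idle but slower to start.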

More info: https://luminousmen.com/post/emr-serverless-a-400level-guide

luminousmen answered Dec 20 '22 07:12
