Is it possible to build an AWS EMR cluster with a master node and a set of task (slave) nodes, without core nodes, when I am sure that the source data is in S3 and the processed result will be stored in S3?
Basically, the question is: what is the need for a DataNode process when EMR is going to process data in S3 (where we do not store or use anything in HDFS)?
Limitations of an EMR cluster with multiple master nodes: If any two master nodes fail simultaneously, Amazon EMR cannot recover the cluster. Amazon EMR clusters with multiple master nodes are not tolerant to Availability Zone failures.
EMR Studio is a fully managed application with single sign-on, fully managed Jupyter Notebooks, automated infrastructure provisioning, and the ability to debug jobs without logging into the AWS Console or cluster.
EMR is not one of the services offered in the free tier. If you are just learning how Spark works, you don't need an EMR cluster. You can play around on a t2.micro.
Core nodes in EMR provide compute resources as well as HDFS. In Hadoop 2.x the compute side is provided by the YARN NodeManager and the HDFS storage by the DataNode. Even if an application's input and output are both on S3, YARN (and often other application layers such as Hive) uses HDFS to stage jars, split info, session data, etc. So the DataNode process is still needed even when no job data is kept in HDFS.
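To make that concrete, here is a minimal sketch of a cluster layout that keeps a small core group for HDFS staging and puts the rest of the compute on task nodes. The helper name `build_instance_groups`, the group names, and the `m5.xlarge` default are assumptions for illustration; only the dictionary keys follow the `InstanceGroups` shape that boto3's `run_job_flow` accepts. It builds the configuration only and does not call AWS.

```python
from typing import Any, Dict, List


def build_instance_groups(task_count: int,
                          core_count: int = 1,
                          instance_type: str = "m5.xlarge") -> List[Dict[str, Any]]:
    """Build an InstanceGroups list: one master, a small core group, N task nodes.

    Hypothetical helper; instance type and group names are illustrative only.
    """
    if core_count < 1:
        # Core nodes run the HDFS DataNode that YARN uses for staging.
        raise ValueError("keep at least one core node for HDFS staging")
    if task_count < 0:
        raise ValueError("task_count must not be negative")

    groups: List[Dict[str, Any]] = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "Market": "ON_DEMAND",
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
            "Market": "ON_DEMAND",
        },
    ]
    if task_count > 0:
        # Task nodes add compute (NodeManager) without hosting HDFS blocks.
        groups.append({
            "Name": "task",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
            "Market": "ON_DEMAND",
        })
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(task_count=4):
        print(group["InstanceRole"], group["InstanceCount"])
```

The resulting list would typically be passed as `Instances={"InstanceGroups": groups, ...}` to `run_job_flow` on a boto3 EMR client, along with the other required arguments (cluster name, release label, IAM roles), which are left out here.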