
ClusterID vs JobFlowID on AWS EMR

I am a bit confused about the available APIs and the two identifiers. I am using boto, but I don't think that is the problem here: my question applies to any API (but not the CLI).

I start a job flow with RunJobFlow, which returns a JobFlowId. Let's assume I don't want to store that ID, but would rather find out later which job flows are running so that I can add steps to them.

I think I should be able to use DescribeJobFlows to find all jobflow_ids and proceed from there. But in the documentation (http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_DescribeJobFlows.html) this API call is marked as deprecated, and it directs us to use ListClusters, which returns cluster_ids.

What ties the two together? Are they the same identifier? If not, how can I get job flow IDs from a cluster ID?

I think the confusion also comes from the fact that in the CLI the command is "create-cluster", which returns a cluster_id, and add-steps also takes a cluster_id.

user2123288 asked Jul 06 '15 10:07




1 Answer

The cluster ID and the job flow ID are the same thing (j-######). "Cluster ID" is the more appropriate name for its purpose, since it avoids confusion with the term "job" as used in Hadoop. So go ahead and use ListClusters (http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_ListClusters.html).
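As a rough illustration, here is a minimal sketch of that workflow using boto3 (an assumption: the question mentions the older boto, whose method names differ). It lists the clusters that can still accept steps and passes one of the returned IDs as the `JobFlowId` argument of `add_job_flow_steps`. The helper names `active_cluster_ids` and `add_step`, the choice of cluster states, and the contents of `EXAMPLE_STEP` are hypothetical and only there for illustration.

```python
# Assumed set of states in which a cluster can still accept new steps.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]

# Hypothetical step definition; replace the name and arguments with your own.
EXAMPLE_STEP = {
    "Name": "example-step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["echo", "hello"],
    },
}


def active_cluster_ids(emr_client):
    """Return the IDs (j-...) of all clusters in one of ACTIVE_STATES."""
    ids = []
    # ListClusters is paginated, so walk every page of results.
    paginator = emr_client.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=ACTIVE_STATES):
        ids.extend(cluster["Id"] for cluster in page.get("Clusters", []))
    return ids


def add_step(emr_client, cluster_id, step):
    """Add one step to a cluster and return the list of new step IDs."""
    # The parameter is still called JobFlowId, but the value is the
    # cluster ID returned by ListClusters.
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


def main():
    # Imported here so the helpers above can be used without boto3 installed.
    import boto3

    # Assumes credentials and a default region are already configured.
    emr = boto3.client("emr")
    for cluster_id in active_cluster_ids(emr):
        print(cluster_id, add_step(emr, cluster_id, EXAMPLE_STEP))
```

Calling `main()` would add the example step to every active cluster in the account, so in practice you would probably filter the IDs first (for instance by cluster name) before adding steps.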

ChristopherB answered Oct 22 '22 01:10