
AWS EKS NodeGroup "Create failed": Instances failed to join the kubernetes cluster

I am able to create an EKS cluster, but when I try to add node groups, I receive a "Create failed" error with details: "NodeCreationFailure": Instances failed to join the kubernetes cluster

I tried a variety of instance types and larger volume sizes (60 GB) without luck. Looking at the EC2 instances, I only see the problem below. However, it is difficult to do anything about it since I'm not launching the EC2 instances directly (the EKS NodeGroup UI wizard is doing that).

How would one move forward, given that the failure happens before I can even jump onto the EC2 machines and "fix" them?

Amazon Linux 2

Kernel 4.14.198-152.320.amzn2.x86_64 on an x86_64

ip-187-187-187-175 login:
[ 54.474668] cloud-init[3182]: One of the configured repositories failed (Unknown),
[ 54.475887] cloud-init[3182]: and yum doesn't have enough cached data to continue. At this point the only
[ 54.478096] cloud-init[3182]: safe thing yum can do is fail. There are a few ways to work "fix" this:
[ 54.480183] cloud-init[3182]: 1. Contact the upstream for the repository and get them to fix the problem.
[ 54.483514] cloud-init[3182]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
[ 54.485198] cloud-init[3182]:    upstream. This is most often useful if you are using a newer
[ 54.486906] cloud-init[3182]:    distribution release than is supported by the repository (and the
[ 54.488316] cloud-init[3182]:    packages for the previous distribution release still work).
[ 54.489660] cloud-init[3182]: 3. Run the command with the repository temporarily disabled
[ 54.491045] cloud-init[3182]:        yum --disablerepo= ...
[ 54.491285] cloud-init[3182]: 4. Disable the repository permanently, so yum won't use it by default. Yum
[ 54.493407] cloud-init[3182]:    will then just ignore the repository until you permanently enable it
[ 54.495740] cloud-init[3182]:    again or use --enablerepo for temporary usage:
[ 54.495996] cloud-init[3182]:        yum-config-manager --disable
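
For reference, this is how I have been pulling diagnostics without SSH access (the cluster, node group, and instance IDs below are placeholders):

    # Ask EKS why the node group is unhealthy
    aws eks describe-nodegroup \
      --cluster-name my-cluster \
      --nodegroup-name my-nodegroup \
      --query 'nodegroup.health.issues'

    # Grab the serial console output of one of the failed instances
    aws ec2 get-console-output \
      --instance-id i-0123456789abcdef0 \
      --output text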

asked Oct 24 '20 by CoderOfTheNight


4 Answers

I noticed there was no answer here, but about 2k visits to this question over the last six months. There seem to be a number of reasons why you could be seeing these failures. To regurgitate the AWS documentation found here: https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html

  • The aws-auth-cm.yaml file does not have the correct IAM role ARN for your nodes. Ensure that the node IAM role ARN (not the instance profile ARN) is specified in your aws-auth-cm.yaml file. For more information, see Launching self-managed Amazon Linux nodes. (A quick way to check this is sketched after this list.)

  • The ClusterName in your node AWS CloudFormation template does not exactly match the name of the cluster you want your nodes to join. Passing an incorrect value to this field results in an incorrect configuration of the node's /var/lib/kubelet/kubeconfig file, and the nodes will not join the cluster.

  • The node is not tagged as being owned by the cluster. Your nodes must have the following tag applied to them, where <cluster-name> is replaced with the name of your cluster:

    Key:   kubernetes.io/cluster/<cluster-name>
    Value: owned
  • The nodes may not be able to access the cluster using a public IP address. Ensure that nodes deployed in public subnets are assigned a public IP address. If not, you can associate an Elastic IP address to a node after it's launched. For more information, see Associating an Elastic IP address with a running instance or network interface. If the public subnet is not set to automatically assign public IP addresses to instances deployed to it, then we recommend enabling that setting. For more information, see Modifying the public IPv4 addressing attribute for your subnet. If the node is deployed to a private subnet, then the subnet must have a route to a NAT gateway that has a public IP address assigned to it.

  • The STS endpoint for the Region that you're deploying the nodes to is not enabled for your account. To enable the region, see Activating and deactivating AWS STS in an AWS Region.

  • The worker node does not have a private DNS entry, resulting in the kubelet log containing a node "" not found error. Ensure that the VPC where the worker node is created has values set for domain-name and domain-name-servers as Options in a DHCP options set. The default values are domain-name:<region>.compute.internal and domain-name-servers:AmazonProvidedDNS. For more information, see DHCP options sets in the Amazon VPC User Guide.
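
A quick way to verify the first and third bullets (the aws-auth mapping and the ownership tag) from a terminal; the cluster name and instance ID below are placeholders:

    # Check that the node IAM role ARN (not the instance profile ARN)
    # appears under mapRoles in the aws-auth ConfigMap
    kubectl -n kube-system get configmap aws-auth -o yaml

    # Check that the instance carries the cluster ownership tag
    aws ec2 describe-tags \
      --filters "Name=resource-id,Values=i-0123456789abcdef0" \
                "Name=key,Values=kubernetes.io/cluster/my-cluster"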

I myself had an issue with the tagging, where I needed an uppercase letter. In reality, if you can use another avenue to deploy your EKS cluster, I would recommend it (eksctl, the AWS CLI, even Terraform).
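
For example, a minimal eksctl sketch for adding a managed node group (the cluster name, region, and sizes are just placeholders):

    # eksctl creates the node IAM role and applies the
    # kubernetes.io/cluster/<cluster-name> tag for you
    eksctl create nodegroup \
      --cluster my-cluster \
      --region us-east-1 \
      --name ng-1 \
      --node-type t3.medium \
      --nodes 2 \
      --managed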

answered Oct 13 '22 by Gregory Martin


Adding another reason to the list:

In my case the nodes were running in private subnets and I hadn't configured a private endpoint under API server endpoint access.

After the update, the node groups weren't updated automatically, so I had to recreate them.
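
For anyone in the same situation, enabling the private endpoint is a single call (the cluster name is a placeholder); the cluster takes a few minutes to update, and as noted above the existing node groups may still need to be recreated:

    aws eks update-cluster-config \
      --name my-cluster \
      --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=true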

answered Oct 13 '22 by RtmY


In my case, the problem was that I was deploying my node group in a private subnet, but this private subnet had no NAT gateway associated, hence no internet access. What I did was:

  1. Create a NAT gateway.

  2. Create a new route table with the following routes (the second one is the internet access route, through the NAT gateway):

  • Destination: VPC-CIDR-block, Target: local
  • Destination: 0.0.0.0/0, Target: NAT-gateway-id

  3. Associate the private subnet with the route table created in step 2.

After that, the node group joined the cluster without a problem.
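
Roughly the same steps with the AWS CLI, in case it saves someone some clicking (all IDs below are placeholders). The local VPC route is added to a new route table automatically, so only the 0.0.0.0/0 route has to be created:

    # 1. NAT gateway in a public subnet, backed by an Elastic IP
    aws ec2 create-nat-gateway \
      --subnet-id subnet-0pub1234 \
      --allocation-id eipalloc-0abc1234

    # 2. Route table with a default route through the NAT gateway
    aws ec2 create-route-table --vpc-id vpc-0abc1234
    aws ec2 create-route \
      --route-table-id rtb-0abc1234 \
      --destination-cidr-block 0.0.0.0/0 \
      --nat-gateway-id nat-0abc1234

    # 3. Associate the private subnet with that route table
    aws ec2 associate-route-table \
      --route-table-id rtb-0abc1234 \
      --subnet-id subnet-0priv1234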

answered Oct 13 '22 by manavellam


I will try to keep the answer short by highlighting a few things that commonly go wrong up front.

1. Add the IAM role that is attached to the EKS worker nodes to the aws-auth ConfigMap in the kube-system namespace. Ref

2. Log in to the worker node that was created but failed to join the cluster. Try connecting to the API server from inside using nc, e.g. nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443

3. If you are not picking the EKS node AMI from the drop-down in the AWS Console (which means you are using a launch template or launch configuration in EC2), don't forget to add the user data section to the launch template. Ref

#!/bin/bash
set -o xtrace
# Substitute your actual cluster name and any extra bootstrap arguments
/etc/eks/bootstrap.sh ${ClusterName} ${BootstrapArguments}

4. Check the EKS worker node IAM role and see that it has the appropriate policies attached; AmazonEKS_CNI_Policy is a must (see the sketch after this list).

5. Your nodes must have the following tag applied to them, where <cluster-name> is replaced with the name of your cluster: kubernetes.io/cluster/<cluster-name>: owned
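
For point 4, listing the policies attached to the node role is usually enough to spot what is missing (the role name is a placeholder):

    # Expect to see AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy and
    # AmazonEC2ContainerRegistryReadOnly attached to the worker node role
    aws iam list-attached-role-policies --role-name my-eks-node-role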

I hope your problem lies within this list.

Ref: https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html https://aws.amazon.com/premiumsupport/knowledge-center/resolve-eks-node-failures/

answered Oct 13 '22 by Kishor U