Chef - Nodes deregistering automatically

kristy · July 11, 2021, 11:07am

We have some nodes created via EKS (Kubernetes) that register with the Chef server using the validator key method. As soon as an instance is created, it registers with Chef automatically. Two days ago we verified that the nodes were visible in the Chef web UI. Today when I checked, the nodes had been deleted automatically. How does this happen? Can anything deregister a node from the node side?

timothyT · July 11, 2021, 12:59pm

Are you using Chef Automate?

timothyT · July 11, 2021, 1:31pm

https://docs.chef.io/automate/client_runs/#managing-node-data

kristy · July 11, 2021, 3:25pm

Would the chef-client cookbook do this? Each node applies the chef-client cookbook at some interval. One more thing: does chef_guid have anything to do with it?

timothyT · July 11, 2021, 4:56pm

Where are the nodes “disappearing” from?

timothyT · July 11, 2021, 6:49pm

What is the “chef web ui”?

kristy · July 11, 2021, 8:47pm

From the Chef server. The web UI is just the Chef Manage console.

rolandHawkins · July 11, 2021, 9:54pm

Are you saying your nodes are eks pods?

kristy · July 11, 2021, 10:55pm

Not pods; we are dealing with the worker nodes. We are not monitoring pods. These workers are part of the autoscaling group. We have included the files (client.rb, validator.pem and firstboot.json) in the AMI, and a bootstrap script (AWS bootstrap) runs when new nodes are launched. When I increase the autoscaling desired count from 5 to 6, one worker node launches and registers with Chef (so we don’t have an issue with the AMI or bootstrapping). But after a couple of hours it disappears.

rolandHawkins · July 12, 2021, 12:50am

I see, OK, the EKS worker EC2 instances. In your run list, do you have the client configured to run regularly? Usually every 30-60 minutes. Automate has a data retention feature that marks hosts as missing after a set period and then removes them. The other thing to watch for: EKS managed nodes may be swapped out occasionally, and if you don’t have a system to create a unique name for each node when it joins the Chef server, the new node may be rejected because a node object with that name already exists. The validator key doesn’t have permission to replace existing nodes.

rolandHawkins · July 12, 2021, 2:03am

Here’s the bootstrap script we use to generate hostnames with a unique number at the end, along with the instance IP address. We use a Terraform template resource to supply a couple of the variables’ values. This gives the initial config, and then in our run list we have our base cookbook, which configures the official config and sets the client up to run from a systemd timer every 30 minutes.
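
The naming idea described here can be sketched roughly as follows. This is not the script from the post; it is a minimal illustration, and the helper names `make_node_name` and `client_rb_node_name`, the `eks-worker` prefix, and the exact naming scheme are all assumptions.

```shell
# Sketch: build a unique Chef node name from a prefix, the instance's
# private IP, and its EC2 instance ID. Naming scheme is an assumption.
make_node_name() {
  # $1 = prefix, $2 = private IPv4 address, $3 = EC2 instance ID
  ip_dashed=$(printf '%s' "$2" | tr '.' '-')   # 10.140.3.27 -> 10-140-3-27
  id_suffix=$(printf '%s' "$3" | tail -c 6)    # last 6 characters of the instance ID
  printf '%s-%s-%s\n' "$1" "$ip_dashed" "$id_suffix"
}

# Render the client.rb setting that pins the node name.
client_rb_node_name() {
  # $1 = node name produced by make_node_name
  printf 'node_name "%s"\n' "$1"
}
```

On an instance, the IP and instance ID could be read from the EC2 instance metadata service (`http://169.254.169.254/latest/meta-data/local-ipv4` and `.../instance-id`; IMDSv2 requires a session token first). The resulting name can then be appended to client.rb with `client_rb_node_name`, or passed directly with `chef-client -N "$NODE_NAME"`.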

kristy · July 12, 2021, 4:00am

That makes sense. I have configured chef-client to run every 30 minutes. I am using chef-client version 15.3.8. Does it include Chef Automate to do the cleanup? Also, as you mentioned, if it is running every 30 minutes, how can an existing node disappear? For example, my EKS cluster has 5 desired nodes (these 5 nodes are always available until they are deleted). If autoscaling adds 2 more nodes, the total is 7. After some time, if a scale-down happens, those 2 are deleted and have to be cleaned up using Chef Automate. In my scenario it is deleting all the worker nodes. Would it fix the issue if I stopped using the chef-client cookbook?

rolandHawkins · July 12, 2021, 5:12am

If your client registers with the Chef server but fails to converge during its first client run at bootstrap, you’ll see the node but it’ll go missing. If the run never reached the chef-client cookbook in your run list, which configures it to run every 30 minutes, then it won’t run again unless manually triggered. It also won’t have a run list assigned on the Chef server, since that gets updated when chef-client reports back the results of a successful run. My guess is you may be hitting a compile error somewhere.
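
One way to make a failed first run visible is to wrap the initial chef-client call in the bootstrap script and log its exit status. A minimal sketch, assuming a hypothetical helper named `run_first_converge` and that the first run is started with something like `chef-client -j /etc/chef/firstboot.json` (the path is a guess, not taken from this thread):

```shell
# Sketch: run the first converge and report whether it succeeded.
# run_first_converge is a hypothetical helper, not part of Chef.
run_first_converge() {
  # "$@" is the full command, e.g.: chef-client -j /etc/chef/firstboot.json
  if "$@"; then
    echo "first converge succeeded"
  else
    rc=$?   # exit status of the command that just failed
    echo "first converge FAILED with exit code $rc" >&2
    return "$rc"
  fi
}
```

From a workstation, `knife node show NODE_NAME -r` then shows whether a run list was ever saved for that node.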

rolandHawkins · July 12, 2021, 6:55am

In my script above I added exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1 to log the script’s output and help diagnose bootstrap errors.

kristy · July 12, 2021, 8:10am

I noticed one thing. As you suggested earlier, the node name is causing the problem. If the node registers as ip-10-140-…, it is deregistered automatically.