
kubeadm: do not check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight #104942

Merged
merged 1 commit into kubernetes:master on Sep 15, 2021

Conversation

@SataQiu (Member) commented Sep 12, 2021

What type of PR is this?

/kind feature
/kind bug

What this PR does / why we need it:

kubeadm: do not check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight

Which issue(s) this PR fixes:

Fixes kubernetes/kubeadm#2564

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: do not check if the '/etc/kubernetes/manifests' folder is empty on joining worker nodes during preflight

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot added the release-note, kind/feature, size/XS, kind/bug, do-not-merge/needs-kind, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels, and removed the do-not-merge/needs-kind label, on Sep 12, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SataQiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved, area/kubeadm, and sig/cluster-lifecycle labels, and removed the do-not-merge/needs-sig label, on Sep 12, 2021
@pacoxu (Member) left a comment


LGTM
It makes sense.
/cc neolit123

@pacoxu (Member) commented Sep 13, 2021

/assign fabriziopandini
Per kubernetes/kubeadm#2564 (comment)

@fabriziopandini (Member) left a comment


@SataQiu thanks for this PR!
Q: as per kubernetes/kubeadm#2564 (comment), should we also make these checks consistent between init and join?

FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeAPIServer, manifestsDir)},
FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeControllerManager, manifestsDir)},
FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeScheduler, manifestsDir)},
FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.Etcd, manifestsDir)},
@neolit123 (Member) Sep 13, 2021


these checks are control-plane checks, so it makes more sense to have them under cfg.ControlPlane != nil below.
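
For illustration, a minimal sketch of that placement (assuming the surrounding checks slice and manifestsDir variable in RunJoinNodeChecks; this is a sketch of the suggestion, not the diff that was eventually merged):

// Sketch: gate the control-plane manifest checks on joining a control plane,
// so plain worker joins never run them.
if cfg.ControlPlane != nil {
	checks = append(checks,
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeAPIServer, manifestsDir)},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeControllerManager, manifestsDir)},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeScheduler, manifestsDir)},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.Etcd, manifestsDir)},
	)
}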

@neolit123 (Member)

but something else to note here... looking at phases/join/preflight.go, RunInitNodeChecks is already called for control plane nodes:

if j.Cfg().ControlPlane != nil {
	// Checks if the cluster configuration supports
	// joining a new control plane instance and if all the necessary certificates are provided
	hasCertificateKey := len(j.CertificateKey()) > 0
	if err := checkIfReadyForAdditionalControlPlane(&initCfg.ClusterConfiguration, hasCertificateKey); err != nil {
		// outputs the not ready for hosting a new control plane instance message
		ctx := map[string]string{
			"Error": err.Error(),
		}
		var msg bytes.Buffer
		notReadyToJoinControlPlaneTemp.Execute(&msg, ctx)
		return errors.New(msg.String())
	}
	// run kubeadm init preflight checks for checking all the prerequisites
	fmt.Println("[preflight] Running pre-flight checks before initializing the new control plane instance")
	if err := preflight.RunInitNodeChecks(utilsexec.New(), initCfg, j.IgnorePreflightErrors(), true, hasCertificateKey); err != nil {
		return err
	}
}

i.e. secondary control-plane nodes already should have the same checks as the first control plane node (mostly).

one difference is the isSecondaryControlPlane flag which is true on join.

so i think the correct patch is to just remove this check:

DirAvailableCheck{Path: filepath.Join(kubeadmconstants.KubernetesDir, kubeadmconstants.ManifestsSubDirName)},

on joining worker nodes.
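
A sketch of what RunJoinNodeChecks would look like after that removal (assuming the rest of the checks slice stays as-is; the one-line deletion is the actual change in this PR):

checks := []Checker{
	// DirAvailableCheck{Path: filepath.Join(kubeadmconstants.KubernetesDir, kubeadmconstants.ManifestsSubDirName)},
	// ^ removed: joining worker nodes no longer requires an empty /etc/kubernetes/manifests
	FileAvailableCheck{Path: filepath.Join(kubeadmconstants.KubernetesDir, kubeadmconstants.KubeletKubeConfigFileName)},
	// ... remaining node-level checks unchanged ...
}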

@dlipovetsky (Contributor)

Thanks for the review @neolit123! I agree with your summary.

@SataQiu (Member, Author)

@neolit123 @dlipovetsky Thanks for your review.

But what about a scenario like this?

The user first joins a node as a secondary control-plane node, successfully:

kubeadm join xxx --token xxx \
    --discovery-token-ca-cert-hash xxx \
    --control-plane --certificate-key xxx

The user then realizes they intended to join the machine as a worker node, not a control-plane node.
So they delete the node from the cluster and run the join command again (forgetting to run the kubeadm reset command):

kubectl delete node xxx
kubeadm join xxx --token xxx --discovery-token-ca-cert-hash xxx

Not surprisingly, this command outputs errors similar to the following (even without the DirAvailableCheck):

[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
	[ERROR Port-10250]: Port 10250 is in use
	[ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists

Based on these error messages, the user manually cleans up the relevant files:

rm -rf /etc/kubernetes/kubelet.conf 
rm -rf /etc/kubernetes/pki/ca.crt
systemctl stop kubelet

Then the user runs the join command again, and this time it succeeds:

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

At this point, however, the Pods that should run only on control-plane nodes are mistakenly running on the worker node (their static Pod manifests were never removed from /etc/kubernetes/manifests). This is not the desired state of a healthy cluster.

# kubectl get node
NAME                      STATUS     ROLES                  AGE    VERSION
izhp3acgtbq11k6ldsw593z   NotReady   control-plane,master   32m    v1.22.1
izhp3acgtbq11k6ldsw594z   NotReady   <none>                 114s   v1.22.1

# kubectl get pod -A
NAMESPACE     NAME                                              READY   STATUS    RESTARTS        AGE
kube-system   coredns-7f6cbbb7b8-2l5j9                          0/1     Pending   0               32m
kube-system   coredns-7f6cbbb7b8-sglnx                          0/1     Pending   0               32m
kube-system   etcd-izhp3acgtbq11k6ldsw593z                      1/1     Running   2               33m
kube-system   etcd-izhp3acgtbq11k6ldsw594z                      1/1     Running   1               2m9s
kube-system   kube-apiserver-izhp3acgtbq11k6ldsw593z            1/1     Running   2               33m
kube-system   kube-apiserver-izhp3acgtbq11k6ldsw594z            1/1     Running   1               2m8s
kube-system   kube-controller-manager-izhp3acgtbq11k6ldsw593z   1/1     Running   2 (3m18s ago)   33m
kube-system   kube-controller-manager-izhp3acgtbq11k6ldsw594z   1/1     Running   1               2m8s
kube-system   kube-proxy-wxxpw                                  1/1     Running   1               29m
kube-system   kube-proxy-zc2zj                                  1/1     Running   0               32m
kube-system   kube-scheduler-izhp3acgtbq11k6ldsw593z            1/1     Running   4 (3m18s ago)   33m
kube-system   kube-scheduler-izhp3acgtbq11k6ldsw594z            1/1     Running   1               2m8s

@neolit123 (Member)

I guess not calling kubeadm reset after the initial control plane node join would mean that the control plane containers would continue running even after the kubectl delete of the node. Reset is sort of mandatory in all cases, and if we are still seeing users that do not know about it, perhaps it's time to start documenting it in the CLI output?

I think I agree with your assessment that worker join should error on the presence of kubeadm-managed manifests. One potentially weird use case that this would break is if someone is joining an etcd node to the cluster, i.e. a worker with an etcd static pod.

Frankly, i would have designed this slightly differently:

  • error out on the cp manifests only on cp nodes
  • depending on the kubeadm command (init / join / join cp), print a list of what additional files are in the manifests directory, in the hope that the user can adjust if something unwanted ended up there.

@neolit123 (Member) Sep 14, 2021

Q: If we remove DirAvailableCheck would users that were skipping it previously now get an error that there is no such preflight check?

Trying to refresh my memory if we had that.

EDIT: ok, it looks like kubeadm init --ignore-preflight-errors=foo does not error out that there is no such check called foo. That is good in a way, since deleting a check will never break users, but it's also not ideal: if the user makes a typo in a preflight check name they will not get any indication of that...
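
For context, a user who was skipping this check would have been passing the check's error label as the name, something like the following (my assumption about what such invocations look like; DirAvailable--etc-kubernetes-manifests is the label kubeadm prints in the preflight error):

kubeadm join xxx --token xxx --discovery-token-ca-cert-hash xxx \
    --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests

After this PR that name simply matches nothing on worker joins, and per the EDIT above kubeadm will not complain about the unknown check name.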

@dlipovetsky (Contributor)

I think I agree with your assessment that worker join should error on the presence of kubeadm-managed manifests.

+1

One potentially weird use case that this would break is if someone is joining an etcd node to the cluster, i.e. a worker with an etcd static pod.

In hindsight, kubeadm could have avoided conflicts with user-defined manifests by using a prefix, e.g. kubeadm-etcd.yaml. But even in the absence of such a prefix, I think it's reasonable for kubeadm to treat the manifests it writes, including etcd.yaml, as "reserved."

I can follow up with a docs PR to make it clear that kubeadm-managed manifests are "reserved."

@neolit123 (Member) Sep 14, 2021

@dlipovetsky are you happy with the current state of the diff?
it just removes the directory check, and one can still start a pod called etcd.yaml on a worker without any error / warning.

from your comment above i think i'm reading that join of workers should complain about reserved control-plane pod names as well.

@dlipovetsky (Contributor)

are you happy with the current state of the diff?

Yes.

from your comment above i think i'm reading that join of workers should complain about reserved control-plane pod names as well.

I was thinking out loud. I do think it's a good idea for kubeadm to treat filenames it writes as "reserved." That does more than address this bug, though, and can be a separate PR.
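
A hypothetical sketch of such a reserved-filename check on worker join (illustrative only; this is not part of this PR, and the placement inside the join checks slice is assumed):

// Hypothetical follow-up: even on worker join, fail preflight if a file with a
// kubeadm-reserved static Pod name already exists in the manifests directory.
for _, component := range []string{
	kubeadmconstants.KubeAPIServer,
	kubeadmconstants.KubeControllerManager,
	kubeadmconstants.KubeScheduler,
	kubeadmconstants.Etcd,
} {
	checks = append(checks, FileAvailableCheck{
		Path: kubeadmconstants.GetStaticPodFilepath(component, manifestsDir),
	})
}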

@neolit123 (Member)

ok, understood.

@neolit123 (Member)

kubeadm: init/join preflight check is limited to the static Pod manifest files generated by kubeadm instead of the whole manifests directory, custom static Pod manifest files will be ignored in the preflight check now

we should change the release note depending on #104942 (comment)

e.g.

kubeadm: do not check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight

@SataQiu (Member, Author) commented Sep 14, 2021

@SataQiu thanks for this PR!
Q: as per kubernetes/kubeadm#2564 (comment), should we also make these checks consistent between init and join?

@fabriziopandini

Yes, the checks in the init phase are defined here:

// RunInitNodeChecks executes all individual, applicable to control-plane node checks.
// The boolean flag 'isSecondaryControlPlane' controls whether we are running checks in a --join-control-plane scenario.
// The boolean flag 'downloadCerts' controls whether we should skip checks on certificates because we are downloading them.
// If the flag is set to true we should skip checks already executed by RunJoinNodeChecks.
func RunInitNodeChecks(execer utilsexec.Interface, cfg *kubeadmapi.InitConfiguration, ignorePreflightErrors sets.String, isSecondaryControlPlane bool, downloadCerts bool) error {
	if !isSecondaryControlPlane {
		// First, check if we're root separately from the other preflight checks and fail fast
		if err := RunRootCheckOnly(ignorePreflightErrors); err != nil {
			return err
		}
	}
	manifestsDir := filepath.Join(kubeadmconstants.KubernetesDir, kubeadmconstants.ManifestsSubDirName)
	checks := []Checker{
		NumCPUCheck{NumCPU: kubeadmconstants.ControlPlaneNumCPU},
		// Linux only
		// TODO: support other OS, if control-plane is supported on it.
		MemCheck{Mem: kubeadmconstants.ControlPlaneMem},
		KubernetesVersionCheck{KubernetesVersion: cfg.KubernetesVersion, KubeadmVersion: kubeadmversion.Get().GitVersion},
		FirewalldCheck{ports: []int{int(cfg.LocalAPIEndpoint.BindPort), kubeadmconstants.KubeletPort}},
		PortOpenCheck{port: int(cfg.LocalAPIEndpoint.BindPort)},
		PortOpenCheck{port: kubeadmconstants.KubeSchedulerPort},
		PortOpenCheck{port: kubeadmconstants.KubeControllerManagerPort},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeAPIServer, manifestsDir)},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeControllerManager, manifestsDir)},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.KubeScheduler, manifestsDir)},
		FileAvailableCheck{Path: kubeadmconstants.GetStaticPodFilepath(kubeadmconstants.Etcd, manifestsDir)},
		HTTPProxyCheck{Proto: "https", Host: cfg.LocalAPIEndpoint.AdvertiseAddress},
	}

In the init preflight check, we only check the specific manifest files, not the whole /etc/kubernetes/manifests folder.

@SataQiu changed the title from "kubeadm: init/join preflight check is limited to the static Pod manifest files generated by kubeadm instead of the whole manifests directory, custom static Pod manifest files will be ignored in the preflight check now" to "kubeadm: do no check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight" on Sep 14, 2021
@SataQiu changed the title from "kubeadm: do no check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight" to "kubeadm: do not check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight" on Sep 14, 2021
kubeadm: do not check if the /etc/kubernetes/manifests folder is empty on joining worker nodes during preflight

Signed-off-by: SataQiu <shidaqiu2018@gmail.com>
@dlipovetsky (Contributor)

Thanks @SataQiu for thinking about the details and fixing the issue!

@neolit123 (Member) left a comment


/lgtm
/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Sep 15, 2021
@k8s-ci-robot added the lgtm label on Sep 15, 2021
@k8s-ci-robot merged commit 8975906 into kubernetes:master on Sep 15, 2021
@k8s-ci-robot added this to the v1.23 milestone on Sep 15, 2021