kubelet: Handle UID reuse in pod worker #104847
Conversation
In theory this fixes the problem, but you have to wait for a reconcile loop of ~90s. A slightly more complex implementation might queue the next pod; I'll take a look at that. Ideally SyncKnownPods would be able to restart workers, but SyncKnownPods doesn't get passed "the pods the pod worker should know about" today. It needs to be "the admitted pods that should be running", which is a subset of what is in the pod manager.
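The restart path being discussed can be sketched as a worker map keyed by pod UID, where an event arriving for an already-terminated UID is treated as UID reuse and the stale worker is purged on the next SyncKnownPods. This is a minimal illustration, not the kubelet's actual implementation; the types, states, and method shapes here are all assumptions.

```go
package main

import "fmt"

// Hypothetical pod worker states; names are illustrative only.
type workerState int

const (
	syncing workerState = iota
	terminating
	terminated
)

type podWorker struct {
	state workerState
	// restartRequested marks that an event arrived for this UID after
	// termination began, i.e. the UID was likely reused.
	restartRequested bool
}

type podWorkers struct {
	workers map[string]*podWorker // keyed by pod UID
}

// UpdatePod records a create/add/update event. If the worker for this
// UID is already terminating or terminated, assume the UID was reused
// and mark the worker so SyncKnownPods can purge its history.
func (p *podWorkers) UpdatePod(uid string) {
	w, ok := p.workers[uid]
	if !ok {
		p.workers[uid] = &podWorker{state: syncing}
		return
	}
	if w.state == terminating || w.state == terminated {
		w.restartRequested = true
	}
}

// SyncKnownPods removes workers for terminated pods whose UID was
// reused, returning the UIDs whose config can now be replayed.
func (p *podWorkers) SyncKnownPods() []string {
	var restartable []string
	for uid, w := range p.workers {
		if w.state == terminated && w.restartRequested {
			delete(p.workers, uid)
			restartable = append(restartable, uid)
		}
	}
	return restartable
}

func main() {
	p := &podWorkers{workers: map[string]*podWorker{}}
	p.UpdatePod("uid-1")                  // first create
	p.workers["uid-1"].state = terminated // pod was killed
	p.UpdatePod("uid-1")                  // same UID seen again: reuse
	fmt.Println(p.SyncKnownPods())        // [uid-1]
}
```

The point of returning the purged UIDs is that the caller (the config loop, in this sketch) can re-dispatch the pod rather than silently treating the old termination as final.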
OK, after thinking through this some more I'm fairly convinced this is safe. Builds on top of #104817 (which renames a bit of the admission logic).
A pod that restarts this way will wait at most the housekeeping loop period (2s) between being terminated and starting again.
/priority critical-urgent
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: derekwaynecarr, smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
/test pull-kubernetes-node-kubelet-serial-containerd
/hold cancel
@smarterclayton: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
TestCustomResourceCascadingDeletion flake in pull-kubernetes-integration
/test pull-kubernetes-integration
…04847-upstream-release-1.22 Automated cherry pick of #104847: kubelet: Handle UID reuse in pod worker
A pod that has been rejected by admission will have the status manager set its phase to Failed locally, which may take some time to propagate to the apiserver. The rejected pod will be included in admission until the apiserver propagates the change back, which was an unintended regression from treating pod worker state as authoritative. A pod that is terminal in the API may still be consuming resources on the system, so it should still be included in admission. [ehashman] Rebased on top of kubernetes#104847.
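The admission rule above (a pod that is terminal in the API but still holds resources on the node must still count against admission) can be sketched as follows. The types and field names here are hypothetical, chosen only to illustrate the rule, and are not the kubelet's real data structures.

```go
package main

import "fmt"

// Illustrative pod phase values (mirroring API conventions).
type phase string

const (
	running phase = "Running"
	failed  phase = "Failed"
)

type trackedPod struct {
	phase             phase
	resourcesReleased bool // true once teardown has finished on the node
}

// podsForAdmission returns the pods that must be charged against node
// resources when admitting a new pod. A pod that is terminal in the
// API but has not released its resources is still included, so a
// replacement cannot be admitted before the resources are free.
func podsForAdmission(pods map[string]trackedPod) []string {
	var out []string
	for name, p := range pods {
		if p.phase == failed && p.resourcesReleased {
			continue // terminal and fully torn down: exclude
		}
		out = append(out, name)
	}
	return out
}

func main() {
	pods := map[string]trackedPod{
		"web":  {phase: running},
		"job":  {phase: failed, resourcesReleased: false}, // still holds resources
		"done": {phase: failed, resourcesReleased: true},
	}
	fmt.Println(len(podsForAdmission(pods))) // 2
}
```

Note the asymmetry: exclusion is gated on actual resource release, not on the API phase, which is exactly why a locally-Failed pod that hasn't finished teardown still blocks admission.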
If a pod is killed (no longer wanted) and a subsequent create/add/update event is then seen in the pod worker, assume the pod UID was reused (as it can be for static pods). Have the next SyncKnownPods after the pod terminates remove the worker history so that the config loop can restart the static pod, and return to the caller the fact that this termination was not final.
A pod that restarts this way will wait at most the housekeeping loop period (2s) between being terminated and starting again.
/kind bug
/sig node
Fixes #104648
TODO: