kubelet: Rejected pods should be filtered from admission #104817
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
A pod that has been rejected by admission will have status manager set the phase to Failed locally, which may take some time to propagate to the apiserver. The rejected pod will be included in admission until the apiserver propagates the change back, which was an unintended regression when checking pod worker state as authoritative. A pod that is terminal in the API may still be consuming resources on the system, so it should still be included in admission.
Found another wrinkle - the previous fix in #104577 started excluding pods that are terminal but not yet terminated from admission. For instance, a two-container RestartNever pod that has container 1 fail with exit code 1 can have the Failed phase applied immediately, but container 2 can still be terminating. The resources in container 2 are still "live" and can't be handed out to others, so the pod should still show up in admission. Of note, anyone doing resource allocation in admission needs to assume any pod that is still partially running consumes "all resources" in that pod, because we don't have a granular way in the Kubelet to communicate that.
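To make that last point concrete, here is a minimal sketch (the helper name and shape are mine, not from this PR) of what resource accounting in admission has to assume: every pod still in the active list, including a terminal pod whose containers are still terminating, counts at its full requests.

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// sumCPURequests totals the CPU requests of all active pods. A terminal
// pod that is still tearing down a container must be counted in full,
// because the kubelet cannot report which of its containers have already
// released their resources.
func sumCPURequests(activePods []*v1.Pod) resource.Quantity {
	var total resource.Quantity
	for _, pod := range activePods {
		for _, c := range pod.Spec.Containers {
			if req, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
				total.Add(req)
			}
		}
	}
	return total
}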
pods[2].UID: true,
pods[3].UID: true,
pods[4].UID: true,
// pod that is running but has been rejected by admission is excluded
This can only happen during "soft" admission, correct?
If it was rejected during hard admission then, AFAIU, the pod would never enter the running phase in the first place?
It can happen on a kubelet restart. There's nothing "special" about hard admission except that the kubelet never reruns it outside of a restart, but must run it on a restart, and there's no real guarantee that something allowed previously is allowed on restart. Concrete example - the restart is to change config flags on the kubelet that then cause the pod's node selector labels to be wrong.
I see, thanks for the explanation, makes sense.
// inactive (may have been rejected by Kubelet admission)
if status, ok := kl.statusManager.GetPodStatus(p.UID); ok {
	isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
}
Just to confirm my understanding: the idea is that isTerminal, when computed by reading the pod phase directly, i.e.
isTerminal := p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed
vs when reading from the statusManager
if status, ok := kl.statusManager.GetPodStatus(p.UID); ok {
	isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
}
would differ when the kubelet rejects the pod during pod admission. In that case, only the status manager would see the pod in a terminal phase? Is that correct, and is that why these two checks are required?
Think about this as "how stale is the status we're getting as input". If a pod on the apiserver is Running, and then the Kubelet decides to reject it, it takes some non-zero time for the Kubelet to write the status back to the apiserver so that it shows up in the watch and then comes back down to the Kubelet. For that amount of time, "status" is authoritative, and so the Kubelet knows that the pod is rejected but the apiserver doesn't. If you restarted the Kubelet right after the decision was made but before the apiserver got the write, the new kubelet has no history other than what is on the apiserver, so it's going to try to start the pod again.
Conversely, if a control plane controller says "hey, this failed because " the Kubelet should honor it (which is the first check) and it overrides the local status (for instance, if the local Kubelet saw the pod was "Succeeded" at the same time a control plane controller marked it "Failed", the correct outcome is "Failed", because both of those are terminal states for the phase state machine and the first one wins).
The last wrinkle is a static pod. A static pod has no other source of truth than a file on the local disk, so Kubelet considers it "always running", and so the status manager can reject it but only until the kubelet is restarted (since the mirror pod isn't "truth" for the pod, it's a copy of the truth the kubelet holds).
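Summarizing that ordering as code - a rough sketch stitching together the diff excerpts in this PR (the surrounding loop is paraphrased, not the verbatim change):

// 1. Honor the API server first: a terminal phase written by the control
//    plane wins, since Succeeded/Failed are one-way transitions.
isTerminal := p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed
if !isTerminal {
	// 2. Fall back to the local status manager, which may be ahead of the
	//    apiserver, e.g. for a pod the kubelet just rejected in admission.
	if status, ok := kl.statusManager.GetPodStatus(p.UID); ok {
		isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
	}
}
// 3. A terminal pod that is still tearing down containers is still
//    consuming node resources, so exclude it only once no termination
//    is actively in progress.
if isTerminal && !kl.podWorkers.IsPodTerminationRequested(p.UID) {
	continue // inactive: drop from the active list used by admission
}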
Thanks a lot for the explanation, it's much clearer now. I think adding a comment to summarize your explanation here would be really helpful. Maybe something along the lines of:
// Consider pods terminal if they are in the succeeded or failed phase; terminal pods are considered inactive UNLESS they are actively terminating.
// Pods can be seen in a terminal state by two sources which may not necessarily be in sync with each other: the API server and the local kubelet status manager. First check the pod phase as seen by the API server, then also check the local kubelet status manager, which may have a more up-to-date phase and should take precedence.
/lgtm
/kind bug
/priority important-soon
…4817-upstream-release-1.22 Manual cherry pick of #104817: kubelet: Rejected pods should be filtered from admission
		isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
	}
}
if isTerminal && !kl.podWorkers.IsPodTerminationRequested(p.UID) {
Why should we filter out pods that are actively terminating here? I think we should honor the status provided by the apiserver or status manager.
Suppose a succeeded pod is still executing syncTerminating and a new pod is scheduled for the idle resources; the new pod will then be rejected by the AdmitHandler.
@smarterclayton
Discovered while reviewing implications of #104577 on static pods and admission - the way a rejection is recorded isn't by excluding the pod from the active list in pod manager, it's by reporting it to statusManager. That was regressed by #103668 (we started allowing pods that were rejected but not yet propagated to the apiserver to be included), and #104577 was not complete.
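For reference, the recording path looks roughly like this (a hedged sketch of the kubelet's rejection handler as described above; treat the names and fields as approximate): the pod stays in the pod manager, and the Failed phase is written to the status manager, which later syncs it to the apiserver - exactly the window the statusManager check covers.

// rejectPod records the admission failure locally rather than removing
// the pod: an event is emitted and a Failed phase is pushed into the
// status manager for later propagation to the apiserver.
func (kl *Kubelet) rejectPod(pod *v1.Pod, reason, message string) {
	kl.recorder.Eventf(pod, v1.EventTypeWarning, reason, message)
	kl.statusManager.SetPodStatus(pod, v1.PodStatus{
		Phase:   v1.PodFailed,
		Reason:  reason,
		Message: "Pod " + message,
	})
}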
Created #104824 to reference the broader issue - that not all pods are part of this loop.