Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: Rejected pods should be filtered from admission #104817

Merged
merged 1 commit into from Sep 10, 2021

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Sep 7, 2021

A pod that has been rejected by admission will have status manager set the phase to Failed locally, which make take some time to propagate to the apiserver. The rejected pod will be included in admission until the apiserver propagates the change back, which was an unintended regression when checking pod worker state as authoritative.

A pod that is terminal in the API may still be consuming resources on the system, so it should still be included in admission.

Discovered while reviewing implications of #104577 on static pods and admission - the way a rejection is recorded isn't by excluding the pod from the active list in pod manager, it's by reporting it to statusManager. That was regressed by #103668 (we started allowing pods that were rejected but not yet propagated to the apiserver to be included), and #104577 was not complete.

Created #104824 to reference the broader issue - that not all pods are part of this loop.

Fix a 1.22 regression where a pod that the Kubelet rejects was still considered as being accepted for a brief period of time after rejection, which might cause some pods to be rejected briefly that could fit on the node.  A pod that is still terminating (but has status indicating it has failed) may also still be consuming resources and so should also be considered.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 7, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 7, 2021
@liggitt
Copy link
Member

liggitt commented Sep 7, 2021

cc @SergeyKanzhelev @bobbypage

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 7, 2021
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 7, 2021
A pod that has been rejected by admission will have status manager
set the phase to Failed locally, which make take some time to
propagate to the apiserver. The rejected pod will be included in
admission until the apiserver propagates the change back, which
was an unintended regression when checking pod worker state as
authoritative.

A pod that is terminal in the API may still be consuming resources
on the system, so it should still be included in admission.
@smarterclayton smarterclayton changed the title WIP: kubelet: Rejected pods should be filtered from admission kubelet: Rejected pods should be filtered from admission Sep 8, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2021
@smarterclayton
Copy link
Contributor Author

smarterclayton commented Sep 8, 2021

Found another wrinkle - the previous fix in #104577 started excluding pods that are terminal but not yet terminated from admission. For instance, a two container RestartNever pod that has container 1 fail with exit code 1 can have Failed phase applied immediately, but container 2 can still be terminating. The resources in container 2 are still "live" and can't be handed out to others, so the pod should show up in admission as OtherPods

Of note, anyone doing resource allocation in admission needs to assume any pod that is still partially running consumes "all resources" in that pod, because we don't have a granular way in the Kubelet to communicate that.

pods[2].UID: true,
pods[3].UID: true,
pods[4].UID: true,
// pod that is running but has been rejected by admission is excluded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can only happen during "soft" admission correct?

If it was rejected during hard admission then AFAIU the pod would never enter running phase in first place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can happen on a kubelet restart. There's nothing "special" about hard admission except that the kubelet never reruns it outside of a restart, but must run it on a restart, and there's no real guarantee that something allowed preivously is allowed on restart. Concrete example - the restart is to change config flags on the kubelet that then cause the pod node selector labels to be wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for explanation, makes sense.

// inactive (may have been rejected by Kubelet admision)
if status, ok := kl.statusManager.GetPodStatus(p.UID); ok {
isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm my understanding, the idea is that isTerminal when reading the pod phase i.e.

isTerminal := p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed

vs when reading from the statusManager

if status, ok := kl.statusManager.GetPodStatus(p.UID); ok {
				isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
			}

would differ when kubelet would reject the pod during pod admission. In that case, only status manager would see the pod in terminal phase? Is that correct and why these two checks are required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about this as "how stale is the status we're getting as input". If a pod on the apiserver is Running, and then the Kubelet decides to reject it, it takes some non-zero time for Kubelet to write the status back to apiserver so that it shows up in the watch and then comes back down to the Kubelet. For that amount of time, "status" is authoritative and so the Kubelet knows that the pod is rejected but the apiserver doesn't. If you restarted the Kubelet right after the decision was made but before the apiserver got the write, the new kubelet has no history other than what is on the apiserver, so it's going to start trying to start the pod again.

Conversely, if a control plane controller says "hey this failed because " the Kubelet should honor it (which is the first check) and it overrides the local status (for instance, if local Kubelet saw the pod was "Succeeded" at the same time a control plane controller marked it "Failed", the correct outcome is "Failed", because both of those are terminal states for the phase state machine and first one wins).

The last wrinkle is a static pod. A static pod has no other source of truth than a file on the local disk, so Kubelet considers it "always running", and so the status manager can reject it but only until the kubelet is restarted (since the mirror pod isn't "truth" for the pod, it's a copy of the truth the kubelet holds).

Copy link
Member

@bobbypage bobbypage Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank a lot for the explanation, it's much more clear now. I think adding some comment to summarize your explanation here would be really helpful. Maybe something along the lines of:

// Consider pods as terminal if they are in succeeded of failed phase; terminal pods are considered inactive UNLESS they are actively terminating
// Pods can be put seen in terminal state by two sources which may not necessarily be in sync with each other: API server and local kubelet status manager. First, check pod phase as seen from api-server and also check local kubelet status manager (which may have a more up to date phase state) which should take precedence. 

@ehashman ehashman added this to Triage in SIG Node PR Triage Sep 9, 2021
@bobbypage
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2021
@pacoxu
Copy link
Member

pacoxu commented Sep 10, 2021

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 1dcea5c into kubernetes:master Sep 10, 2021
SIG Node PR Triage automation moved this from Triage to Done Sep 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Sep 10, 2021
@ehashman
Copy link
Member

/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 10, 2021
k8s-ci-robot added a commit that referenced this pull request Nov 29, 2021
…4817-upstream-release-1.22

Manual cherry pick of #104817: kubelet: Rejected pods should be filtered from admission
@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Apr 27, 2022
isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
}
}
if isTerminal && !kl.podWorkers.IsPodTerminationRequested(p.UID) {
Copy link

@xkd045 xkd045 Nov 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here why we should filter out pods that are actively terminating? i think we should honor the status provided by apiserver or status manager.
Suppose a succeed pod which is executing syncTerminating and a new pod scheduled for the idle resource, then it will be rejected by AdmitHandler.
@smarterclayton

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants