kubelet: Rejected pods should be filtered from admission #104817

smarterclayton · 2021-09-07T16:30:30Z

A pod that has been rejected by admission will have status manager set the phase to Failed locally, which make take some time to propagate to the apiserver. The rejected pod will be included in admission until the apiserver propagates the change back, which was an unintended regression when checking pod worker state as authoritative.

A pod that is terminal in the API may still be consuming resources on the system, so it should still be included in admission.

Discovered while reviewing implications of #104577 on static pods and admission - the way a rejection is recorded isn't by excluding the pod from the active list in pod manager, it's by reporting it to statusManager. That was regressed by #103668 (we started allowing pods that were rejected but not yet propagated to the apiserver to be included), and #104577 was not complete.

Created #104824 to reference the broader issue - that not all pods are part of this loop.

Fix a 1.22 regression where a pod that the Kubelet rejects was still considered as being accepted for a brief period of time after rejection, which might cause some pods to be rejected briefly that could fit on the node.  A pod that is still terminating (but has status indicating it has failed) may also still be consuming resources and so should also be considered.

k8s-ci-robot · 2021-09-07T16:31:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

liggitt · 2021-09-07T16:37:22Z

cc @SergeyKanzhelev @bobbypage

A pod that has been rejected by admission will have status manager set the phase to Failed locally, which make take some time to propagate to the apiserver. The rejected pod will be included in admission until the apiserver propagates the change back, which was an unintended regression when checking pod worker state as authoritative. A pod that is terminal in the API may still be consuming resources on the system, so it should still be included in admission.

smarterclayton · 2021-09-08T14:30:57Z

Found another wrinkle - the previous fix in #104577 started excluding pods that are terminal but not yet terminated from admission. For instance, a two container RestartNever pod that has container 1 fail with exit code 1 can have Failed phase applied immediately, but container 2 can still be terminating. The resources in container 2 are still "live" and can't be handed out to others, so the pod should show up in admission as OtherPods

Of note, anyone doing resource allocation in admission needs to assume any pod that is still partially running consumes "all resources" in that pod, because we don't have a granular way in the Kubelet to communicate that.

bobbypage · 2021-09-08T20:53:04Z

pkg/kubelet/kubelet_test.go

-		pods[2].UID: true,
-		pods[3].UID: true,
-		pods[4].UID: true,
+	// pod that is running but has been rejected by admission is excluded


This can only happen during "soft" admission correct?

If it was rejected during hard admission then AFAIU the pod would never enter running phase in first place?

It can happen on a kubelet restart. There's nothing "special" about hard admission except that the kubelet never reruns it outside of a restart, but must run it on a restart, and there's no real guarantee that something allowed preivously is allowed on restart. Concrete example - the restart is to change config flags on the kubelet that then cause the pod node selector labels to be wrong.

I see, thanks for explanation, makes sense.

bobbypage · 2021-09-08T21:00:36Z

pkg/kubelet/kubelet_pods.go

+			// inactive (may have been rejected by Kubelet admision)
+			if status, ok := kl.statusManager.GetPodStatus(p.UID); ok {
+				isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
+			}


Just to confirm my understanding, the idea is that isTerminal when reading the pod phase i.e.

isTerminal := p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed

vs when reading from the statusManager

if status, ok := kl.statusManager.GetPodStatus(p.UID); ok { isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed }

would differ when kubelet would reject the pod during pod admission. In that case, only status manager would see the pod in terminal phase? Is that correct and why these two checks are required?

Think about this as "how stale is the status we're getting as input". If a pod on the apiserver is Running, and then the Kubelet decides to reject it, it takes some non-zero time for Kubelet to write the status back to apiserver so that it shows up in the watch and then comes back down to the Kubelet. For that amount of time, "status" is authoritative and so the Kubelet knows that the pod is rejected but the apiserver doesn't. If you restarted the Kubelet right after the decision was made but before the apiserver got the write, the new kubelet has no history other than what is on the apiserver, so it's going to start trying to start the pod again.

Conversely, if a control plane controller says "hey this failed because " the Kubelet should honor it (which is the first check) and it overrides the local status (for instance, if local Kubelet saw the pod was "Succeeded" at the same time a control plane controller marked it "Failed", the correct outcome is "Failed", because both of those are terminal states for the phase state machine and first one wins).

The last wrinkle is a static pod. A static pod has no other source of truth than a file on the local disk, so Kubelet considers it "always running", and so the status manager can reject it but only until the kubelet is restarted (since the mirror pod isn't "truth" for the pod, it's a copy of the truth the kubelet holds).

Thank a lot for the explanation, it's much more clear now. I think adding some comment to summarize your explanation here would be really helpful. Maybe something along the lines of:

// Consider pods as terminal if they are in succeeded of failed phase; terminal pods are considered inactive UNLESS they are actively terminating // Pods can be put seen in terminal state by two sources which may not necessarily be in sync with each other: API server and local kubelet status manager. First, check pod phase as seen from api-server and also check local kubelet status manager (which may have a more up to date phase state) which should take precedence.

bobbypage · 2021-09-09T21:56:09Z

/lgtm

pacoxu · 2021-09-10T01:48:21Z

/kind bug

ehashman · 2021-09-10T20:01:04Z

/priority important-soon
/triage accepted

…4817-upstream-release-1.22 Manual cherry pick of #104817: kubelet: Rejected pods should be filtered from admission

xkd045 · 2022-11-04T14:22:08Z

pkg/kubelet/kubelet_pods.go

+				isTerminal = status.Phase == v1.PodSucceeded || status.Phase == v1.PodFailed
+			}
+		}
+		if isTerminal && !kl.podWorkers.IsPodTerminationRequested(p.UID) {


Here why we should filter out pods that are actively terminating? i think we should honor the status provided by apiserver or status manager.
Suppose a succeed pod which is executing syncTerminating and a new pod scheduled for the idle resource, then it will be rejected by AdmitHandler.
@smarterclayton

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 7, 2021

k8s-ci-robot requested review from ehashman and krmayankk September 7, 2021 16:32

smarterclayton mentioned this pull request Sep 7, 2021

kubelet: Admission must exclude completed pods and avoid races #104577

Merged

smarterclayton mentioned this pull request Sep 7, 2021

Pods that are terminating but force deleted must still be accounted for in admission #104824

Open

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 7, 2021

smarterclayton force-pushed the pod_status branch from 54422b8 to 1a51ed6 Compare September 7, 2021 18:01

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 7, 2021

smarterclayton force-pushed the pod_status branch from 1a51ed6 to 17d32ed Compare September 8, 2021 14:24

smarterclayton changed the title ~~WIP: kubelet: Rejected pods should be filtered from admission~~ kubelet: Rejected pods should be filtered from admission Sep 8, 2021

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2021

bobbypage reviewed Sep 8, 2021

View reviewed changes

ehashman added this to Triage in SIG Node PR Triage Sep 9, 2021

smarterclayton mentioned this pull request Sep 9, 2021

kubelet: Handle UID reuse in pod worker #104847

Merged

1 task

k8s-ci-robot assigned bobbypage Sep 9, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2021

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 10, 2021

k8s-ci-robot merged commit 1dcea5c into kubernetes:master Sep 10, 2021

SIG Node PR Triage automation moved this from Triage to Done Sep 10, 2021

k8s-ci-robot added this to the v1.23 milestone Sep 10, 2021

ehashman mentioned this pull request Sep 10, 2021

Manual cherry pick of #104817: kubelet: Rejected pods should be filtered from admission #104918

Merged

rphillips mentioned this pull request Sep 10, 2021

Bug 2003269: UPSTREAM: 104817: kubelet: Rejected pods should be filtered from admission openshift/kubernetes#948

Merged

This was referenced Sep 11, 2021

[release-4.9] Bug 2003306: UPSTREAM: 104817: kubelet: Rejected pods should be filtered from admission openshift/kubernetes#949

Merged

Updating openshift-enterprise-pod images to be consistent with ART openshift/kubernetes#933

Merged

openshift-ci-robot mentioned this pull request Oct 6, 2021

Rebase v1.22.2 openshift/kubernetes#1002

Closed

ehashman mentioned this pull request Oct 6, 2021

Kubelet is reporting OutOfCpu on previously running workloads after restart #105523

Closed

openshift-ci-robot mentioned this pull request Oct 7, 2021

[release-4.9] Bug 2011815: UPSTREAM: 105527: kubelet: do not arbitrarily create a podSyncStatus for finished pods openshift/kubernetes#1009

Merged

This was referenced Oct 29, 2021

Bug 2018516: 4.9: bump(github.com/openshift/*): make go.{mod,sum} point to 1.22.1 openshift/kubernetes#1026

Closed

UPSTREAM: <drop>: bump openshift, k8s to 1.22.1 openshift/kubernetes#1029

Closed

k8s-ci-robot added a commit that referenced this pull request Nov 29, 2021

Merge pull request #104918 from ehashman/automated-cherry-pick-of-#10…

be40a1f

…4817-upstream-release-1.22 Manual cherry pick of #104817: kubelet: Rejected pods should be filtered from admission

liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Apr 27, 2022

aojea mentioned this pull request Apr 29, 2022

"Terminated" pod on shutdown node listed in service edpoints. #109718

Closed

xkd045 reviewed Nov 4, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kubelet: Rejected pods should be filtered from admission #104817

kubelet: Rejected pods should be filtered from admission #104817

smarterclayton commented Sep 7, 2021 •

edited by liggitt

k8s-ci-robot commented Sep 7, 2021

liggitt commented Sep 7, 2021

smarterclayton commented Sep 8, 2021 •

edited

bobbypage Sep 8, 2021

smarterclayton Sep 9, 2021

bobbypage Sep 9, 2021

bobbypage Sep 8, 2021

smarterclayton Sep 9, 2021 •

edited

bobbypage Sep 9, 2021 •

edited

bobbypage commented Sep 9, 2021

pacoxu commented Sep 10, 2021

ehashman commented Sep 10, 2021

xkd045 Nov 4, 2022 •

edited

kubelet: Rejected pods should be filtered from admission #104817

kubelet: Rejected pods should be filtered from admission #104817

Conversation

smarterclayton commented Sep 7, 2021 • edited by liggitt

k8s-ci-robot commented Sep 7, 2021

liggitt commented Sep 7, 2021

smarterclayton commented Sep 8, 2021 • edited

bobbypage Sep 8, 2021

Choose a reason for hiding this comment

smarterclayton Sep 9, 2021

Choose a reason for hiding this comment

bobbypage Sep 9, 2021

Choose a reason for hiding this comment

bobbypage Sep 8, 2021

Choose a reason for hiding this comment

smarterclayton Sep 9, 2021 • edited

Choose a reason for hiding this comment

bobbypage Sep 9, 2021 • edited

Choose a reason for hiding this comment

bobbypage commented Sep 9, 2021

pacoxu commented Sep 10, 2021

ehashman commented Sep 10, 2021

xkd045 Nov 4, 2022 • edited

Choose a reason for hiding this comment

smarterclayton commented Sep 7, 2021 •

edited by liggitt

smarterclayton commented Sep 8, 2021 •

edited

smarterclayton Sep 9, 2021 •

edited

bobbypage Sep 9, 2021 •

edited

xkd045 Nov 4, 2022 •

edited