Automated cherry pick of #93044: Initialize scheduler's podInformer in sharedInformerFactory #105015

alculquicondor · 2021-09-14T17:23:10Z

Cherry pick of #93044 and #93566 on release-1.19.

#93044: Initialize scheduler's podInformer in sharedInformerFactory

#93566: Optimize the use of informer for scheduler

/sig scheduling
/kind bug

Because of the use of 2 separate informers, the scheduler might try to schedule a Pod, fail and never retry, because it thinks that the Pod doesn't exist. Reality was that the Pod was not yet added to the 2nd informer.

This is solved by using a single informer. This fix was originally introduced in 1.20 as a memory optimization, but it turns out to guarantee correct behavior.

#93566 is supposed to just be a refactoring. However, since we don't this would be the last 1.19 release, it's better to include it, as we have signal that both changes work well in production.

Fixes #104521

Fix a regression in 1.19 where a race condition in scheduler could make a new unschedulable Pod never be retried.

Scheduler's specific podInfomer is now initialized inside the sahredInformerFactory.

k8s-ci-robot · 2021-09-14T17:23:18Z

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alculquicondor · 2021-09-14T17:30:53Z

/remove-sig apps

alculquicondor · 2021-09-14T17:32:03Z

/assign @liggitt
for test

/priority critical-urgent

alculquicondor · 2021-09-14T17:47:22Z

/retest

liggitt · 2021-09-14T17:52:18Z

pkg/scheduler/scheduler.go

@@ -311,9 +306,6 @@ func initPolicyFromConfigMap(client clientset.Interface, policyRef *schedulerapi

 // Run begins watching and scheduling. It waits for cache to be synced, then starts scheduling and blocked until the context is done.
 func (sched *Scheduler) Run(ctx context.Context) {
-	if !cache.WaitForCacheSync(ctx.Done(), sched.scheduledPodsHasSynced) {


we block for shared informers to sync before this point, right?

yes, that's in cmd/kube-scheduler/app/server.go#Run

Huang-Wei · 2021-09-14T19:18:42Z

/lgtm

Huang-Wei · 2021-09-14T19:18:55Z

/kind bug

liggitt · 2021-09-14T19:29:45Z

does this code now match the state of 1.20, or were there follow up changes/fixes? ideally, this would take advantage of the signal we have in 1.20 (especially since this is the last 1.19 release), and #93044 (comment) sounds like #93566 also touched what shipped in 1.20 for this

Huang-Wei · 2021-09-14T19:34:53Z

@liggitt #93566 is just a cleanup, so doesn't quite need to be back ported.

liggitt · 2021-09-14T19:37:50Z

@liggitt #93566 is just a cleanup, so doesn't quite need to be back ported.

we also thought #93044 was just an optimization... are we positive the change made in #93566 doesn't modify behavior in any way?

pkg/scheduler/eventhandlers.go

justaugustus · 2021-09-14T20:25:04Z

FYI @kubernetes/release-engineering
ref: https://kubernetes.slack.com/archives/C2C40FMNF/p1631640903180200

/hold until discussion is resolved

alculquicondor · 2021-09-14T20:50:30Z

I agree with @Huang-Wei that #93566 was just supposed to be a refactoring. But we are safer to cherry-picking the change too as the intermediate state hasn't made it to any release and thus has no production signal.

Huang-Wei · 2021-09-14T21:11:30Z

are we positive the change made in #93566 doesn't modify behavior in any way?

I read that #93566's key idea is to use a more well-encapsulated NewFilteredPodInformer() function to replace cache.NewListWatchFromClient(). Underneath they work the same. So I feel positive.

But I'm not going to oppose cherry-picking #93566 if we're more comfortable with containing it - it won't have side effects.

alculquicondor · 2021-09-14T21:11:39Z

/retest

alculquicondor · 2021-09-14T21:22:44Z

/retest

liggitt · 2021-09-14T22:36:34Z

/lgtm
/approve

k8s-ci-robot · 2021-09-14T22:37:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-scheduler/OWNERS~~ [alculquicondor,liggitt]
~~pkg/scheduler/OWNERS~~ [alculquicondor,liggitt]
~~test/OWNERS~~ [liggitt]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alculquicondor · 2021-09-14T23:07:16Z

I believe this should be merged in this patch cycle (1.19.15).
Without this patch, it is possible that some Pods remain unschedulable indefinitely. The only workaround possible is to restart the scheduler, which is hidden for most cloud providers.

liggitt · 2021-09-14T23:11:02Z

I agree. This is a regression from 1.18 that was accidentally resolved by this change that merged to 1.20. The regression needs to be resolved and this change has had soak time in 1.20 since 1.20.0

justaugustus · 2021-09-14T23:17:05Z

Approving this post-deadline cherry pick based on the comments from @alculquicondor and @liggitt:

I believe this should be merged in this patch cycle (1.19.15).
Without this patch, it is possible that some Pods remain unschedulable indefinitely. The only workaround possible is to restart the scheduler, which is hidden for most cloud providers.

I agree. This is a regression from 1.18 that was accidentally resolved by this change that merged to 1.20. The regression needs to be resolved and this change has had soak time in 1.20 since 1.20.0

justaugustus · 2021-09-14T23:17:28Z

/hold cancel

Initialize scheduler's podInformer in sharedInformerFactory

461448a

Scheduler's specific podInfomer is now initialized inside the sahredInformerFactory.

k8s-ci-robot added this to the v1.19 milestone Sep 14, 2021

k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 14, 2021

k8s-ci-robot requested review from damemi and foxish September 14, 2021 17:24

k8s-ci-robot removed the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Sep 14, 2021

k8s-ci-robot assigned liggitt Sep 14, 2021

k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 14, 2021

liggitt reviewed Sep 14, 2021

View reviewed changes

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 14, 2021

liggitt reviewed Sep 14, 2021

View reviewed changes

pkg/scheduler/eventhandlers.go Show resolved Hide resolved

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021

optimize the use of informer for scheduler

1c49fd9

k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 14, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2021

justaugustus added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 14, 2021

k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Sep 14, 2021

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021

k8s-ci-robot merged commit 3445faa into kubernetes:release-1.19 Sep 14, 2021

Huang-Wei mentioned this pull request Sep 15, 2021

Pods stuck in Pending status - kube-scheduler 1.19.13 #104521

Closed

liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Apr 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated cherry pick of #93044: Initialize scheduler's podInformer in sharedInformerFactory #105015

Automated cherry pick of #93044: Initialize scheduler's podInformer in sharedInformerFactory #105015

alculquicondor commented Sep 14, 2021 •

edited by liggitt

k8s-ci-robot commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

liggitt Sep 14, 2021

alculquicondor Sep 14, 2021

Huang-Wei commented Sep 14, 2021

Huang-Wei commented Sep 14, 2021

liggitt commented Sep 14, 2021

Huang-Wei commented Sep 14, 2021

liggitt commented Sep 14, 2021

justaugustus commented Sep 14, 2021

alculquicondor commented Sep 14, 2021 •

edited

Huang-Wei commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

liggitt commented Sep 14, 2021

k8s-ci-robot commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

liggitt commented Sep 14, 2021

justaugustus commented Sep 14, 2021

justaugustus commented Sep 14, 2021

Automated cherry pick of #93044: Initialize scheduler's podInformer in sharedInformerFactory #105015

Automated cherry pick of #93044: Initialize scheduler's podInformer in sharedInformerFactory #105015

Conversation

alculquicondor commented Sep 14, 2021 • edited by liggitt

k8s-ci-robot commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

liggitt Sep 14, 2021

Choose a reason for hiding this comment

alculquicondor Sep 14, 2021

Choose a reason for hiding this comment

Huang-Wei commented Sep 14, 2021

Huang-Wei commented Sep 14, 2021

liggitt commented Sep 14, 2021

Huang-Wei commented Sep 14, 2021

liggitt commented Sep 14, 2021

justaugustus commented Sep 14, 2021

alculquicondor commented Sep 14, 2021 • edited

Huang-Wei commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

liggitt commented Sep 14, 2021

k8s-ci-robot commented Sep 14, 2021

alculquicondor commented Sep 14, 2021

liggitt commented Sep 14, 2021

justaugustus commented Sep 14, 2021

justaugustus commented Sep 14, 2021

alculquicondor commented Sep 14, 2021 •

edited by liggitt

alculquicondor commented Sep 14, 2021 •

edited