Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated cherry pick of #93044: Initialize scheduler's podInformer in sharedInformerFactory #105015

Conversation

alculquicondor
Copy link
Member

@alculquicondor alculquicondor commented Sep 14, 2021

Cherry pick of #93044 and #93566 on release-1.19.

#93044: Initialize scheduler's podInformer in sharedInformerFactory

#93566: Optimize the use of informer for scheduler

/sig scheduling
/kind bug

Because of the use of 2 separate informers, the scheduler might try to schedule a Pod, fail and never retry, because it thinks that the Pod doesn't exist. Reality was that the Pod was not yet added to the 2nd informer.

This is solved by using a single informer. This fix was originally introduced in 1.20 as a memory optimization, but it turns out to guarantee correct behavior.

#93566 is supposed to just be a refactoring. However, since we don't this would be the last 1.19 release, it's better to include it, as we have signal that both changes work well in production.

Fixes #104521

Fix a regression in 1.19 where a race condition in scheduler could make a new unschedulable Pod never be retried.

Scheduler's specific podInfomer is now initialized inside the sahredInformerFactory.
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Sep 14, 2021
@k8s-ci-robot k8s-ci-robot added do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 14, 2021
@k8s-ci-robot
Copy link
Contributor

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 14, 2021
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Sep 14, 2021
@alculquicondor
Copy link
Member Author

/remove-sig apps

@k8s-ci-robot k8s-ci-robot removed the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Sep 14, 2021
@alculquicondor
Copy link
Member Author

/assign @liggitt
for test

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 14, 2021
@alculquicondor
Copy link
Member Author

/retest

@@ -311,9 +306,6 @@ func initPolicyFromConfigMap(client clientset.Interface, policyRef *schedulerapi

// Run begins watching and scheduling. It waits for cache to be synced, then starts scheduling and blocked until the context is done.
func (sched *Scheduler) Run(ctx context.Context) {
if !cache.WaitForCacheSync(ctx.Done(), sched.scheduledPodsHasSynced) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we block for shared informers to sync before this point, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's in cmd/kube-scheduler/app/server.go#Run

@Huang-Wei
Copy link
Member

/lgtm

@Huang-Wei
Copy link
Member

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 14, 2021
@liggitt
Copy link
Member

liggitt commented Sep 14, 2021

does this code now match the state of 1.20, or were there follow up changes/fixes? ideally, this would take advantage of the signal we have in 1.20 (especially since this is the last 1.19 release), and #93044 (comment) sounds like #93566 also touched what shipped in 1.20 for this

@Huang-Wei
Copy link
Member

@liggitt #93566 is just a cleanup, so doesn't quite need to be back ported.

@liggitt
Copy link
Member

liggitt commented Sep 14, 2021

@liggitt #93566 is just a cleanup, so doesn't quite need to be back ported.

we also thought #93044 was just an optimization... are we positive the change made in #93566 doesn't modify behavior in any way?

@justaugustus
Copy link
Member

FYI @kubernetes/release-engineering
ref: https://kubernetes.slack.com/archives/C2C40FMNF/p1631640903180200

/hold until discussion is resolved

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021
@alculquicondor
Copy link
Member Author

alculquicondor commented Sep 14, 2021

I agree with @Huang-Wei that #93566 was just supposed to be a refactoring. But we are safer to cherry-picking the change too as the intermediate state hasn't made it to any release and thus has no production signal.

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 14, 2021
@Huang-Wei
Copy link
Member

are we positive the change made in #93566 doesn't modify behavior in any way?

I read that #93566's key idea is to use a more well-encapsulated NewFilteredPodInformer() function to replace cache.NewListWatchFromClient(). Underneath they work the same. So I feel positive.

But I'm not going to oppose cherry-picking #93566 if we're more comfortable with containing it - it won't have side effects.

@alculquicondor
Copy link
Member Author

/retest

1 similar comment
@alculquicondor
Copy link
Member Author

/retest

@liggitt
Copy link
Member

liggitt commented Sep 14, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2021
@alculquicondor
Copy link
Member Author

I believe this should be merged in this patch cycle (1.19.15).
Without this patch, it is possible that some Pods remain unschedulable indefinitely. The only workaround possible is to restart the scheduler, which is hidden for most cloud providers.

@liggitt
Copy link
Member

liggitt commented Sep 14, 2021

I agree. This is a regression from 1.18 that was accidentally resolved by this change that merged to 1.20. The regression needs to be resolved and this change has had soak time in 1.20 since 1.20.0

@justaugustus
Copy link
Member

Approving this post-deadline cherry pick based on the comments from @alculquicondor and @liggitt:

I believe this should be merged in this patch cycle (1.19.15).
Without this patch, it is possible that some Pods remain unschedulable indefinitely. The only workaround possible is to restart the scheduler, which is hidden for most cloud providers.

I agree. This is a regression from 1.18 that was accidentally resolved by this change that merged to 1.20. The regression needs to be resolved and this change has had soak time in 1.20 since 1.20.0

@justaugustus justaugustus added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 14, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Sep 14, 2021
@justaugustus
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021
@k8s-ci-robot k8s-ci-robot merged commit 3445faa into kubernetes:release-1.19 Sep 14, 2021
@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants