
Surface more information about plugin scores in scheduler #99411

Merged
1 commit merged on Mar 4, 2021

Conversation

damemi (Contributor) commented on Feb 24, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

Often when a pod does not end up on the node a user expects, the instinct is to assume a bug in the scheduler code. In reality, it is more likely that other plugins influenced the scheduling decision by raising the score of a particular node. The scheduler can already log every score for every plugin at V(10), but that is far more information than is usually needed, at a verbosity level that is impractical in high-usage or production clusters. The key information can instead be distilled to show:

  • what plugins were most influential in this scheduling decision
  • how those plugins compare to other influential plugins
  • why one node's score may be less than another's even under assumed ideal conditions

This adds a section to the scheduler's prioritizeNodes function that breaks down the top N (currently 3) scoring plugins for each node, and shows that information alongside the average score for each of those plugins across all nodes.

The benefit is that this output is less verbose than what we currently log at V(10) (every plugin score for every node), so it can be shown at a lower log level while still providing insight into scheduling decisions.
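
For illustration, here is a minimal Go sketch of the kind of top-N selection described above. The helper name, types, and the sample values in main are assumptions made for this sketch, not the actual code added to prioritizeNodes:

package main

import (
	"fmt"
	"sort"
)

// pluginScore pairs a plugin name with its score on one node and that
// plugin's average score across all nodes (illustrative type only).
type pluginScore struct {
	Plugin       string
	Score        int64
	AverageScore float64
}

// topNPluginScores returns the n highest-scoring plugins on a single node,
// annotated with each plugin's average score across all nodes. nodeScores
// maps plugin name -> this node's score; averages maps plugin name -> that
// plugin's average score over all nodes.
func topNPluginScores(nodeScores map[string]int64, averages map[string]float64, n int) []pluginScore {
	out := make([]pluginScore, 0, len(nodeScores))
	for plugin, score := range nodeScores {
		out = append(out, pluginScore{Plugin: plugin, Score: score, AverageScore: averages[plugin]})
	}
	// Sort descending by this node's score and keep only the top n entries.
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	// Sample values loosely based on the log output shown below.
	nodeScores := map[string]int64{"NodePreferAvoidPods": 1000000, "PodTopologySpread": 200, "InterPodAffinity": 100, "TaintToleration": 33}
	averages := map[string]float64{"NodePreferAvoidPods": 1e6, "PodTopologySpread": 200, "InterPodAffinity": 52.3, "TaintToleration": 66.7}
	fmt.Printf("%+v\n", topNPluginScores(nodeScores, averages, 3))
}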

The output looks like this:

$ kubectl create -f ~/demo-pod.yaml
pod/mypodcn84p created

$ kubectl logs pod/kube-scheduler-kind-control-plane2 -n kube-system -c kube-scheduler | grep mypodcn84p
I0224 21:04:52.359490       1 eventhandlers.go:164] "Add event for unscheduled pod" pod="default/mypodcn84p"
I0224 21:04:53.475906       1 scheduling_queue.go:805] "About to try and schedule pod" pod="default/mypodcn84p"
I0224 21:04:53.475919       1 scheduler.go:457] "Attempting to schedule pod" pod="default/mypodcn84p"
I0224 21:04:53.479045       1 generic_scheduler.go:490] "Top 3 plugins for pod on node" pod="default/mypodcn84p" node="kind-worker2" scores=[{Plugin:NodePreferAvoidPods Score:1000000 AverageScore:1e+06} {Plugin:PodTopologySpread Score:200 AverageScore:200} {Plugin:InterPodAffinity Score:100 AverageScore:52.333333333333336}]
I0224 21:04:53.479440       1 generic_scheduler.go:490] "Top 3 plugins for pod on node" pod="default/mypodcn84p" node="kind-worker" scores=[{Plugin:NodePreferAvoidPods Score:1000000 AverageScore:1e+06} {Plugin:PodTopologySpread Score:200 AverageScore:200} {Plugin:TaintToleration Score:100 AverageScore:66.66666666666667}]
I0224 21:04:53.479495       1 generic_scheduler.go:490] "Top 3 plugins for pod on node" pod="default/mypodcn84p" node="kind-worker3" scores=[{Plugin:NodePreferAvoidPods Score:1000000 AverageScore:1e+06} {Plugin:PodTopologySpread Score:200 AverageScore:200} {Plugin:TaintToleration Score:100 AverageScore:66.66666666666667}]
I0224 21:04:53.479508       1 generic_scheduler.go:540] "Calculated node's final score for pod" pod="default/mypodcn84p" node="kind-worker2" score=1000486
I0224 21:04:53.479914       1 generic_scheduler.go:540] "Calculated node's final score for pod" pod="default/mypodcn84p" node="kind-worker" score=1000554
I0224 21:04:53.479949       1 generic_scheduler.go:540] "Calculated node's final score for pod" pod="default/mypodcn84p" node="kind-worker3" score=1000529
I0224 21:04:53.480367       1 default_binder.go:51] "Attempting to bind pod to node" pod="default/mypodcn84p" node="kind-worker"
I0224 21:04:53.492108       1 eventhandlers.go:201] "Delete event for unscheduled pod" pod="default/mypodcn84p"
I0224 21:04:53.492157       1 eventhandlers.go:221] "Add event for scheduled pod" pod="default/mypodcn84p"
I0224 21:04:53.510055       1 scheduler.go:602] "Successfully bound pod to node" pod="default/mypodcn84p" node="kind-worker" evaluatedNodes=6 feasibleNodes=3

Which issue(s) this PR fixes:

Ref #91633

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kube-scheduler now logs plugin scoring summaries at --v=4

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig scheduling

@k8s-ci-robot added the do-not-merge/work-in-progress, release-note, kind/feature, size/M, sig/scheduling, and cncf-cla: yes labels on Feb 24, 2021
k8s-ci-robot (Contributor) commented:

@damemi: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage and needs-priority labels on Feb 24, 2021
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 24, 2021
@damemi changed the title from "[WIP] Surface more information about plugin scores in scheduler" to "Surface more information about plugin scores in scheduler" on Feb 24, 2021
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Feb 24, 2021
@damemi force-pushed the scheduler-score-logging branch 2 times, most recently from a6f83f0 to 3f54936, on February 24, 2021 at 20:57
// Summarize all scores.
result := make(framework.NodeScoreList, 0, len(nodes))
totalPluginScores := make(map[string]int64)

for i := range nodes {
	result = append(result, framework.NodeScore{Name: nodes[i].Name, Score: 0})
	for j := range scoresMap {
Member:

please rename j to something more meaningful. I think it refers to plugin here?

Member:

Come to think of it, this is not a very efficient implementation of these nested loops, for two reasons:

  1. we can save lookups by doing:
    for plugin, nodeScores := range scoresMap
    
  2. Iterating over the map in the outer loop and nodes in the inner loop would keep the slices closer to faster caches in the CPU.

Can you fix this in a separate PR? It would be nice if it showed up in performance benchmarks, but the plugins are probably far slower to run. Maybe it will show up in some profiles.
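
A minimal sketch of the restructuring being suggested here, written as a hypothetical helper (the function name, factoring, and package line are assumptions; the actual cleanup was deferred to a follow-up PR):

package core // assumed: same package as generic_scheduler.go

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// sumScores accumulates per-node totals and per-plugin totals with plugins in
// the outer loop: the scoresMap lookup happens once per plugin rather than
// once per (node, plugin) pair, and each inner loop walks one contiguous
// NodeScoreList, which is friendlier to CPU caches. It relies on
// scoresMap[plugin] being index-aligned with nodes, as in the original code.
func sumScores(nodes []*v1.Node, scoresMap framework.PluginToNodeScores) (framework.NodeScoreList, map[string]int64) {
	result := make(framework.NodeScoreList, len(nodes))
	for i := range nodes {
		result[i] = framework.NodeScore{Name: nodes[i].Name, Score: 0}
	}
	totalPluginScores := make(map[string]int64)
	for plugin, nodeScores := range scoresMap {
		for i := range nodeScores {
			result[i].Score += nodeScores[i].Score
			totalPluginScores[plugin] += nodeScores[i].Score
		}
	}
	return result, totalPluginScores
}

The follow-up PR referenced below (#99807) is where this cleanup was actually proposed.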

Contributor Author:

Yeah, I'll open another PR for that

Member:

Don't forget this please :)

/lgtm

Contributor Author:

Opened here: #99807

// Summarize all scores.
result := make(framework.NodeScoreList, 0, len(nodes))
totalPluginScores := make(map[string]int64)
Member:

this is adding overhead to the production code path even when V(4) is not satisfied. You have to do it inside the conditional, even if it repeats code.

Member:

+1, so that totalPluginScores[j] += scoresMap[j][i].Score below does not always cost.

Contributor Author:

I moved totalPluginScores[j] += scoresMap[j][i].Score under a V(4) check, but the variable initialization doesn't work if it's put into a conditional context

Member:

I was suggesting you do the variable initialization and the iteration over scoresMap inside the big conditional below, to keep all the changes related to logging in a single location.

Member:

this is still unresolved

Contributor Author:

Oh, I think I get what you mean. I moved this to logPluginScores with its own iteration over scoresMap. Is that what you meant?
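
As a sketch of what such a helper could look like: the logPluginScores name comes from the comment above, but the signature and body below are assumptions rather than the merged code. The point is that all of the summary work, including the extra iteration over scoresMap, sits behind the verbosity check:

package core // assumed: same package as generic_scheduler.go

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// logPluginScores is a hypothetical helper: nothing below the early return is
// computed unless V(4) logging is enabled.
func logPluginScores(pod *v1.Pod, nodes []*v1.Node, scoresMap framework.PluginToNodeScores) {
	if !klog.V(4).Enabled() {
		return
	}
	// Average each plugin's score across all nodes, inside the verbosity check
	// so the extra pass over scoresMap costs nothing at lower verbosity.
	pluginAverages := make(map[string]float64)
	for plugin, nodeScores := range scoresMap {
		var total int64
		for _, ns := range nodeScores {
			total += ns.Score
		}
		if len(nodeScores) > 0 {
			pluginAverages[plugin] = float64(total) / float64(len(nodeScores))
		}
	}
	for i, node := range nodes {
		// Collect this node's score for every plugin (scoresMap[plugin] is
		// index-aligned with nodes); the merged change would trim this to the
		// top 3 before logging.
		nodeScores := make(map[string]int64, len(scoresMap))
		for plugin, list := range scoresMap {
			nodeScores[plugin] = list[i].Score
		}
		klog.V(4).InfoS("Plugin scores for pod on node",
			"pod", klog.KObj(pod), "node", node.Name, "scores", nodeScores, "averages", pluginAverages)
	}
}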

pkg/scheduler/core/generic_scheduler.go (outdated review thread, resolved)
for _, node := range nodes {
	nodeToPluginScores[node.Name] = make(framework.NodeScoreList, len(scoresMap))
}
// Convert the scoresMap (which contains Plugins->NodeScores) to the Nodes->PluginScores map
Member:

we don't really have a Nodes->PluginScores map, right?

}

// Enhanced plugin score logging. The goal of this block is to show the highest scoring plugins on each node,
// and the average score for those plugins across all nodes.
Member:

I want to argue that providing this information is not always helpful; in some cases the lower-scoring plugins may be the culprit for why the scheduler doesn't choose the node.

Contributor Author:

I agree, there are some cases where seeing the lowest-scoring plugins will be more helpful. I chose to go with the top scoring plugins because there are a lot of cases where someone asks, "why didn't the pod go on X node when I have this plugin enabled?" and the answer is because another plugin outscored their desired plugin (usually balanced resource allocation). On the flip side, that also means their plugin scored low. So maybe there is some way to show both?

Member:

I think the top scores are almost always enough. At least if their plugin of interest is missing, they can figure out that it's scoring low. And if they really want to know everything, they can enable V(10).

Contributor Author:

And if they really want to know everything, they can enable V(10)

That's a good point too

pkg/scheduler/core/generic_scheduler.go (outdated review thread, resolved)
@@ -492,7 +535,7 @@ func (g *genericScheduler) prioritizeNodes(
 		}
 	}
 
-	if klog.V(10).Enabled() {
+	if klog.V(4).Enabled() {
Member:

why change this?

Contributor Author:

I think this information matches well with the info that I'm exposing in this PR. The highest scoring plugins themselves are a direct contributor to the total score for that node, so I'm lowering it from V(10), which is the level at which we log every score for every plugin.

Member:

sgtm

@damemi force-pushed the scheduler-score-logging branch 2 times, most recently from 741b545 to 9f20ba2, on February 25, 2021 at 17:57
damemi (Contributor Author) commented on Mar 2, 2021

/retest

alculquicondor (Member) left a comment:

just nits left :)

pkg/scheduler/core/generic_scheduler.go (three additional outdated review threads, resolved)
damemi (Contributor Author) commented on Mar 4, 2021

Bump, any more feedback on this?

@k8s-ci-robot added the lgtm label on Mar 4, 2021