Fix bug in CPUManager with race on container map access #97427

Merged

Contributor

@klueska klueska commented Dec 21, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes a sporadic bug that causes the kubelet to crash:

Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: fatal error: concurrent map iteration and map write
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: goroutine 141 [running]:
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: runtime.throw(0x43a5919, 0x26)
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]:         /usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc0036eaca8 sp=0xc0036eac78 pc=0x432eb2
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: runtime.mapiternext(0xc0036eadb0)
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]:         /usr/local/go/src/runtime/map.go:858 +0x579 fp=0xc0036ead30 sp=0xc0036eaca8 pc=0x412db9
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/containermap.ContainerMap.GetContainerID(0xc000cae360, 0xc002671350, 0x24, 0xc0038c15a0, 0xb, 0xc0038c15a0, 0xb, 0x0, 0x0)
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/containermap/container_map.go:57 +0x9e fp=0xc0036eae20 sp=0xc0036ead30 pc=0x1b2806e
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/containermap.ContainerMap.RemoveByContainerRef(...)
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]:         /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/containermap/container_map.go:49
Dec 19 12:01:13 ngcdgx2k8s0094d kubelet[69329]: k8s.io/kubernetes/pkg/kubelet/cm/cpumanager.(*manager).policyRemoveContainerByRef(0xc00065c990, 0xc002671350, 0x24, 0xc0038c15a0, 0xb, 0x2, 0x2)
Fixed bug in CPUManager with race on container map access

Signed-off-by: Kevin Klues <kklues@nvidia.com>
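
For readers less familiar with this failure mode, here is a minimal sketch of the pattern the fix relies on, not the kubelet code itself. The `manager` type, its `containers` field, and the `addContainer`, `lookupID`, and `reconcile` methods are all hypothetical stand-ins. The only point is that a `range` over a Go map and a write to the same map must not overlap, so the caller takes the same mutex around the iterating lookup that writers take around their writes.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the CPUManager: one mutex guards
// a plain map from container ID to container name.
type manager struct {
	sync.Mutex
	containers map[string]string
}

// addContainer writes to the map while holding the lock.
func (m *manager) addContainer(id, name string) {
	m.Lock()
	defer m.Unlock()
	m.containers[id] = name
}

// lookupID iterates over the map. It takes no lock itself, so the caller
// must already hold m's lock.
func (m *manager) lookupID(name string) (string, bool) {
	for id, n := range m.containers {
		if n == name {
			return id, true
		}
	}
	return "", false
}

// reconcile takes the lock before each iterating lookup, so the iteration
// cannot run at the same time as a write from addContainer.
func (m *manager) reconcile(names []string) (found int) {
	for _, name := range names {
		m.Lock()
		_, ok := m.lookupID(name)
		m.Unlock()
		if ok {
			found++
		}
	}
	return found
}

func main() {
	m := &manager{containers: make(map[string]string)}
	names := []string{"c0", "c1", "c2", "c3"}

	var wg sync.WaitGroup
	for i := range names {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.addContainer(fmt.Sprintf("id-%d", i), names[i])
		}(i)
	}
	// Runs concurrently with the writers above; safe because both sides lock.
	m.reconcile(names)
	wg.Wait()

	fmt.Println("containers found after writers finish:", m.reconcile(names))
}
```

Removing the `m.Lock()` and `m.Unlock()` calls in `reconcile` reintroduces the data race; running the sketch with `go run -race` should then report it.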
@k8s-ci-robot k8s-ci-robot added the release-note, kind/bug, size/XS, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, approved, area/kubelet, and sig/node labels and removed the do-not-merge/needs-sig label Dec 21, 2020
Contributor Author

klueska commented Dec 21, 2020

/triage accepted

@@ -402,6 +402,7 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
continue
}

m.Lock()
Member

nit: I think this would be slightly easier to follow if we called m.Lock() twice, each time before invoking the methods accessing containerMap. Just easier to track where and why the lock is needed.

Contributor Author

I'm actually happy with the way it is currently (I've been stress testing it like this in a soak environment over the last 2 weeks without issue). If you feel strongly though, I can add small helper functions to do the locking around each call.
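
For illustration only, the "small helper functions" alternative mentioned above might look roughly like the sketch below. Nothing here is taken from the kubelet source: the simplified `containerMap` type and the `lockedAdd`, `lockedRemove`, and `lockedGetID` names are hypothetical. Each helper takes the lock around exactly one map access, which makes it explicit where and why the lock is needed.

```go
package main

import (
	"fmt"
	"sync"
)

// containerMap is a simplified, hypothetical map from container ID to
// container name.
type containerMap map[string]string

// manager is a hypothetical stand-in that owns the lock and the map.
type manager struct {
	sync.Mutex
	containerMap containerMap
}

// lockedAdd holds the lock for the duration of one write.
func (m *manager) lockedAdd(id, name string) {
	m.Lock()
	defer m.Unlock()
	m.containerMap[id] = name
}

// lockedRemove holds the lock for the duration of one delete.
func (m *manager) lockedRemove(id string) {
	m.Lock()
	defer m.Unlock()
	delete(m.containerMap, id)
}

// lockedGetID holds the lock for the duration of one iterating lookup.
func (m *manager) lockedGetID(name string) (string, error) {
	m.Lock()
	defer m.Unlock()
	for id, n := range m.containerMap {
		if n == name {
			return id, nil
		}
	}
	return "", fmt.Errorf("container %q not found", name)
}

func main() {
	m := &manager{containerMap: make(containerMap)}
	m.lockedAdd("id-1", "app")

	id, err := m.lockedGetID("app")
	fmt.Println(id, err)

	m.lockedRemove("id-1")
	_, err = m.lockedGetID("app")
	fmt.Println(err)
}
```

One trade-off of this shape: `sync.Mutex` is not reentrant, so these helpers must not be called from code that already holds the manager's lock, and a lookup followed by a removal is no longer a single critical section.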

nolancon commented Jan 4, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 4, 2021
Member

odinuge commented Jan 4, 2021

The change looks good to me, but I agree with the comment from @andrewsykim, so I'll let @klueska look at that.

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold label Jan 4, 2021
Contributor Author

klueska commented Jan 5, 2021

I chatted with @andrewsykim offline and, while I tend to agree with him on some level, I also think it’s pretty clear as is (and I’ve been stress testing it in its current form for 2 weeks without issue). He has agreed that his issue is non-blocking and we've agreed to merge this as-is.

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold label Jan 5, 2021
Contributor

@hasheddan hasheddan left a comment

/lgtm

@klueska I'll wait until this one merges to go through cherry-picks 👍 Feel free to ping me if I don't follow up promptly.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasheddan, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 4dc3a42 into kubernetes:master Jan 5, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 5, 2021
k8s-ci-robot added a commit that referenced this pull request Jan 5, 2021
…7-upstream-release-1.19

Automated cherry pick of #97427: Fix bug in CPUManager with race on map access
k8s-ci-robot added a commit that referenced this pull request Jan 5, 2021
…7-upstream-release-1.20

Automated cherry pick of #97427: Fix bug in CPUManager with race on map access
k8s-ci-robot added a commit that referenced this pull request Jan 6, 2021
…7-upstream-release-1.18

Automated cherry pick of #97427: Fix bug in CPUManager with race on map access