New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug in CPUManager with race on container map access #97427
Fix bug in CPUManager with race on container map access #97427
Conversation
Signed-off-by: Kevin Klues <kklues@nvidia.com>
/triage accepted |
@@ -402,6 +402,7 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec | |||
continue | |||
} | |||
|
|||
m.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this would be slightly easier to follow if we called m.Lock()
twice, each time before invoking the methods accessing containerMap
. Just easier to track where and why the lock is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually happy with the way it is currently (I've been stress testing it like this in a soak environment over the last 2 weeks without issue). If you feel strongly though, I can add small helper functions to do the locking around each call.
/lgtm |
Change looks good to me, but agree with comment from @andrewsykim, so i'll let @klueska look at that. /lgtm |
I chatted with @andrewsykim offline and, while I tend to agree with him on some level, I also think it’s pretty clear as is (and I’ve been stress testing it in its current form for 2 weeks without issue). He has agreed that his issue is non-blocking and we've agreed to merge this as-is. /unhold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
@klueska I'll wait until this one merges to go through cherry-picks 👍 Feel free to ping me if I don't follow up promptly.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hasheddan, klueska The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…7-upstream-release-1.19 Automated cherry pick of #97427: Fix bug in CPUManager with race on map acccess
…7-upstream-release-1.20 Automated cherry pick of #97427: Fix bug in CPUManager with race on map acccess
…7-upstream-release-1.18 Automated cherry pick of #97427: Fix bug in CPUManager with race on map acccess
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes a sporadic bug that causes the
kubelet
to crash: