fix nodelifecyle controller not add NoExecute taint bug #96876
/ok-to-test |
/triage accepted |
/retest |
/area node-lifecycle |
thank you for the detail. i have to admit the issue upon review was hard to follow without it, so the test and detail are much appreciated! /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: derekwaynecarr, howieyuen The full list of commands accepted by this bot can be found here. The pull request process is described here
…issing fix nodelifecyle controller not add NoExecute taint bug
1.18.12 has this issue too, can you cherry-pick to 1.18 ? |
Not sure if there's a reason that the cherry-pick PRs are not merged to 1.18-1.20. |
Cherry pick #96876 in controller to 1.18: fix nodelifecyle controller not add NoExecute taint bug
Cherry pick #96876 in controller to 1.20: fix nodelifecyle controller not add NoExecute taint bug
Cherry pick #96876 in controller to 1.19: fix nodelifecyle controller not add NoExecute taint bug
What type of PR is this?
/kind bug
What this PR does / why we need it:
PR #89059 tried to fix a reconcile problem, so every 5s `monitorNodeHealth()` runs `processTaintBaseEviction()` and adds nodes to `zoneNoExecuteTainter` when their status is `Unknown` or `False`.
However, each time untainted nodes are added to `RateLimitedTimedQueue`, that PR deletes them first so that they can enter the queue every time. The delete step uses an additional func, `SetRemove()`, instead of `Remove()`.
The taintManager does its work in `doNoExecuteTaintingPass()`, and its QPS defaults to `0.1`, so the following scenario can leave nodes without the `NoExecute` taint forever, unless kube-controller-manager restarts and rebuilds its queue data:
First, the `UniqueQueue` inside `RateLimitedTimedQueue` holds three nodes: node0, node1, node2.
Next, `doNoExecuteTaintingPass()` does not finish the taint job before `monitorNodeHealth()` runs in the next period and enqueues the 3 nodes again, with the set data removed but the queue data left behind. The `UniqueQueue` inside `RateLimitedTimedQueue` now still carries the old queue entries for node0, node1, node2.
Then `doNoExecuteTaintingPass()` continues to process these untainted nodes. It fetches data one by one from the queue, not from the set, so it starts with the dirty data. Suppose node0 returns to normal before its duplicated entry "node0" is handled (and likewise node1 and node2). The `ActionFunc()` in `nc.zoneNoExecuteTainter[k].Try(fn ActionFunc)` then returns true, and `Try()` calls `q.queue.RemoveFromQueue(val.Value)`, but the entry cannot be removed because its value no longer exists in the set. The head of the queue is therefore never removed, the next cycle gets the same dirty data, and the taint job is stuck forever.
Which issue(s) this PR fixes:
Fix: #94183 #96183
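The scenario described under "What this PR does" can be reproduced with a small model. This is only a sketch, not the controller's code: `uniqueQueue`, `removeFirst` and `simulate` are hypothetical names, the queue is a plain FIFO slice rather than the real time-ordered structure, and locking and rate limiting are left out. It also assumes that a node returning to normal is taken out with `Remove()`, which the description above does not spell out.

```go
package main

import "fmt"

// uniqueQueue is a simplified, hypothetical stand-in for UniqueQueue:
// a FIFO slice plus a membership set (no lock, no timestamps).
type uniqueQueue struct {
	queue []string
	set   map[string]bool
}

func newUniqueQueue() *uniqueQueue {
	return &uniqueQueue{set: map[string]bool{}}
}

// Add enqueues value unless the set already contains it.
func (q *uniqueQueue) Add(value string) bool {
	if q.set[value] {
		return false
	}
	q.queue = append(q.queue, value)
	q.set[value] = true
	return true
}

// SetRemove drops value from the set only; any queue entry stays behind.
func (q *uniqueQueue) SetRemove(value string) {
	delete(q.set, value)
}

// removeFirst deletes the first queue entry equal to value, if any.
func (q *uniqueQueue) removeFirst(value string) bool {
	for i, v := range q.queue {
		if v == value {
			q.queue = append(q.queue[:i], q.queue[i+1:]...)
			return true
		}
	}
	return false
}

// Remove drops value from both the set and the queue.
func (q *uniqueQueue) Remove(value string) bool {
	if !q.set[value] {
		return false
	}
	delete(q.set, value)
	q.removeFirst(value)
	return true
}

// RemoveFromQueue drops value from the queue, but only if the set still has it.
func (q *uniqueQueue) RemoveFromQueue(value string) bool {
	if !q.set[value] {
		return false
	}
	return q.removeFirst(value)
}

// simulate replays the scenario with the given "delete first" step and
// returns whatever is left in the queue at the end.
func simulate(deleteFirst func(q *uniqueQueue, node string)) []string {
	q := newUniqueQueue()
	nodes := []string{"node0", "node1", "node2"}

	// Period 1: the unhealthy nodes are enqueued.
	for _, n := range nodes {
		q.Add(n)
	}
	// Period 2: the taint pass has not run yet; delete first, then enqueue again.
	for _, n := range nodes {
		deleteFirst(q, n)
		q.Add(n)
	}
	// Assumption: the nodes recover and are taken out with Remove().
	for _, n := range nodes {
		q.Remove(n)
	}
	// Taint pass: the action succeeds for a healthy node, so the head is
	// handed to RemoveFromQueue. Bounded so that a stuck head cannot loop forever.
	for i := 0; i < 10 && len(q.queue) > 0; i++ {
		q.RemoveFromQueue(q.queue[0])
	}
	return q.queue
}

func main() {
	buggy := simulate(func(q *uniqueQueue, n string) { q.SetRemove(n) })
	fixed := simulate(func(q *uniqueQueue, n string) { q.Remove(n) })
	fmt.Println("left in queue with SetRemove:", buggy)
	fmt.Println("left in queue with Remove:", fixed)
}
```

In this model the `SetRemove()` variant ends with three stale entries at the head of the queue that `RemoveFromQueue()` refuses to drop, while the `Remove()` variant ends with an empty queue.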
Special notes for your reviewer:
I wrote a helper func to print the values inside `RateLimitedTimedQueue`. The unit test log shows the dirty data left in the queue field.
Before changing `SetRemove()` to `Remove()`:
After:
Does this PR introduce a user-facing change?