New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover CSI volumes from dangling attachments #96617
Conversation
Hi @yuga711. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @msau42 |
err = adc.actualStateOfWorld.MarkVolumeAsUncertain(volumeName, volumeSpec, nodeName) | ||
if err != nil { | ||
klog.Errorf("MarkVolumeAsUncertain fail to add the volume %q (%q) to ASW. err: %s", volumeName, nodeName, err) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a nit: is this continue needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
/assign |
/ok-to-test |
/assign @jingxu97 |
Have we considered if we can use following error from CSI:
and use it to return dangling volume errors so as all plugins that implement |
@kubernetes/sig-storage-pr-reviews I think we'll need to look at both solutions. Relying on attach failure code means that the pod had to be rescheduled. |
Yes, that's right. We need both solutions as they have their own limitations. Attach failure code approach will not fix until next attach and may not fix in multi-attach case. |
@yuga711: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Here I think the code does not rely on VA attachment status. As long as there is a VA, no matter the status is attached or not, the code will add volume as uncertain status to make sure following attach or detach will be triffered. |
The tests seems to be failing because a function name (IsVolumeAttachedToNode) was replaced after this PR was posted. I will need to rework a bit and will repost the change. |
/retest Review the full test history for this PR. Silence the bot with an |
/lgtm cancel to fix merge conflict |
Change-Id: I72105d67d8a4069ab19bfa4638a7ac365cf4194c
/lgtm |
consider to cherrypick this PR? |
I think that's a good idea. @gnufied any concerns? |
@msau42 should be fine to cherry-pick. |
Change-Id: I72105d67d8a4069ab19bfa4638a7ac365cf4194c
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR recovers from certain type of dangling CSI attachments by detaching them from the undesired nodes. In the current design, if kube-controller-manager crashes/restarts while attach/detach is in-progress, the volume may be left attached to an undesired node (eg., when the pod using the volume is deleted during the restart window). This PR fixes such dangling attachments by using the presence of VolumeAttachment objects during Attach/Detach Controller initialization.
This solution relies on the existence of the VA object. While this solution would recover dangling volumes caused because of certain races, it will not recover from other dangling scenarios where VA object itself is gone
Which issue(s) this PR fixes:
Partially addresses #94912
Partially addresses #80488
Fixes #77324
Special notes for your reviewer:
Verified the fix through reproduce steps in #94912
Does this PR introduce a user-facing change?:
NONE
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: