Fix dangling volumes from nodes not tracked by attach detach controller #96689
Conversation
Force-pushed from d182ed3 to 655917e.
@kubernetes/sig-storage-pr-reviews
staging/src/k8s.io/legacy-cloud-providers/vsphere/vsphere_util.go (outdated review thread, resolved)
staging/src/k8s.io/legacy-cloud-providers/vsphere/vsphere_volume_map.go (outdated review thread, resolved)
for _, nodeName := range nodeDetails {
	// if given node is not in node volume map
	if !vs.vsphereVolumeMap.CheckForNode(nodeName) {
		nodeInfo, err := vs.nodeManager.GetNodeInfo(nodeName)
I'm trying to understand where/how we would learn the name of an unknown node. It doesn't look to me like you are listing all VMs in a datacenter (that would be a lot!)
I will add some additional details; I have updated the PR description. Basically this is a follow-up to #96224: I found a bug where, if a node has no pods with volumes scheduled on it, we don't run VerifyVolumesAreAttached
on that node, and as a result any dangling volume left on such a node is never detached.
So strictly speaking this PR isn't detaching volumes from "unknown" nodes, but rather from nodes which aren't in the attach-detach controller's cache (because there are no pods with volumes on those nodes).
Also, I am only listing VMs that are still part of the k8s cluster.
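For context, here is a minimal sketch of the kind of membership check vsphereVolumeMap.CheckForNode performs. The type, field names, and use of plain strings are assumptions for illustration (the real code keys on k8stypes.NodeName); this is not the actual vsphere_volume_map.go implementation:

package main

import (
	"fmt"
	"sync"
)

// volumeMap is a stand-in for the attach-detach controller's view of
// which nodes already had their volumes verified. Nodes missing from
// this map are exactly the ones this PR targets.
type volumeMap struct {
	lock  sync.RWMutex
	nodes map[string]bool
}

// CheckForNode reports whether the given node is already tracked.
func (m *volumeMap) CheckForNode(nodeName string) bool {
	m.lock.RLock()
	defer m.lock.RUnlock()
	return m.nodes[nodeName]
}

func main() {
	m := &volumeMap{nodes: map[string]bool{"node-a": true}}
	// node-b has no pods with volumes, so it was never recorded and
	// would previously escape the VerifyVolumesAreAttached pass.
	fmt.Println(m.CheckForNode("node-a"), m.CheckForNode("node-b")) // true false
}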
/cc
@@ -608,6 +609,119 @@ func (vs *VSphere) checkDiskAttached(ctx context.Context, nodes []k8stypes.NodeN
	return nodesToRetry, nil
}

// BuildMissingVolumeNodeMap builds a map of volumes and nodes which are not known to attach detach controller |
Is this to detect cases where a k8s volume is attached to nodes outside the current k8s cluster?
no - I am sorry, I chose the wrong wording in my original commit message. This case is for detaching volumes from nodes which are still part of the k8s cluster but aren't inside the attach/detach controller's cache (or ASOW, its actual state of the world) because those nodes have no pods with volumes scheduled on them.
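To make the intent concrete, here is a rough sketch of such a verification pass. The function shape, the tracked map, and the check callback are assumptions for illustration, not the PR's actual BuildMissingVolumeNodeMap:

package main

import (
	"context"
	"fmt"
)

// verifyUntrackedNodes walks every node still registered in the
// cluster and runs a volume check on those the attach-detach
// controller does not track, so dangling disks on pod-less nodes are
// no longer skipped.
func verifyUntrackedNodes(ctx context.Context, allNodes []string, tracked map[string]bool, check func(context.Context, string) error) {
	for _, node := range allNodes {
		if tracked[node] {
			// Already covered by the regular VerifyVolumesAreAttached pass.
			continue
		}
		if err := check(ctx, node); err != nil {
			fmt.Printf("volume check failed for node %s: %v\n", node, err)
		}
	}
}

func main() {
	tracked := map[string]bool{"node-a": true}
	all := []string{"node-a", "node-b"}
	verifyUntrackedNodes(context.Background(), all, tracked, func(_ context.Context, node string) error {
		fmt.Println("checking disks on", node) // only node-b is checked
		return nil
	})
}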
Force-pushed from 655917e to 46a57b0.
/retest
go func(nodes []k8stypes.NodeName) {
	err := vs.checkNodeDisks(ctx, nodeNames)
	if err != nil {
		klog.Errorf("Failed to check disk attached for nodes: %+v. err: %+v", nodes, err)
I don't like throwing away errors like this; they should get propagated to the caller.
You mean propagate up to the ADC reconciler? But since this reconciliation is asynchronous and not part of a user action, we can't report this as an event. So even the ADC is just going to log the error and nothing else.
Yes, up to ADC (or whoever calls DisksAreAttached), so if we want to report errors later, we just process DisksAreAttached errors. BTW, DisksAreAttached itself already collects / reports errors from goroutines when checking the "known" nodeVolumes, so you can copy it from there.
I can see the point that the caller did not ask for a check of unrelated nodes and is not interested in their errors.
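For reference, the usual Go pattern for surfacing goroutine errors to the caller instead of only logging them - a generic sketch of the technique (function and parameter names are made up here), not the actual DisksAreAttached code:

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// checkNodesConcurrently fans out one goroutine per batch of nodes and
// collects every error, so the caller decides how to report them.
func checkNodesConcurrently(ctx context.Context, batches [][]string, check func(context.Context, []string) error) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, batch := range batches {
		wg.Add(1)
		go func(nodes []string) {
			defer wg.Done()
			if err := check(ctx, nodes); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("nodes %v: %w", nodes, err))
				mu.Unlock()
			}
		}(batch)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every batch succeeded
}

func main() {
	err := checkNodesConcurrently(context.Background(),
		[][]string{{"node-a"}, {"node-b"}},
		func(_ context.Context, nodes []string) error {
			if nodes[0] == "node-b" {
				return errors.New("disk query failed")
			}
			return nil
		})
	fmt.Println(err) // nodes [node-b]: disk query failed
}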
staging/src/k8s.io/legacy-cloud-providers/vsphere/vsphere_util.go (outdated review thread, resolved)
/triage accepted
/retest
/kind bug
/lgtm
/assign @cheftako
/assign @jingxu97
Starting in v1.21, we are disallowing feature PRs into the built-in legacy cloud providers (i.e. k8s.io/legacy-cloud-providers). Any kind/feature PRs going forward will have to be approved by SIG Cloud Provider; see this mailing list thread for more details: https://groups.google.com/g/kubernetes-sig-cloud-provider/c/UkG46pNc6Cw. I note this is marked as a bug, but I also noticed a lot of new code. I wanted to ensure that this does not involve any feature development.
@cheftako the PR description clarifies that it is indeed a follow-up fix to the bug I was trying to fix, because my original, already-merged PR only fixed the original bug partially.
Also, yes - this is a bit more code than I would have liked, but this was a relatively tricky issue to fix. vSphere CSI driver migration is still 3-4 releases away, and we would like the in-tree driver to work without bugs while that lands.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cheftako, gnufied, jsafrane. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@gnufied hello from the 1.21 release team! I'm not sure I captured the meaning of the release note as described. Typically, a bug fix's release note follows a standard format.
Would it be accurate to rephrase the PR description as the following? If so, would you please help us fill in the release note?
Fixes vsphere dangling volumes from nodes not tracked by Attach Detach Controller.
After opening #96224, I found a bug where, if a node has no pods with volumes running on it, we don't run the VerifyVolumesAreAttached check for that node, and hence the dangling-volume mechanism does not work for it.
This is a follow-up to fix that code and ensure that all known nodes are scanned periodically for volumes.
/sig storage