
Fix dangling volumes from nodes not tracked by attach detach controller #96689

Merged

Conversation

@gnufied (Member) commented Nov 18, 2020

Fixes vSphere dangling volumes from nodes not tracked by the Attach Detach Controller.

After opening #96224, I found a bug: if a node has no pods with volumes running on it, we never run the VerifyVolumesAreAttached check for that node, and hence the dangling-volume mechanism does not work for it.

This is a follow-up to fix that code and ensure that all known nodes are scanned periodically for volumes.
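
To make the intent concrete, here is a minimal sketch of the periodic scan this PR adds. The function and parameter names below are illustrative stand-ins, not the PR's exact identifiers:

    package sketch

    import (
        "context"

        k8stypes "k8s.io/apimachinery/pkg/types"
        "k8s.io/klog/v2"
    )

    // scanUntrackedNodes walks every node the cloud provider knows about,
    // not just the nodes the attach-detach controller is tracking. Nodes
    // absent from the tracked set never went through
    // VerifyVolumesAreAttached, so they are exactly the ones that can
    // still hold dangling volumes.
    func scanUntrackedNodes(ctx context.Context, knownNodes []k8stypes.NodeName,
        tracked map[k8stypes.NodeName]bool,
        verify func(context.Context, k8stypes.NodeName) error) {
        for _, nodeName := range knownNodes {
            if tracked[nodeName] {
                continue // already covered by the regular verification path
            }
            if err := verify(ctx, nodeName); err != nil {
                klog.Errorf("dangling volume check failed for node %s: %v", nodeName, err)
            }
        }
    }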

/sig storage

Release note:
Detach volumes from vSphere nodes not tracked by attach-detach controller

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2020
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Nov 18, 2020
@gnufied force-pushed the fix-unknown-node-dangling-vsphere branch from d182ed3 to 655917e on November 18, 2020 21:37
@msau42 (Member) commented Nov 19, 2020

@kubernetes/sig-storage-pr-reviews
cc @misterikkit @yuga711

for _, nodeName := range nodeDetails {
    // if given node is not in node volume map
    if !vs.vsphereVolumeMap.CheckForNode(nodeName) {
        nodeInfo, err := vs.nodeManager.GetNodeInfo(nodeName)


I'm trying to understand where/how we would learn the name of an unknown node. It doesn't look to me like you are listing all VMs in a datacenter (that would be a lot!)

gnufied (Member Author):

I will add some additional details; I updated the PR description. Basically this is a follow-up to #96224, because I found a bug where if a node has no pods with volumes scheduled on it, we don't run VerifyVolumesAreAttached on that node, and as a result any dangling volume left on such a node is never detached.

So strictly speaking this PR isn't detaching volumes from "unknown" nodes, but rather from nodes which aren't in the attach-detach controller's cache (because there are no pods with volumes on those nodes).

gnufied (Member Author):

Also, I am only listing VMs that are still part of the k8s cluster.
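
In other words, the scan is bounded by the cluster's registered node list rather than a datacenter-wide VM enumeration. A hedged sketch of that bound (the nodeLister interface and method names here are illustrative stand-ins, not the provider's real API):

    package sketch

    import (
        k8stypes "k8s.io/apimachinery/pkg/types"
    )

    // nodeLister stands in for the vSphere node manager, which only
    // tracks VMs that are registered as Kubernetes nodes.
    type nodeLister interface {
        RegisteredNodeNames() []k8stypes.NodeName
    }

    // untrackedNodes returns registered nodes that the attach-detach
    // controller has no volume state for. The scan is bounded by
    // cluster size, not by the number of VMs in the datacenter.
    func untrackedNodes(nm nodeLister, tracked map[k8stypes.NodeName]bool) []k8stypes.NodeName {
        var out []k8stypes.NodeName
        for _, name := range nm.RegisteredNodeNames() {
            if !tracked[name] {
                out = append(out, name)
            }
        }
        return out
    }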

@Jiawei0227 (Contributor):

/cc

@@ -608,6 +609,119 @@ func (vs *VSphere) checkDiskAttached(ctx context.Context, nodes []k8stypes.NodeN
return nodesToRetry, nil
}

// BuildMissingVolumeNodeMap builds a map of volumes and nodes which are not known to attach detach controller
Contributor:

Is this to detect cases where a k8s volume is attached to nodes outside the current k8s cluster?

gnufied (Member Author):

No - I am sorry, I chose the wrong wording in my original commit message. This case is for detaching volumes from nodes which are still part of the k8s cluster but aren't inside the attach/detach controller's cache (its actual state of the world), because those nodes have no pods with volumes scheduled on them.
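
For readers unfamiliar with that bookkeeping, a hedged sketch of what the CheckForNode call in the quoted diff implies (the type and field names are illustrative, not the PR's exact code):

    package sketch

    import (
        "sync"

        k8stypes "k8s.io/apimachinery/pkg/types"
    )

    // volumeMap sketches the per-node volume state implied by
    // CheckForNode: the provider records volume state per node, so
    // nodes missing from this map are exactly the registered nodes
    // the attach-detach controller never asked about.
    type volumeMap struct {
        mu    sync.RWMutex
        nodes map[k8stypes.NodeName]map[string]bool // node -> volume path -> attached
    }

    // CheckForNode reports whether any volume state is recorded for node.
    func (m *volumeMap) CheckForNode(node k8stypes.NodeName) bool {
        m.mu.RLock()
        defer m.mu.RUnlock()
        _, ok := m.nodes[node]
        return ok
    }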

@gnufied changed the title from "Fix dangling volumes from knowns not tracked by attach detach controller" to "Fix dangling volumes from nodes not tracked by attach detach controller" on Nov 19, 2020
@gnufied force-pushed the fix-unknown-node-dangling-vsphere branch from 655917e to 46a57b0 on November 19, 2020 02:32
@gnufied (Member Author) commented Nov 19, 2020

/retest

go func(nodes []k8stypes.NodeName) {
    err := vs.checkNodeDisks(ctx, nodes)
    if err != nil {
        klog.Errorf("Failed to check disk attached for nodes: %+v. err: %+v", nodes, err)
Member:

I don't like throwing away errors like this; the error should be propagated to the caller.

gnufied (Member Author):

You mean propagate up to the ADC reconciler? But since this reconciliation is asynchronous and not part of a user action, we can't report it as an event. So even the ADC is just going to log the error and do nothing else.

Member:

Yes, up to the ADC (or whoever calls DisksAreAttached); if we want to report errors later, we just process the DisksAreAttached errors. BTW, DisksAreAttached itself already collects and reports errors from goroutines when checking the "known" nodeVolumes, so you can copy that approach from there.
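
The pattern being referenced, collecting goroutine errors for the caller instead of logging and dropping them, looks roughly like this (a sketch; the batching and helper names are illustrative):

    package sketch

    import (
        "context"
        "fmt"
        "sync"

        k8stypes "k8s.io/apimachinery/pkg/types"
        utilerrors "k8s.io/apimachinery/pkg/util/errors"
    )

    // checkAllNodeDisks fans the per-batch disk checks out to goroutines
    // and returns an aggregate error to the caller, mirroring how
    // DisksAreAttached handles its "known" nodes, instead of logging and
    // dropping errors inside each goroutine.
    func checkAllNodeDisks(ctx context.Context, batches [][]k8stypes.NodeName,
        check func(context.Context, []k8stypes.NodeName) error) error {
        var (
            wg   sync.WaitGroup
            mu   sync.Mutex
            errs []error
        )
        for _, batch := range batches {
            wg.Add(1)
            go func(nodes []k8stypes.NodeName) {
                defer wg.Done()
                if err := check(ctx, nodes); err != nil {
                    mu.Lock()
                    errs = append(errs, fmt.Errorf("checking disks for nodes %v: %w", nodes, err))
                    mu.Unlock()
                }
            }(batch)
        }
        wg.Wait()
        return utilerrors.NewAggregate(errs) // nil when no errors occurred
    }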

Member:

I can see the point that the caller did not ask for a check of unrelated nodes and is not interested in their errors.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 19, 2020
@gnufied (Member Author) commented Nov 19, 2020

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 19, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 19, 2020
@gnufied (Member Author) commented Nov 19, 2020

/retest

@gnufied (Member Author) commented Nov 19, 2020

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 19, 2020
@jsafrane (Member):
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2020
@gnufied (Member Author) commented Dec 15, 2020

/assign @cheftako

@msau42 (Member) commented Dec 15, 2020

/assign @jingxu97

@cheftako (Member):
Starting in v1.21, we are disallowing feature PRs into the built-in legacy cloud providers (i.e. k8s.io/legacy-cloud-providers). Any kind/feature PRs going forward will have to be approved by SIG Cloud Provider. See this mailing list thread for more details: https://groups.google.com/g/kubernetes-sig-cloud-provider/c/UkG46pNc6Cw.

I note this is marked as a bug, but I also noticed a lot of new code. I wanted to ensure that this does not involve any feature development.

@gnufied (Member Author) commented Dec 16, 2020

@cheftako the PR description clarifies that it is indeed a follow-up fix to a bug I was trying to fix, because my original, already-merged PR only fixed the original bug partially:

Fixes vSphere dangling volumes from nodes not tracked by the Attach Detach Controller.
After opening #96224, I found a bug: if a node has no pods with volumes running on it, we never run the VerifyVolumesAreAttached check for that node, and hence the dangling-volume mechanism does not work for it.

This is a follow-up to fix that code and ensure that all known nodes are scanned periodically for volumes.

Also

I note this is marked as a bug but also noted a lot of new code. I wanted to ensure that this does not involve any feature development.

Yes, this is a bit more code than I would have liked, but this was a relatively tricky issue to fix. The vSphere CSI driver migration is still 3-4 releases away, and we would like the in-tree driver to work without bugs until that lands.

@cheftako (Member):
/approve

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2020
@k8s-ci-robot k8s-ci-robot merged commit fc43c80 into kubernetes:master Dec 16, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Dec 16, 2020
@wilsonehusin (Contributor):
@gnufied hello from the 1.21 release team! I'm not sure I capture the meaning of the release note as described. Typically, a bug-fix release note has the format:

Fix [bug]. The [component] now [new behavior description].

Would it be accurate to rephrase the PR description as the following? If so, would you please help us fill in the ??? and update the PR description? Thanks!

Fixes dangling vSphere volumes. The ??? now periodically scans nodes to ensure volumes are tracked by attach-detach controller.
