
Fix dangling volumes from nodes not tracked by attach detach controller #96689

Merged

Conversation

@gnufied (Member) commented Nov 18, 2020

Fixes vSphere dangling volumes from nodes not tracked by the Attach Detach Controller.

After opening #96224, I found a bug: if a node has no pods with volumes running on it, we never run the VerifyVolumesAreAttached check for that node, and hence the dangling-volume mechanism does not work for it.

This is a follow-up to fix that code and ensure that all known nodes are scanned periodically for volumes.
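
To make the intent concrete, here is a minimal sketch of the periodic scan this PR adds. The function and parameter names below are illustrative stand-ins, not the PR's exact identifiers:

    package sketch

    import (
        "context"

        k8stypes "k8s.io/apimachinery/pkg/types"
        "k8s.io/klog/v2"
    )

    // scanUntrackedNodes walks every node the cloud provider knows about,
    // not just the nodes the attach-detach controller is tracking. Nodes
    // absent from the tracked set never went through
    // VerifyVolumesAreAttached, so they are exactly the ones that can
    // still hold dangling volumes.
    func scanUntrackedNodes(ctx context.Context, knownNodes []k8stypes.NodeName,
        tracked map[k8stypes.NodeName]bool,
        verify func(context.Context, k8stypes.NodeName) error) {
        for _, nodeName := range knownNodes {
            if tracked[nodeName] {
                continue // already covered by the regular verification path
            }
            if err := verify(ctx, nodeName); err != nil {
                klog.Errorf("dangling volume check failed for node %s: %v", nodeName, err)
            }
        }
    }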

/sig storage

Release note:
Detach volumes from vSphere nodes not tracked by attach-detach controller

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2020
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Nov 18, 2020
@gnufied force-pushed the fix-unknown-node-dangling-vsphere branch from d182ed3 to 655917e on November 18, 2020 21:37
@msau42 (Member) commented Nov 19, 2020

@kubernetes/sig-storage-pr-reviews
cc @misterikkit @yuga711

for _, nodeName := range nodeDetails {
    // if given node is not in node volume map
    if !vs.vsphereVolumeMap.CheckForNode(nodeName) {
        nodeInfo, err := vs.nodeManager.GetNodeInfo(nodeName)


I'm trying to understand where/how we would learn the name of an unknown node. It doesn't look to me like you are listing all VMs in a datacenter (that would be a lot!)

gnufied (Member Author):

I will add some additional details; I updated the PR description. Basically this is a follow-up to #96224, because I found a bug where if a node has no pods with volumes scheduled on it, we don't run VerifyVolumesAreAttached on that node, and as a result any dangling volume left on such a node is never detached.

So strictly speaking this PR isn't detaching volumes from "unknown" nodes, but rather from nodes which aren't in the attach-detach controller's cache (because there are no pods with volumes on those nodes).

gnufied (Member Author):

Also, I am only listing VMs that are still part of the k8s cluster.
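
In other words, the scan is bounded by the cluster's registered node list rather than a datacenter-wide VM enumeration. A hedged sketch of that bound (the nodeLister interface and method names here are illustrative stand-ins, not the provider's real API):

    package sketch

    import (
        k8stypes "k8s.io/apimachinery/pkg/types"
    )

    // nodeLister stands in for the vSphere node manager, which only
    // tracks VMs that are registered as Kubernetes nodes.
    type nodeLister interface {
        RegisteredNodeNames() []k8stypes.NodeName
    }

    // untrackedNodes returns registered nodes that the attach-detach
    // controller has no volume state for. The scan is bounded by
    // cluster size, not by the number of VMs in the datacenter.
    func untrackedNodes(nm nodeLister, tracked map[k8stypes.NodeName]bool) []k8stypes.NodeName {
        var out []k8stypes.NodeName
        for _, name := range nm.RegisteredNodeNames() {
            if !tracked[name] {
                out = append(out, name)
            }
        }
        return out
    }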

@Jiawei0227 (Contributor):

/cc

@@ -608,6 +609,119 @@ func (vs *VSphere) checkDiskAttached(ctx context.Context, nodes []k8stypes.NodeN
return nodesToRetry, nil
}

// BuildMissingVolumeNodeMap builds a map of volumes and nodes which are not known to attach detach controller
Contributor:

Is this to detect cases where a k8s volume is attached to nodes outside the current k8s cluster?

gnufied (Member Author):

No - I am sorry, I chose the wrong wording in my original commit message. This case is for detaching volumes from nodes which are still part of the k8s cluster but aren't inside the attach/detach controller's cache (its actual state of the world), because those nodes have no pods with volumes scheduled on them.
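
For readers unfamiliar with that bookkeeping, a hedged sketch of what the CheckForNode call in the quoted diff implies (the type and field names are illustrative, not the PR's exact code):

    package sketch

    import (
        "sync"

        k8stypes "k8s.io/apimachinery/pkg/types"
    )

    // volumeMap sketches the per-node volume state implied by
    // CheckForNode: the provider records volume state per node, so
    // nodes missing from this map are exactly the registered nodes
    // the attach-detach controller never asked about.
    type volumeMap struct {
        mu    sync.RWMutex
        nodes map[k8stypes.NodeName]map[string]bool // node -> volume path -> attached
    }

    // CheckForNode reports whether any volume state is recorded for node.
    func (m *volumeMap) CheckForNode(node k8stypes.NodeName) bool {
        m.mu.RLock()
        defer m.mu.RUnlock()
        _, ok := m.nodes[node]
        return ok
    }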

@gnufied changed the title from "Fix dangling volumes from knowns not tracked by attach detach controller" to "Fix dangling volumes from nodes not tracked by attach detach controller" on Nov 19, 2020
@gnufied force-pushed the fix-unknown-node-dangling-vsphere branch from 655917e to 46a57b0 on November 19, 2020 02:32
@gnufied (Member Author) commented Nov 19, 2020

/retest

go func(nodes []k8stypes.NodeName) {
    err := vs.checkNodeDisks(ctx, nodes)
    if err != nil {
        klog.Errorf("Failed to check disk attached for nodes: %+v. err: %+v", nodes, err)
Member:

I don't like throwing away errors like this; the error should be propagated to the caller.

gnufied (Member Author):

You mean propagate up to the ADC reconciler? But since this reconciliation is asynchronous and not part of a user action, we can't report it as an event. So even the ADC is just going to log the error and do nothing else.

Member:

Yes, up to the ADC (or whoever calls DisksAreAttached); if we want to report errors later, we just process the DisksAreAttached errors. BTW, DisksAreAttached itself already collects and reports errors from goroutines when checking the "known" nodeVolumes, so you can copy that approach from there.
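
The pattern being referenced, collecting goroutine errors for the caller instead of logging and dropping them, looks roughly like this (a sketch; the batching and helper names are illustrative):

    package sketch

    import (
        "context"
        "fmt"
        "sync"

        k8stypes "k8s.io/apimachinery/pkg/types"
        utilerrors "k8s.io/apimachinery/pkg/util/errors"
    )

    // checkAllNodeDisks fans the per-batch disk checks out to goroutines
    // and returns an aggregate error to the caller, mirroring how
    // DisksAreAttached handles its "known" nodes, instead of logging and
    // dropping errors inside each goroutine.
    func checkAllNodeDisks(ctx context.Context, batches [][]k8stypes.NodeName,
        check func(context.Context, []k8stypes.NodeName) error) error {
        var (
            wg   sync.WaitGroup
            mu   sync.Mutex
            errs []error
        )
        for _, batch := range batches {
            wg.Add(1)
            go func(nodes []k8stypes.NodeName) {
                defer wg.Done()
                if err := check(ctx, nodes); err != nil {
                    mu.Lock()
                    errs = append(errs, fmt.Errorf("checking disks for nodes %v: %w", nodes, err))
                    mu.Unlock()
                }
            }(batch)
        }
        wg.Wait()
        return utilerrors.NewAggregate(errs) // nil when no errors occurred
    }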

Member:

I can see the point that the caller did not ask for a check of unrelated nodes and is not interested in their errors.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 19, 2020
@gnufied (Member Author) commented Nov 19, 2020

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 19, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 19, 2020
@gnufied (Member Author) commented Nov 19, 2020

/retest

@gnufied (Member Author) commented Nov 19, 2020

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 19, 2020
@jsafrane (Member):
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2020
@gnufied (Member Author) commented Dec 15, 2020

/assign @cheftako

@msau42 (Member) commented Dec 15, 2020

/assign @jingxu97

@cheftako (Member):
Starting in v1.21, we are disallowing feature PRs into the built-in legacy cloud providers (i.e. k8s.io/legacy-cloud-providers). Any kind/feature PRs going forward will have to be approved by SIG Cloud Provider. See this mailing list thread for more details: https://groups.google.com/g/kubernetes-sig-cloud-provider/c/UkG46pNc6Cw.

I note this is marked as a bug, but I also noticed a lot of new code. I wanted to ensure that this does not involve any feature development.

@gnufied (Member Author) commented Dec 16, 2020

@cheftako the PR description clarifies that it is indeed a follow-up fix to a bug I was trying to fix, because my original, already-merged PR only fixed the original bug partially:

Fixes vSphere dangling volumes from nodes not tracked by the Attach Detach Controller.
After opening #96224, I found a bug: if a node has no pods with volumes running on it, we never run the VerifyVolumesAreAttached check for that node, and hence the dangling-volume mechanism does not work for it.

This is a follow-up to fix that code and ensure that all known nodes are scanned periodically for volumes.

Also

I note this is marked as a bug but also noted a lot of new code. I wanted to ensure that this does not involve any feature development.

Yes, this is a bit more code than I would have liked, but this was a relatively tricky issue to fix. The vSphere CSI driver migration is still 3-4 releases away, and we would like the in-tree driver to work without bugs until that lands.

@cheftako (Member):
/approve

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 16, 2020
@k8s-ci-robot k8s-ci-robot merged commit fc43c80 into kubernetes:master Dec 16, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Dec 16, 2020
@wilsonehusin (Contributor):
@gnufied hello from the 1.21 release team! I'm not sure I capture the meaning of the release note as described. Typically, a bug-fix release note has the format:

Fix [bug]. The [component] now [new behavior description].

Would it be accurate to rephrase the PR description as the following? If so, would you please help us fill in the ??? and update the PR description? Thanks!

Fixes dangling vSphere volumes. The ??? now periodically scans nodes to ensure volumes are tracked by attach-detach controller.
