Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use force umount for nfs volumes #96844

Merged
merged 2 commits into from Dec 18, 2020

Conversation

gnufied
Copy link
Member

@gnufied gnufied commented Nov 25, 2020

Fixes #96678

/sig storage

Use force unmount for NFS volumes if regular mount fails after 1 minute timeout

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 25, 2020
@gnufied
Copy link
Member Author

gnufied commented Nov 25, 2020

/retest

@jingxu97
Copy link
Contributor

One concern is unmount timeout might caused by different reasons, e.g.,

  1. NFS server is down (lost network connection with server)
  2. Some other process are using this dir

In case the 1, the recommended way is to use unmount with force. But how about the second case?

Normally the volume is only used by associated Pods. The workflow is to kill the containers which are using the volume, and then cleanup volumes. So in this aspect, maybe case 2 should not happen?

@jsafrane
Copy link
Member

  1. Some other process are using this dir

It is not possible to unmount a directory if something uses it, even with -f. You still get device is busy.

@@ -61,7 +62,8 @@ var _ volume.PersistentVolumePlugin = &nfsPlugin{}
var _ volume.RecyclableVolumePlugin = &nfsPlugin{}

const (
nfsPluginName = "kubernetes.io/nfs"
nfsPluginName = "kubernetes.io/nfs"
unMountTimeout = time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to double check whether this is optimal timeout setting in this case..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no 100% sure way to know the answer. We have deployed some CSI driver with this value and it appears to be working fine. Unlike block devices - NFS volumes don't have to fsync on umount, so I think it should be fine.

klog.Warningf("Warning: Unmount skipped because path does not exist: %v", mountPath)
return nil
}
corruptedMnt := IsCorruptedMnt(pathErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, pathExists == true. Based on the logic of PathExists, only two cases

  1. pathExists and error == nil
  2. if error != nil, then is must be corruptedMnt == true

So I think there is no need to check IsCorruptedMnt again here ..

Copy link
Member Author

@gnufied gnufied Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, good point. but I think there is a logic error in this code.

  1. If pathExists is false and pathErr is nil. then and then only it should skip unmounting.
  2. If pathExists is false but pathErr is somehow non-nil, then unmounting should not succeed and should result in error and I think we have had this bug in the code for awhile now. :(

I have pushed an update that fixes this PTAL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still feel there are some cleanup can be done related to the checking before and after actual unmount execution and path removal. Especially the checking logic before actually executing unmount or dir deletion has some issues. We have seen issues around this with some kind of corrupted dir (even though dir still exist, but PathExists show it does not..). Also since pathExists has redundant check on IsCorruptedMnt.

But this further cleanup can be done in a different PR. For this one, let's focus on adding Force option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you say - PathExists has a redundant check of IsCorruptedMnt? But if it was - then I think it was always there and I am not changing any of that code. If anything - this PR I think fixes the bug with ignoring errors from PathExists check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. for this change it is good for me.

@gnufied
Copy link
Member Author

gnufied commented Dec 15, 2020

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2020
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 15, 2020
@gnufied
Copy link
Member Author

gnufied commented Dec 15, 2020

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Dec 15, 2020
@gnufied
Copy link
Member Author

gnufied commented Dec 15, 2020

/assign @jingxu97

var notMnt bool
var err error
if !corruptedMnt {
_, err = removePathIfNotMountPoint(mountPath, mounter, extensiveMountPointCheck)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here as long as pathErr != nil (previously already check pathExists == true) I think you can use removePathIfNotMountPoint

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this comment. You have commented on a outdated diff and I have changed this code.

@gnufied
Copy link
Member Author

gnufied commented Dec 15, 2020

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 15, 2020
@jingxu97
Copy link
Contributor

CleanupMountWithForce and CleanupMountPoint should have the exact logic except the former uses mounter.UnmountWithForce and latter uses mounter.Unmount() function. So do you think passing a parameter option e.g. "bool useForceUnmount" passed to CleanupMountPoint() function is easier?

@gnufied
Copy link
Member Author

gnufied commented Dec 16, 2020

CleanupMountPoint function is part of public interface of mount utils package. It is used in many cases in k8s and potentially outside core k8s. I did not wanted to break any existing usage if I could help. But also it is just one boolean flag, we also have to potentially supply timeout. I think it is bit more cleaner to have separate function for this, so as we don't break existing usage.

@gnufied
Copy link
Member Author

gnufied commented Dec 17, 2020

/retest

}

func forceUmount(path string) error {
cmd := exec.Command("umount", "-f", path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so should we consider to have a timeout even for force unmount?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same question for normal Unmount function

Copy link
Member Author

@gnufied gnufied Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same question for normal Unmount function.

For normal unmount it is bit problematic to change behaviour for all existing drivers. As we know that some storage types flush pending writes when a volume is unmounted - and hence unmount time could vary and it is not safe to kill unmount process when this happens. So I think we never want to use a universal timeout for normal unmount function for all drivers.

so should we consider to have a timeout even for force unmount?

force works best for unreachable NFS system. It is documented directly in the man page. But it is not guaranteed to not hang and as far as I know if it hangs in other cases, it could be hanging for a legitimate reason (such as volume being used or something else) and in those cases using a timeout will basically kill the unmount process and it could be problematic. So, I think I would not want to use a timeout for force unmount. It could result in a leaked goroutine but that is okay (it is okay because we are not going to retry unmount if previous unmount is hung), we always leaked goroutines in this code path.

@jingxu97
Copy link
Contributor

just a coupe of questions which can be followed up later.

This change is good for me.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2020
@jingxu97
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jingxu97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 18, 2020
@gnufied
Copy link
Member Author

gnufied commented Dec 18, 2020

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use force unmount for nfs if regular unmount fails
4 participants