New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use force umount for nfs volumes #96844
Conversation
b3eea65
to
4f3cf33
Compare
/retest |
One concern is unmount timeout might caused by different reasons, e.g.,
In case the 1, the recommended way is to use unmount with force. But how about the second case? Normally the volume is only used by associated Pods. The workflow is to kill the containers which are using the volume, and then cleanup volumes. So in this aspect, maybe case 2 should not happen? |
It is not possible to unmount a directory if something uses it, even with |
@@ -61,7 +62,8 @@ var _ volume.PersistentVolumePlugin = &nfsPlugin{} | |||
var _ volume.RecyclableVolumePlugin = &nfsPlugin{} | |||
|
|||
const ( | |||
nfsPluginName = "kubernetes.io/nfs" | |||
nfsPluginName = "kubernetes.io/nfs" | |||
unMountTimeout = time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just want to double check whether this is optimal timeout setting in this case..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no 100% sure way to know the answer. We have deployed some CSI driver with this value and it appears to be working fine. Unlike block devices - NFS volumes don't have to fsync on umount, so I think it should be fine.
klog.Warningf("Warning: Unmount skipped because path does not exist: %v", mountPath) | ||
return nil | ||
} | ||
corruptedMnt := IsCorruptedMnt(pathErr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, pathExists == true. Based on the logic of PathExists, only two cases
- pathExists and error == nil
- if error != nil, then is must be corruptedMnt == true
So I think there is no need to check IsCorruptedMnt again here ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, good point. but I think there is a logic error in this code.
- If pathExists is false and
pathErr
is nil. then and then only it should skip unmounting. - If pathExists is false but
pathErr
is somehow non-nil, then unmounting should not succeed and should result in error and I think we have had this bug in the code for awhile now. :(
I have pushed an update that fixes this PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still feel there are some cleanup can be done related to the checking before and after actual unmount execution and path removal. Especially the checking logic before actually executing unmount or dir deletion has some issues. We have seen issues around this with some kind of corrupted dir (even though dir still exist, but PathExists show it does not..). Also since pathExists has redundant check on IsCorruptedMnt.
But this further cleanup can be done in a different PR. For this one, let's focus on adding Force option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you say - PathExists
has a redundant check of IsCorruptedMnt
? But if it was - then I think it was always there and I am not changing any of that code. If anything - this PR I think fixes the bug with ignoring errors from PathExists
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. for this change it is good for me.
/triage accepted |
3fbc7f5
to
95852d7
Compare
/kind bug |
/assign @jingxu97 |
var notMnt bool | ||
var err error | ||
if !corruptedMnt { | ||
_, err = removePathIfNotMountPoint(mountPath, mounter, extensiveMountPointCheck) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here as long as pathErr != nil (previously already check pathExists == true) I think you can use removePathIfNotMountPoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this comment. You have commented on a outdated diff and I have changed this code.
/priority important-soon |
CleanupMountWithForce and CleanupMountPoint should have the exact logic except the former uses mounter.UnmountWithForce and latter uses mounter.Unmount() function. So do you think passing a parameter option e.g. "bool useForceUnmount" passed to CleanupMountPoint() function is easier? |
|
/retest |
} | ||
|
||
func forceUmount(path string) error { | ||
cmd := exec.Command("umount", "-f", path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so should we consider to have a timeout even for force unmount?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same question for normal Unmount function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same question for normal Unmount function.
For normal unmount it is bit problematic to change behaviour for all existing drivers. As we know that some storage types flush pending writes when a volume is unmounted - and hence unmount time could vary and it is not safe to kill unmount process when this happens. So I think we never want to use a universal timeout for normal unmount function for all drivers.
so should we consider to have a timeout even for force unmount?
force works best for unreachable NFS system. It is documented directly in the man page. But it is not guaranteed to not hang and as far as I know if it hangs in other cases, it could be hanging for a legitimate reason (such as volume being used or something else) and in those cases using a timeout will basically kill the unmount process and it could be problematic. So, I think I would not want to use a timeout for force unmount. It could result in a leaked goroutine but that is okay (it is okay because we are not going to retry unmount if previous unmount is hung), we always leaked goroutines in this code path.
just a coupe of questions which can be followed up later. This change is good for me. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied, jingxu97 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
Fixes #96678
/sig storage