New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add migrated field to storage_operation_duration_seconds metric #99050
Conversation
@Jiawei0227: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@@ -227,7 +227,8 @@ func (og *operationGenerator) GenerateVolumesAreAttachedFunc( | |||
} | |||
} | |||
|
|||
return nil, nil | |||
// It is hard to differentiate migrated status for all volumes for verify_volumes_are_attached_per_node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually do translation for the verify operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do translation when generate the VolumeSpec. So it does not matter if it is verify operation.
verify is a bulk operation thus is difficult to differentiate. It will check attachedVolumes(which is a collection of VolumeSpec), and these VolumeSpecs, it can be partial CSI migrated specs and some real CSI specs. So we cannot tell the difference.
@@ -47,7 +49,7 @@ var storageOperationMetric = metrics.NewHistogramVec( | |||
Buckets: []float64{.1, .25, .5, 1, 2.5, 5, 10, 15, 25, 50, 120, 300, 600}, | |||
StabilityLevel: metrics.ALPHA, | |||
}, | |||
[]string{"volume_plugin", "operation_name", "status"}, | |||
[]string{"volume_plugin", "operation_name", "status", "migrated"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you enhance
func getMigrationVolumeOpCounts(cs clientset.Interface, pluginName string) (opCounts, opCounts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the implementation of this here, some observation:
-
The idea of migration volume metrics checking is to compare the in-tree plugin metrics has not changed before/after the test execution. => which implies no in-tree operation is invoked.
-
This check does not utilize the corresponding CSI metrics at all. I think the reason might be
- The test can be execute parallel and it would be hard to compare accurate result.
- And also per the comments, some negative test cases might not emit any metrics.
-
This is done using the to-be-deprecated metric "storage_operation_status_count". So we need to replace it with the correct metric.
And it seems that my change of adding a migrated status does not fit in the picture as we are not checking the csi metrics at all. And even if we want to check it will be hard because the observation 2 I mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, maybe we need a todo to try to test this via unit tests.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Jiawei0227, msau42 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
1 similar comment
/retest |
Err: &context.DetailedErr, | ||
Migrated: &context.Migrated, | ||
} | ||
c.Err = &detailedErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! I will delete this in another PR
@@ -1558,7 +1559,7 @@ func (ctrl *PersistentVolumeController) provisionClaimOperation( | |||
|
|||
opComplete := util.OperationCompleteHook(plugin.GetPluginName(), "volume_provision") | |||
volume, err = provisioner.Provision(selectedNode, allowedTopologies) | |||
opComplete(&err) | |||
opComplete(volumetypes.CompleteFuncParam{Err: &err}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the migrated field be part of provision and delete calls as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For CSI Migration, if a plugin has been migrated, then provision and delete will be handle by csi-provisioner. So any call happens here is a non-csi migration call
completeFunc(&err) | ||
completeFunc(types.CompleteFuncParam{ | ||
Err: &err, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe less important but do we also want to pass in migrated info here for consistency? How hard is it to get the migrated field here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I might be wrong but this seems not to be a migrated operation? That being said, there is no difference in in-tree/CSI for this operation. So by default it will be migrated:false
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a "migrated" field to the existing
storage_operation_duration_seconds
metrics that are being exposed by kube-controller-mananger and kubelet.The
migrated
field will be used to determine if the underlying volume it is operating is a CSI migrated volume. It will help SRE to better monitor the status of the CSI migration feature.This is implemented by extend the opComplete method to record an extra metric field.
Which issue(s) this PR fixes:
Partially #98279
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig storage
/assign @msau42
/cc @verult