Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add migrated field to storage_operation_duration_seconds metric #99050

Merged
merged 4 commits into from Feb 17, 2021

Conversation

Jiawei0227
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a "migrated" field to the existing storage_operation_duration_seconds metrics that are being exposed by kube-controller-mananger and kubelet.
The migrated field will be used to determine if the underlying volume it is operating is a CSI migrated volume. It will help SRE to better monitor the status of the CSI migration feature.

This is implemented by extend the opComplete method to record an extra metric field.

Which issue(s) this PR fixes:

Partially #98279

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add `migrated` field into `storage_operation_duration_seconds` metric

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig storage
/assign @msau42
/cc @verult

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 13, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 13, 2021
@k8s-ci-robot
Copy link
Contributor

@Jiawei0227: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Feb 13, 2021
@@ -227,7 +227,8 @@ func (og *operationGenerator) GenerateVolumesAreAttachedFunc(
}
}

return nil, nil
// It is hard to differentiate migrated status for all volumes for verify_volumes_are_attached_per_node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually do translation for the verify operations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do translation when generate the VolumeSpec. So it does not matter if it is verify operation.

verify is a bulk operation thus is difficult to differentiate. It will check attachedVolumes(which is a collection of VolumeSpec), and these VolumeSpecs, it can be partial CSI migrated specs and some real CSI specs. So we cannot tell the difference.

@@ -47,7 +49,7 @@ var storageOperationMetric = metrics.NewHistogramVec(
Buckets: []float64{.1, .25, .5, 1, 2.5, 5, 10, 15, 25, 50, 120, 300, 600},
StabilityLevel: metrics.ALPHA,
},
[]string{"volume_plugin", "operation_name", "status"},
[]string{"volume_plugin", "operation_name", "status", "migrated"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you enhance

func getMigrationVolumeOpCounts(cs clientset.Interface, pluginName string) (opCounts, opCounts) {
to also check for the new field in the metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the implementation of this here, some observation:

  1. The idea of migration volume metrics checking is to compare the in-tree plugin metrics has not changed before/after the test execution. => which implies no in-tree operation is invoked.

  2. This check does not utilize the corresponding CSI metrics at all. I think the reason might be

    • The test can be execute parallel and it would be hard to compare accurate result.
    • And also per the comments, some negative test cases might not emit any metrics.
  3. This is done using the to-be-deprecated metric "storage_operation_status_count". So we need to replace it with the correct metric.

And it seems that my change of adding a migrated status does not fit in the picture as we are not checking the csi metrics at all. And even if we want to check it will be hard because the observation 2 I mentioned above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe we need a todo to try to test this via unit tests.

@msau42
Copy link
Member

msau42 commented Feb 17, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 17, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jiawei0227, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2021
@Jiawei0227
Copy link
Contributor Author

/retest

1 similar comment
@Jiawei0227
Copy link
Contributor Author

/retest

Err: &context.DetailedErr,
Migrated: &context.Migrated,
}
c.Err = &detailedErr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I will delete this in another PR

@k8s-ci-robot k8s-ci-robot merged commit aa28a35 into kubernetes:master Feb 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 17, 2021
@@ -1558,7 +1559,7 @@ func (ctrl *PersistentVolumeController) provisionClaimOperation(

opComplete := util.OperationCompleteHook(plugin.GetPluginName(), "volume_provision")
volume, err = provisioner.Provision(selectedNode, allowedTopologies)
opComplete(&err)
opComplete(volumetypes.CompleteFuncParam{Err: &err})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the migrated field be part of provision and delete calls as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CSI Migration, if a plugin has been migrated, then provision and delete will be handle by csi-provisioner. So any call happens here is a non-csi migration call

completeFunc(&err)
completeFunc(types.CompleteFuncParam{
Err: &err,
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe less important but do we also want to pass in migrated info here for consistency? How hard is it to get the migrated field here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I might be wrong but this seems not to be a migrated operation? That being said, there is no difference in in-tree/CSI for this operation. So by default it will be migrated:false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants