Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add csi_operations_seconds metrics on kubelet #98979

Merged
merged 1 commit into from Feb 19, 2021

Conversation

Jiawei0227
Copy link
Contributor

@Jiawei0227 Jiawei0227 commented Feb 11, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a new metrics called csi_operations_seconds for kubelet. It records the kubelet CSI operation to the driver.
Example:

csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="0.1"} 0
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="0.25"} 0
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="0.5"} 0
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="1"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="2.5"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="5"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="10"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="15"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="25"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="50"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="120"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="300"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="600"} 1
csi_operations_seconds_bucket{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true",le="+Inf"} 1
csi_operations_seconds_sum{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true"} 0.843569755
csi_operations_seconds_bucket_count{driver_name="pd.csi.storage.gke.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume",migrated="true"} 1

This is useful to get the operation metrics from Node side. Similarly, we have the csi_sidecar_operations_seconds which runs on the sidecar containers and that only records the controller side operation. With this new metric, we get a full picture of the metrics.

Which issue(s) this PR fixes:

#98279 is related

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add `csi_operations_seconds` metric on kubelet that exposes CSI operations duration and status for node CSI operations.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig storage
/assign @msau42
/cc @mattcary @saad-ali

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 11, 2021
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 11, 2021
@k8s-ci-robot
Copy link
Contributor

@Jiawei0227: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Feb 11, 2021
@Jiawei0227
Copy link
Contributor Author

/retest

@Jiawei0227
Copy link
Contributor Author

/retest

@Jiawei0227 Jiawei0227 force-pushed the kubelet_csi branch 3 times, most recently from 384c0fc to 9c79848 Compare February 17, 2021 19:24
@@ -80,6 +82,17 @@ var storageOperationEndToEndLatencyMetric = metrics.NewHistogramVec(
[]string{"plugin_name", "operation_name"},
)

var csiOperationsLatencyMetric = metrics.NewHistogramVec(
&metrics.HistogramOpts{
Subsystem: "csi_kubelet",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this as a const instead so it's clear what to use if more csi metrics are added.

And it doesn't look like we have any subsystems defined. I think it might be simpler to just have a "csi" subsystem? eg, the existing storage_operation_duration_seconds metric doesn't have one version for the master and one for the kubelet, the same metric is used both places. I think it would be a good idea to be consistent with that for CSI metrics too.

Unless there's some other context here for a csi_kubelet subsystem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am doing the sub_system thing here because the current sidecar CSI metric name is: csi_sidecar_operations_seconds, I dont think the sidecar part fits in here. So I am coming up with this name. I am fine with csi_operations_seconds here as well. But note that this metrics only exists for the kubelet part and kube-controller-manager is not exposing this metric at all.

https://github.com/kubernetes-csi/csi-lib-utils/blob/44b91e2a7b0af8199fc65c1142f664fd5f7abb96/metrics/metrics.go#L39

@mattcary
Copy link
Contributor

mattcary commented Feb 17, 2021 via email

@Jiawei0227 Jiawei0227 changed the title Add csi_kubelet_operations_seconds metrics on kubelet Add csi_operations_seconds metrics on kubelet Feb 17, 2021
@Jiawei0227
Copy link
Contributor Author

/retest

@msau42
Copy link
Member

msau42 commented Feb 19, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jiawei0227, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2021
@k8s-ci-robot k8s-ci-robot merged commit b2a5d67 into kubernetes:master Feb 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants