generic ephemeral volume: add metrics #99115
Conversation
Force-pushed from 6461c73 to a4174a7
/retest
/assign @gnufied
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"reason"},
)
Can you provide an example of how metrics appear in prometheus?
My tentative update of the KEP has such examples: https://github.com/kubernetes/enhancements/pull/2513/files
Is that what you were looking for or shall I add it here, for example to the commit message?
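For quick reference, a counter with a "reason" label would show up in the Prometheus text exposition roughly like this (the metric name, help text, and sample values here are invented for illustration; the KEP linked above has the authoritative examples):

```
# TYPE ephemeral_volume_controller_create counter
ephemeral_volume_controller_create{reason=""} 10
ephemeral_volume_controller_create{reason="Unknown"} 2
```

Note how the success case surfaces as an empty reason="" series in this layout, which is part of what the later discussion is about.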
ah nvm. I also see this captured in release-notes.
Before we merge this change - do we know if this metric can be extracted from the api-server logs or the api-server somehow? I would think that api-server logs should contain some information about incoming PVC creation requests (maybe not directly as a metric).
The ask from the production readiness review was explicitly to add metrics. Extracting the same information via log analysis might work, but it's simply not the same thing.
cc @kubernetes/sig-instrumentation-members @logicalhan
I understand. I am a bit concerned about creating approximately 20-25 new time series for this metric because reason is being used as a dimension.
I have no strong opinion either way. I can implement whatever is considered appropriate for this feature. We can also reduce the status to just success/failure; status just seemed a bit more flexible. If we use a boolean, what are a good name and values for such a dimension?
		reason = "Unknown"
	}
}
ephemeralvolumemetrics.EphemeralVolumeCreate.WithLabelValues(reason).Inc()
Is "reason" practically bound here? What kind of values are we expecting here for this to be valuable?
@@ -279,6 +283,14 @@ func (ec *ephemeralController) handleVolume(pod *v1.Pod, vol v1.Volume) error {
		Spec: ephemeral.VolumeClaimTemplate.Spec,
	}
	_, err = ec.kubeClient.CoreV1().PersistentVolumeClaims(pod.Namespace).Create(context.TODO(), pvc, metav1.CreateOptions{})
	reason := ""
Having "reason" possibly be an empty string is going to make querying this metric awkward. I think we probably want a label that distinguishes between success and failure, or separate metrics for each of those.
I'm fine with a label "success" that has values "yes" and "no". Is that okay or are there better names? I am not sure what is used elsewhere.
The best practice is to have two metrics, one for failures and one for all. Then you can naturally query failures / total and understand the percentage. The number of series you produce is identical, so this doesn't put any more load on the metrics system than what we already have here.
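With that two-counter layout, the failure percentage is a single PromQL query. Assuming hypothetical metric names (the PR had not settled on names at this point in the thread), it might look like:

```promql
rate(ephemeral_volume_controller_create_failures_total[5m])
  / rate(ephemeral_volume_controller_create_total[5m])
```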
I've changed it to count ephemeral_volume_controller_create_total and ephemeral_volume_controller_create_failures. Is that okay?
Note that my updated tests initially failed because GetCounterMetricValue didn't work for a Counter. I'm including the fix for that (= adding the missing Metric interface implementation).
Yep, that works!
/remove-sig api-machinery
/retest
/remove-sig api-machinery
/lgtm
/retest
Just a tiny naming change to align with guidelines
EphemeralVolumeCreateFailures = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem: EphemeralVolumeSubsystem,
		Name:      "create_failures",
I apologize for not catching this earlier. The guidelines are to always end counters with _total, so this one needs that as well.
Done, please take another look.
Force-pushed from 5ef4b7d to c3add2d
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: brancz, pohly. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
As discussed during the production readiness review, a metric for the PVC create operations is useful. The "ephemeral_volume" workqueue metrics were already added in the initial implementation. The new code follows the example set by the endpoints controller.
GetCounterMetricValue has an unchecked conversion to metrics.Metric, something that didn't work for a *Counter because it didn't implement that interface.
A CounterVector with status as a label may create unnecessary overhead, and using the success case with the empty label value wasn't easy. It's better to have two separate counters, one for the total number of calls and one for failed calls.
/retest
Force-pushed from c3add2d to 98f7529
/lgtm
/retest
Review the full test history for this PR. Silence the bot with an
What type of PR is this?
/kind feature
What this PR does / why we need it:
As discussed during the production readiness review, a metric for the
PVC create operations is useful. The "ephemeral_volume" workqueue
metrics were already added in the initial implementation.
Special notes for your reviewer:
The new code follows the example set by the endpoints controller.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig storage