New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
exec credential provider: add rest_client_exec_plugin_call_total metric #98892
Conversation
Hi @ankeesler. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/sig auth |
7d4ae6b
to
b2f3409
Compare
execPluginCalls = k8smetrics.NewCounterVec( | ||
&k8smetrics.CounterOpts{ | ||
Name: "rest_client_exec_plugin_call_total", | ||
Help: "Number of calls to an exec plugin, partitioned by exit code. A special exit code of -1 " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to document a contract for what the exec authenticator does when it encounters exec plugin errors that it doesn't know how to process (e.g., the go stdlib failed to copy IO streams (very unlikely), the user does not have permission to exec the plugin). I figured this was a good enough place to document it since this help text will most likely be displayed to prometheus metric users in some fashion? For users that have a different metrics system...not sure where to document this contract...
I think we are safe with using negative numbers as special values, since POSIX uses the low-order 8 (unsigned) bits of an int
return value to indicate the status, and Windows treats return values as unsigned integers. See https://en.wikipedia.org/wiki/Exit_status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do this elsewhere too actually.
@liggitt may know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another question here - do we need to document this metric publicly in any k/website (or elsewhere) docs? I am not familiar with how kube metrics are documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are safe with using negative numbers as special values, since POSIX uses the low-order 8 (unsigned) bits of an int return value to indicate the status, and Windows treats return values as unsigned integers. See https://en.wikipedia.org/wiki/Exit_status.
...
I think we do this elsewhere too actually.
@liggitt may know?
umm.... I ... don't know
do we need to document this metric publicly in any k/website (or elsewhere) docs?
In the KEP is a good starting point (https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/541-external-credential-providers/README.md#metrics). including in the user-doc as well would make sense to me
b2f3409
to
7308c50
Compare
"k8s.io/klog/v2" | ||
) | ||
|
||
// The following constants shadow the special values used in the prometheus metrics implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can lock this contract in place with an integration test that uses prometheus metrics?
/test pull-kubernetes-integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics looks okay to me, but I find the test a bit difficult to parse (mostly because of there is so much state).
pluginNotFoundExitCode = -1 | ||
// pluginReturnedStrangeErrorExitCode is the exit code passed to the metrics.ExecPluginCalls counter when a | ||
// call to an exec plugin returns an error that the code does not understand. | ||
pluginReturnedStrangeErrorExitCode = -2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/pluginReturnedStrangeErrorExitCode/unrecognizedPluginExitCode/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
execPluginCalls = k8smetrics.NewCounterVec( | ||
&k8smetrics.CounterOpts{ | ||
Name: "rest_client_exec_plugin_call_total", | ||
Help: "Number of calls to an exec plugin, partitioned by exit code. A special exit code of -1 " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do this elsewhere too actually.
@liggitt may know?
/test pull-kubernetes-e2e-kind-ipv6 I think that test flake was this: #99147. |
@liggitt - any chance you have time this week to do a sweep of this PR? |
execPluginCalls = k8smetrics.NewCounterVec( | ||
&k8smetrics.CounterOpts{ | ||
Name: "rest_client_exec_plugin_call_total", | ||
Help: "Number of calls to an exec plugin, partitioned by exit code. A special exit code of -1 " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are safe with using negative numbers as special values, since POSIX uses the low-order 8 (unsigned) bits of an int return value to indicate the status, and Windows treats return values as unsigned integers. See https://en.wikipedia.org/wiki/Exit_status.
...
I think we do this elsewhere too actually.
@liggitt may know?
umm.... I ... don't know
do we need to document this metric publicly in any k/website (or elsewhere) docs?
In the KEP is a good starting point (https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/541-external-credential-providers/README.md#metrics). including in the user-doc as well would make sense to me
// pluginNotFoundExitCode is the exit code passed to the metrics.ExecPluginCalls counter when a | ||
// call to an exec plugin fails because the exec plugin is not found. | ||
pluginNotFoundExitCode = -1 | ||
// unrecognizedPluginExitCode is the exit code passed to the metrics.ExecPluginCalls counter when | ||
// a call to an exec plugin returns an error that the code does not understand. | ||
unrecognizedPluginExitCode = -2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems overly clever to me... admission webhooks have a similar issue where an API call can be rejected in admission because 1) there was an API server error, 2) there was an error invoking the webhook, or 3) the admission webhook was called and correctly rejected the request.
They represent those cases with an extra label in the webhook_rejection_count metric (error_type = calling_webhook_error | apiserver_internal_error | no_error), rather than try to overload the code label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @liggitt - i took a pass at this in e0b334b4204d62a3e7b13d6c20e15c226438624c
i tried to take the concept you shared and adapt it to this particular metric that we are mucking with. hopefully i got it somewhere close to where we want it.
dd21c7c
to
e0b334b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
// incrementCallsMetric increments a global metrics counter for the number of calls to an exec | ||
// plugin, partitioned by exit code. The provided err should be the return value from | ||
// exec.Cmd.Run(). | ||
func incrementCallsMetric(err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest using https://golang.org/pkg/errors/#As for these two error types to be able to handle wrapped errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
18b42aeab316651173d904b57827099712e6493f
metrics.ExecPluginCalls.Increment(err.(*exec.ExitError).ExitCode(), pluginCalledStatus) | ||
|
||
case *exec.Error: // Binary does not exist (see exec.Error). | ||
metrics.ExecPluginCalls.Increment(0, pluginNotFoundStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest setting code to 1 for pluginNotFoundStatus and clientInternalError... that allows someone to monitor non-0 codes for all unsuccessful invocations, and partition by the event type to understand the reason for the failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2da2732ae9cd5e83b01fceb50ce1c4606e0add28
"client_internal_error) and an optional exit code which is set with " + | ||
"plugin_called.", | ||
}, | ||
[]string{"code", "call_status"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is call_status a good description for this label, or would error_type
be clearer (and set to no_error
when err is nil)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i went back and forth with myself on call_status
vs error_type
. i think i landed on call_status
since it better matched the metric name (rest_client_exec_plugin_*call*_total
) and so we could get away with 3 enum values for this field instead of 4. i figured that if someone wanted to filter down to all the unsuccessful executions, they could just say exitCode != 0 && call_status == plugin_called
.
i'm not totally wed to this. i'll switch it over to no_error
and just let me know if you change your mind: 9e947251851a1b44ae4c7c11457932cf151b1467
metrics.ExecPluginCalls.Increment(0, pluginNotFoundStatus) | ||
|
||
default: // We don't know about this error type. | ||
klog.Warningf("unexpected exec plugin return error type: %T (error = %q)", err, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log lines in a metrics record method are a little unexpected... I'd probably drop this... the only thing it affects is the recorded metric, and people who care will be watching the metric, right?
edit: I guess we'd maybe want to know the type so we can fix the unexpected type, but this seems like it could be a lower-level message... Info level 2 or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e921dc9049ed37fd53424c2aeaea42df9f4db718
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
9e94725
to
31eec29
Compare
pull-kubernetes-dependencies failure is unrelated to this PR and is being worked on /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ankeesler, liggitt, logicalhan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Review the full test history for this PR. Silence the bot with an |
i heard a rumor these test issues were resolved now? let's see /retest |
/hold |
/hold cancel |
Signed-off-by: Andrew Keesler akeesler@vmware.com
What type of PR is this?
/kind feature
What this PR does / why we need it:
rest_client_exec_plugin_call_total
metric to client-go which tracks the number of calls to any exec plugin and partitions the calls by exit code.Which issue(s) this PR fixes:
Does this PR introduce a user-facing change?: