Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exec credential provider: add rest_client_exec_plugin_call_total metric #98892

Merged
merged 1 commit into from Mar 4, 2021

Conversation

ankeesler
Copy link
Contributor

Signed-off-by: Andrew Keesler akeesler@vmware.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

  • Add rest_client_exec_plugin_call_total metric to client-go which tracks the number of calls to any exec plugin and partitions the calls by exit code.

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

A client-go metric, rest_client_exec_plugin_call_total, has been added to track total calls to client-go credential plugins.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 8, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @ankeesler. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 8, 2021
@ankeesler
Copy link
Contributor Author

/sig auth

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 8, 2021
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Feb 8, 2021
execPluginCalls = k8smetrics.NewCounterVec(
&k8smetrics.CounterOpts{
Name: "rest_client_exec_plugin_call_total",
Help: "Number of calls to an exec plugin, partitioned by exit code. A special exit code of -1 " +
Copy link
Contributor Author

@ankeesler ankeesler Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to document a contract for what the exec authenticator does when it encounters exec plugin errors that it doesn't know how to process (e.g., the go stdlib failed to copy IO streams (very unlikely), the user does not have permission to exec the plugin). I figured this was a good enough place to document it since this help text will most likely be displayed to prometheus metric users in some fashion? For users that have a different metrics system...not sure where to document this contract...

I think we are safe with using negative numbers as special values, since POSIX uses the low-order 8 (unsigned) bits of an int return value to indicate the status, and Windows treats return values as unsigned integers. See https://en.wikipedia.org/wiki/Exit_status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do this elsewhere too actually.

@liggitt may know?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question here - do we need to document this metric publicly in any k/website (or elsewhere) docs? I am not familiar with how kube metrics are documented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are safe with using negative numbers as special values, since POSIX uses the low-order 8 (unsigned) bits of an int return value to indicate the status, and Windows treats return values as unsigned integers. See https://en.wikipedia.org/wiki/Exit_status.

...
I think we do this elsewhere too actually.
@liggitt may know?

umm.... I ... don't know

do we need to document this metric publicly in any k/website (or elsewhere) docs?

In the KEP is a good starting point (https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/541-external-credential-providers/README.md#metrics). including in the user-doc as well would make sense to me

"k8s.io/klog/v2"
)

// The following constants shadow the special values used in the prometheus metrics implementation.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can lock this contract in place with an integration test that uses prometheus metrics?

@ankeesler ankeesler changed the title WIP: exec credential provider: add rest_client_exec_plugin_call_total metric exec credential provider: add rest_client_exec_plugin_call_total metric Feb 9, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2021
@ankeesler
Copy link
Contributor Author

/cc @liggitt @enj @logicalhan @ehashman

@ankeesler
Copy link
Contributor Author

/test pull-kubernetes-integration

@liggitt liggitt added this to the v1.21 milestone Feb 18, 2021
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics looks okay to me, but I find the test a bit difficult to parse (mostly because of there is so much state).

pluginNotFoundExitCode = -1
// pluginReturnedStrangeErrorExitCode is the exit code passed to the metrics.ExecPluginCalls counter when a
// call to an exec plugin returns an error that the code does not understand.
pluginReturnedStrangeErrorExitCode = -2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/pluginReturnedStrangeErrorExitCode/unrecognizedPluginExitCode/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

execPluginCalls = k8smetrics.NewCounterVec(
&k8smetrics.CounterOpts{
Name: "rest_client_exec_plugin_call_total",
Help: "Number of calls to an exec plugin, partitioned by exit code. A special exit code of -1 " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do this elsewhere too actually.

@liggitt may know?

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 19, 2021
@ankeesler
Copy link
Contributor Author

/test pull-kubernetes-e2e-kind-ipv6

I think that test flake was this: #99147.

@ankeesler
Copy link
Contributor Author

@liggitt - any chance you have time this week to do a sweep of this PR?

execPluginCalls = k8smetrics.NewCounterVec(
&k8smetrics.CounterOpts{
Name: "rest_client_exec_plugin_call_total",
Help: "Number of calls to an exec plugin, partitioned by exit code. A special exit code of -1 " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are safe with using negative numbers as special values, since POSIX uses the low-order 8 (unsigned) bits of an int return value to indicate the status, and Windows treats return values as unsigned integers. See https://en.wikipedia.org/wiki/Exit_status.

...
I think we do this elsewhere too actually.
@liggitt may know?

umm.... I ... don't know

do we need to document this metric publicly in any k/website (or elsewhere) docs?

In the KEP is a good starting point (https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/541-external-credential-providers/README.md#metrics). including in the user-doc as well would make sense to me

Comment on lines 33 to 38
// pluginNotFoundExitCode is the exit code passed to the metrics.ExecPluginCalls counter when a
// call to an exec plugin fails because the exec plugin is not found.
pluginNotFoundExitCode = -1
// unrecognizedPluginExitCode is the exit code passed to the metrics.ExecPluginCalls counter when
// a call to an exec plugin returns an error that the code does not understand.
unrecognizedPluginExitCode = -2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems overly clever to me... admission webhooks have a similar issue where an API call can be rejected in admission because 1) there was an API server error, 2) there was an error invoking the webhook, or 3) the admission webhook was called and correctly rejected the request.

They represent those cases with an extra label in the webhook_rejection_count metric (error_type = calling_webhook_error | apiserver_internal_error | no_error), rather than try to overload the code label

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @liggitt - i took a pass at this in e0b334b4204d62a3e7b13d6c20e15c226438624c

i tried to take the concept you shared and adapt it to this particular metric that we are mucking with. hopefully i got it somewhere close to where we want it.

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2021
// incrementCallsMetric increments a global metrics counter for the number of calls to an exec
// plugin, partitioned by exit code. The provided err should be the return value from
// exec.Cmd.Run().
func incrementCallsMetric(err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest using https://golang.org/pkg/errors/#As for these two error types to be able to handle wrapped errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

18b42aeab316651173d904b57827099712e6493f

metrics.ExecPluginCalls.Increment(err.(*exec.ExitError).ExitCode(), pluginCalledStatus)

case *exec.Error: // Binary does not exist (see exec.Error).
metrics.ExecPluginCalls.Increment(0, pluginNotFoundStatus)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest setting code to 1 for pluginNotFoundStatus and clientInternalError... that allows someone to monitor non-0 codes for all unsuccessful invocations, and partition by the event type to understand the reason for the failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2da2732ae9cd5e83b01fceb50ce1c4606e0add28

"client_internal_error) and an optional exit code which is set with " +
"plugin_called.",
},
[]string{"code", "call_status"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is call_status a good description for this label, or would error_type be clearer (and set to no_error when err is nil)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i went back and forth with myself on call_status vs error_type. i think i landed on call_status since it better matched the metric name (rest_client_exec_plugin_*call*_total) and so we could get away with 3 enum values for this field instead of 4. i figured that if someone wanted to filter down to all the unsuccessful executions, they could just say exitCode != 0 && call_status == plugin_called.

i'm not totally wed to this. i'll switch it over to no_error and just let me know if you change your mind: 9e947251851a1b44ae4c7c11457932cf151b1467

metrics.ExecPluginCalls.Increment(0, pluginNotFoundStatus)

default: // We don't know about this error type.
klog.Warningf("unexpected exec plugin return error type: %T (error = %q)", err, err.Error())
Copy link
Member

@liggitt liggitt Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log lines in a metrics record method are a little unexpected... I'd probably drop this... the only thing it affects is the recorded metric, and people who care will be watching the metric, right?

edit: I guess we'd maybe want to know the type so we can fix the unexpected type, but this seems like it could be a lower-level message... Info level 2 or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e921dc9049ed37fd53424c2aeaea42df9f4db718

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2021
Signed-off-by: Andrew Keesler <akeesler@vmware.com>
@liggitt
Copy link
Member

liggitt commented Mar 3, 2021

pull-kubernetes-dependencies failure is unrelated to this PR and is being worked on

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ankeesler, liggitt, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2021
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ankeesler
Copy link
Contributor Author

i heard a rumor these test issues were resolved now? let's see

/retest

@dims
Copy link
Member

dims commented Mar 4, 2021

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2021
@dims
Copy link
Member

dims commented Mar 4, 2021

/hold cancel
/retest

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2021
@k8s-ci-robot k8s-ci-robot merged commit cd54b19 into kubernetes:master Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants