Fix aggregator_unavailable_apiservice gauge #96421

Merged

Conversation

dgrisonnet
Member

@dgrisonnet dgrisonnet commented Nov 10, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

When an apiservice is deleted, its corresponding aggregator_unavailable_apiservice metric keeps the last availability value that was observed. Hence, if an apiservice is deleted while it is unavailable, the metric keeps reporting it as unavailable.
This causes problems when alerting on unavailable apiservices, because a deleted apiservice can keep the alert firing indefinitely.
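
To make the failure mode concrete, here is a minimal sketch in plain Go. It does not use the real Prometheus client or the kube-aggregator code: `gaugeVec`, `setAvailability`, and the apiservice name are hypothetical stand-ins, and the convention that 1 means unavailable is an assumption made for illustration.

```go
package main

import "fmt"

// gaugeVec is a hypothetical, simplified stand-in for a labeled gauge:
// it keeps one value per apiservice name until that entry is explicitly removed.
type gaugeVec map[string]float64

// setAvailability records the latest availability observed for an apiservice.
// Assumption for this sketch: 1 means unavailable, 0 means available.
func (g gaugeVec) setAvailability(name string, available bool) {
	if available {
		g[name] = 0
		return
	}
	g[name] = 1
}

func main() {
	gauge := gaugeVec{}

	// The apiservice is observed as unavailable (the name is illustrative).
	gauge.setAvailability("v1beta1.example.io", false)

	// The apiservice is then deleted, but nothing removes its entry,
	// so the stale "unavailable" value is still exposed on the next scrape.
	fmt.Println(gauge)
}
```

Because the entry outlives the object it describes, any alert built on this value would keep seeing the deleted apiservice as unavailable.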

To solve this issue, the aggregator_unavailable_apiservice metric should only reflect the availability of existing apiservices.

This is achievable by using a custom Collector instead of a GaugeVec and creating throw-away metrics from the output of an apiservice lister. With this approach, a deleted apiservice is no longer listed, so its availability metric is no longer exposed.
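
The idea can be sketched in plain Go as follows. This is a simplified model under stated assumptions, not the code from this PR: `APIService`, `Lister`, `Sample`, and `Collect` are hypothetical names, the real Prometheus Collector interface is not used, and the value convention (1 for unavailable) is assumed.

```go
package main

import (
	"fmt"
	"sort"
)

// APIService is a hypothetical, simplified stand-in for the real object.
type APIService struct {
	Name      string
	Available bool
}

// Lister is a hypothetical stand-in for an apiservice lister:
// it returns only the apiservices that currently exist.
type Lister func() []APIService

// Sample is one throw-away metric sample produced during a scrape.
type Sample struct {
	Name  string  // apiservice the sample refers to
	Value float64 // assumption: 1 means unavailable, 0 means available
}

// Collect rebuilds the samples from the lister output on every call,
// so no state is kept between scrapes.
func Collect(list Lister) []Sample {
	services := list()
	samples := make([]Sample, 0, len(services))
	for _, svc := range services {
		value := 0.0
		if !svc.Available {
			value = 1.0
		}
		samples = append(samples, Sample{Name: svc.Name, Value: value})
	}
	// Sort for a stable order, since the lister may iterate over a map.
	sort.Slice(samples, func(i, j int) bool { return samples[i].Name < samples[j].Name })
	return samples
}

func main() {
	// Illustrative in-memory store backing the lister.
	store := map[string]APIService{
		"v1beta1.example.io": {Name: "v1beta1.example.io", Available: false},
	}
	list := func() []APIService {
		out := make([]APIService, 0, len(store))
		for _, svc := range store {
			out = append(out, svc)
		}
		return out
	}

	fmt.Println(len(Collect(list))) // one sample while the apiservice exists

	delete(store, "v1beta1.example.io")
	fmt.Println(len(Collect(list))) // no sample once it is deleted
}
```

Since the samples are derived from the current listing rather than stored, there is nothing to clean up on deletion.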

Which issue(s) this PR fixes:

Fixes #92671

Special notes for your reviewer:

The alert that I previously mentioned is the AggregatedAPIDown alert from the kubernetes-mixin project: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/fe19e65e64183f4a54a4584c792608851f68a1e9/alerts/kube_apiserver.libsonnet#L88-L101
In my opinion, this alert is correct and the problem lies in the aggregator_unavailable_apiservice metric, which is why I opened this PR.

Does this PR introduce a user-facing change?:

Fixed a bug where aggregator_unavailable_apiservice metrics were reported for deleted apiservices.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig instrumentation

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 10, 2020
@k8s-ci-robot
Contributor

Hi @dgrisonnet. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 10, 2020
@dgrisonnet
Member Author

/assign @lavalamp

@fedebongio
Contributor

/assign @deads2k
/unassign @lavalamp
/triage accepted
/cc @logicalhan

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 10, 2020
@sttts
Contributor

sttts commented Nov 11, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 11, 2020
@deads2k
Contributor

deads2k commented Nov 11, 2020

/assign @sttts
/unassign

@k8s-ci-robot k8s-ci-robot assigned sttts and unassigned deads2k Nov 11, 2020
@dgrisonnet
Member Author

@sttts I've updated this PR according to our previous discussions. Can you please have another look at it?
I removed most of the global state, but we still need to keep a global sync.Once to avoid registering the metrics multiple times in the global registry. As you suggested, I moved this global closer to where we call the metric registration, and it indeed looks much better. With this change, it should now be possible to register the metrics with multiple registries without being limited by the global registry constraints.
Also, the metrics are now owned by the controller, and clients are not leaked into the collection logic anymore. All the logic is now self-contained in the collector itself.
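
A rough sketch of that registration pattern, in plain Go, could look like the following. It is only an illustration under stated assumptions: `Registry`, `NewRegistry`, `RegisterGlobally`, and `globalRegistry` are hypothetical names that model a registry rejecting duplicate registrations, and they are not the component-base or Prometheus APIs used in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Registry is a hypothetical, simplified registry that rejects duplicates.
type Registry struct {
	mu    sync.Mutex
	names map[string]bool
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{names: map[string]bool{}}
}

// Register records a metric name and fails if it was already registered.
func (r *Registry) Register(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.names[name] {
		return errors.New("duplicate registration: " + name)
	}
	r.names[name] = true
	return nil
}

// The only remaining global state: the shared registry and a sync.Once
// that guards registration into it.
var (
	globalRegistry = NewRegistry()
	registerOnce   sync.Once
)

// RegisterGlobally registers the metric in the global registry at most once,
// no matter how many times it is called.
func RegisterGlobally(name string) {
	registerOnce.Do(func() {
		if err := globalRegistry.Register(name); err != nil {
			panic(err)
		}
	})
}

func main() {
	const metric = "aggregator_unavailable_apiservice"

	RegisterGlobally(metric)
	RegisterGlobally(metric) // no-op: the sync.Once has already fired

	// A separate registry is not constrained by the global one.
	custom := NewRegistry()
	fmt.Println(custom.Register(metric) == nil)
}
```

Keeping the `sync.Once` next to the registration call limits the global state to the one place that actually needs it.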

@dgrisonnet dgrisonnet force-pushed the fix-apiservice-availability branch 2 times, most recently from 1a0a6b5 to 5c2cb0d Compare November 18, 2020 22:08
When an apiservice is deleted, its corresponding
aggregator_unavailable_apiservice metric keeps the last availability
value that was observed. Hence, if an apiservice is deleted while it is
unavailable, the metric keeps reporting it as unavailable.
This causes problems when alerting on unavailable apiservices,
because a deleted apiservice can keep the alert firing indefinitely.

To solve this issue, the aggregator_unavailable_apiservice metric should
only reflect the availability of existing apiservices.

This is achievable by using a custom Collector instead of a GaugeVec and
creating throw-away metrics from the output of an apiservice lister.
With this approach, a deleted apiservice is no longer listed, so its
availability metric is no longer exposed.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@dgrisonnet dgrisonnet changed the title WIP: Fix aggregator_unavailable_apiservice gauge Fix aggregator_unavailable_apiservice gauge Nov 19, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2020
@sttts
Contributor

sttts commented Nov 19, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 19, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 19, 2020
@sttts sttts added this to the v1.20 milestone Nov 26, 2020
@k8s-ci-robot k8s-ci-robot merged commit 5ed4b76 into kubernetes:master Nov 26, 2020
@ialidzhikov
Contributor

@dgrisonnet, does it make sense to also cherry-pick this PR into release-1.17, release-1.18 and release-1.19?

@dgrisonnet
Member Author

@ialidzhikov, I am not too familiar with the k/k backport process, but I don't think it makes sense here. Even though the bug exists in past releases, it's a corner case with fairly low impact, so I don't think it is worth cherry-picking.

@dgrisonnet dgrisonnet deleted the fix-apiservice-availability branch November 27, 2020 15:38
@sttts
Contributor

sttts commented Jan 11, 2021

This needs backports.

k8s-ci-robot added a commit that referenced this pull request Feb 4, 2021
…6421-upstream-release-1.18

Automated cherry pick of #96421: kube-aggregator: fix apiservice availability gauge
k8s-ci-robot added a commit that referenced this pull request Feb 15, 2021
…6421-upstream-release-1.19

Automated cherry pick of #96421: kube-aggregator: fix apiservice availability gauge
Development

Successfully merging this pull request may close these issues.

After the aggregation api is deleted, the deleted api still exists in the kube-apiserver metrics
7 participants