
Updating EndpointSlice controller to wait for cache to be updated #99345

Merged

Conversation

@robscott (Member) commented Feb 23, 2021

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This updates the EndpointSlice controller to make use of the EndpointSlice tracker to identify when expected changes are not yet present in the informer cache. When that is detected, the controller waits to sync until all expected updates have been received. This should help avoid race conditions that would result in duplicate EndpointSlices or failed attempts to update stale EndpointSlices. To simplify this logic, the EndpointSlice tracker now relies on generations instead of resource versions.
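
As a rough sketch of the idea (illustrative only; the tracker interface, method names, and error type below are assumptions, not this PR's actual code):

```go
package endpointslice

import (
	"fmt"

	discovery "k8s.io/api/discovery/v1beta1"
)

// staleInformerCacheErr signals that the informer cache has not yet caught
// up with writes this controller already made. (Hypothetical name.)
type staleInformerCacheErr struct{ service string }

func (e *staleInformerCacheErr) Error() string {
	return fmt.Sprintf("EndpointSlice informer cache is out of date for %s", e.service)
}

// tracker stands in for the EndpointSlice tracker: it remembers the
// generation the controller expects each slice to reach after its last write.
type tracker interface {
	StaleSlices(service string, cachedSlices []*discovery.EndpointSlice) bool
}

// syncService gates reconciliation on cache freshness: rather than acting on
// stale data (which could create duplicate slices or try to update outdated
// ones), it returns an error so the key is requeued and retried later.
func syncService(t tracker, service string, cachedSlices []*discovery.EndpointSlice) error {
	if t.StaleSlices(service, cachedSlices) {
		return &staleInformerCacheErr{service: service}
	}
	// ... normal reconcile against a cache known to be up to date ...
	return nil
}
```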

Which issue(s) this PR fixes:

Fixes #92928

Special notes for your reviewer:

This is a follow-up to #93520 that shares some similar concepts but is simpler thanks to some tips from @lavalamp. The EndpointSlice mirroring controller will need an identical change. The EndpointSlice trackers in each controller should really be merged, but I didn't want to bundle that into this change, both for simplicity and to ease potential backporting.

Does this PR introduce a user-facing change?

EndpointSlice controller is now less likely to emit FailedToUpdateEndpointSlices events.

/sig network
/cc @bowei @swetharepakula @aojea
/assign @lavalamp

@k8s-ci-robot added the release-note label (Denotes a PR that will be considered when it comes time to generate release notes.) on Feb 23, 2021
@k8s-ci-robot added the kind/cleanup label (Categorizes issue or PR as related to cleaning up code, process, or technical debt.) on Feb 23, 2021
@k8s-ci-robot (Contributor)

@robscott: GitHub didn't allow me to request PR reviews from the following users: swetharepakula.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the sig/network (Categorizes an issue or PR as relevant to SIG Network.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Feb 23, 2021
@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.), and needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.) labels on Feb 23, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and sig/apps (Categorizes an issue or PR as relevant to SIG Apps.) labels on Feb 23, 2021
@robscott (Member, Author)

/priority important-soon
/triage accepted

@k8s-ci-robot added the priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) and triage/accepted (Indicates an issue or PR is ready to be actively worked on.) labels, and removed the needs-priority and needs-triage labels, on Feb 23, 2021
@robscott (Member, Author)

/retest

@wojtek-t (Member) left a comment

I've only looked at the non-test code so far, but I really like this approach.

```diff
@@ -280,6 +280,12 @@ func (c *Controller) handleErr(err error, key interface{}) {
 		return
 	}
 
+	if isStaleInformerCacheErr(err) {
```
@aojea (Member) commented Feb 23, 2021

I'm very ignorant in the workqueue area, but should we check the NumRequeues too?
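
For reference, the bounded-retry pattern that NumRequeues enables in client-go looks roughly like this (a sketch; maxRetries and the struct layout are assumptions, not this PR's code):

```go
package endpointslice

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/util/workqueue"
)

// maxRetries is an illustrative bound, not a value taken from this PR.
const maxRetries = 15

type controller struct {
	queue workqueue.RateLimitingInterface
}

// handleErr shows the common pattern: requeue with rate-limited backoff
// until NumRequeues crosses the bound, then drop the key and report the error.
func (c *controller) handleErr(err error, key interface{}) {
	if err == nil {
		c.queue.Forget(key) // reset the retry counter on success
		return
	}
	if c.queue.NumRequeues(key) < maxRetries {
		c.queue.AddRateLimited(key) // exponential backoff between attempts
		return
	}
	utilruntime.HandleError(err)
	c.queue.Forget(key)
}
```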

@robscott (Member, Author)

Good question - I added a comment here to clarify that this was intentional. I think we'd want to wait indefinitely for the informer cache to update but that may be naive. @lavalamp do you have any tips here?

@lavalamp (Member)

If the informer cache can't update you'll have bigger problems. That kind of problem should be caught by monitoring the metrics.

@robscott (Member, Author)

@lavalamp What kind of metric could I expose here? If I simply increment a counter every time we run into this error I think we'd end up with a lot of noise. Maybe a metric that tracks every time we go past max retries and give up for a Service?

@robscott (Member, Author)

Update - I've removed this extra check and this error will not get any kind of special treatment.

@lavalamp (Member)

If the only possible cause is the informer not having received an update, that will be reflected in existing metrics (it will be missing all updates in that case). If there's any chance of another kind of error, my first thought is to make two gauge metrics, one that counts the number of items in this condition, and one that adds up how many seconds those items have been in the condition. That way you can tell the difference between 1 persistently bad item and having a different item in this condition every time (which is not a problem).
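
A sketch of what those two gauges could look like (hypothetical metric and type names, using the plain Prometheus client rather than Kubernetes' component-base wrappers):

```go
package endpointslice

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Both metric names are invented for illustration.
var (
	staleItems = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "endpoint_slice_stale_cache_items",
		Help: "Number of services currently blocked on a stale informer cache.",
	})
	staleSeconds = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "endpoint_slice_stale_cache_seconds",
		Help: "Combined seconds the currently blocked services have been blocked.",
	})
)

func init() {
	prometheus.MustRegister(staleItems, staleSeconds)
}

type staleTracker struct {
	mu    sync.Mutex
	since map[string]time.Time // service key -> when it entered the stale state
}

func newStaleTracker() *staleTracker {
	return &staleTracker{since: map[string]time.Time{}}
}

func (t *staleTracker) markStale(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.since[key]; !ok {
		t.since[key] = time.Now()
		staleItems.Inc()
	}
}

func (t *staleTracker) markFresh(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.since[key]; ok {
		delete(t.since, key)
		staleItems.Dec()
	}
}

// refresh recomputes the age gauge; it would run on a fixed interval,
// for example from a dedicated goroutine, which the next comment asks about.
func (t *staleTracker) refresh() {
	t.mu.Lock()
	defer t.mu.Unlock()
	var total float64
	for _, start := range t.since {
		total += time.Since(start).Seconds()
	}
	staleSeconds.Set(total)
}
```

Together, the two gauges distinguish one persistently stuck service (item count stays at 1 while the seconds gauge keeps growing) from a rotating cast of briefly stale services (item count fluctuates but the seconds gauge stays low).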

@robscott (Member, Author)

I think right now the other potential cause of an issue here would be someone manually setting an EndpointSlice generation lower than a previous value. As long as we can seal that up with a separate PR to prevent that in the strategy logic, I don't think we'll need much in the way of extra metrics.

For the two gauge metrics would you start a separate goroutine that did these calculations at some kind of fixed interval for all Services?

I've been thinking about adding a new metric that would track the number of syncs the controller has performed along with the outcome as a label with 3 potential values: success, staleCache, and error. That would not be as helpful in identifying outliers but would provide some broad insight into the health of the controller.

Since I'm interested in attempting to backport this, I think adding a metric directly to this PR would complicate that, so maybe this can be done as a follow up.
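
A sketch of the sync-outcome metric described above (the name, subsystem, and label values mirror this comment, not necessarily what was eventually merged), using the component-base metrics wrappers Kubernetes controllers typically register with:

```go
package metrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// EndpointSliceSyncs counts controller syncs by outcome. The "result" label
// takes one of three values: "success", "staleCache", or "error".
var EndpointSliceSyncs = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Subsystem:      "endpoint_slice_controller",
		Name:           "syncs",
		Help:           "Number of EndpointSlice syncs, by result.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"result"},
)

func init() {
	legacyregistry.MustRegister(EndpointSliceSyncs)
}
```

A sync loop would then record an outcome with, for example, EndpointSliceSyncs.WithLabelValues("staleCache").Inc().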

@robscott force-pushed the endpointslice-wait-for-cache branch 2 times, most recently from 8c5a4b6 to 5f4c4c0 on February 23, 2021 at 18:38
@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the size/L label on Feb 23, 2021
@robscott force-pushed the endpointslice-wait-for-cache branch from c0ca7cd to e154260 on March 2, 2021 at 17:44
@robscott (Member, Author) commented Mar 2, 2021

After a bit more conversation with @wojtek-t I've moved away from RWMutex entirely in favor of Mutex since lock contention is unlikely to be a significant issue here.

@wojtek-t (Member) commented Mar 2, 2021

> After a bit more conversation with @wojtek-t I've moved away from RWMutex entirely in favor of Mutex since lock contention is unlikely to be a significant issue here.

Just to explain for posterity: RWMutex is more expensive than a plain Mutex in CPU terms, so unless it provides a visibly better experience (e.g. reduced end-to-end latency), it's not worth using. Given that the endpointslice controller is subject to QPS limits on the writes it produces (and that the critical sections here are very small), RWMutex doesn't make much sense.
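
A minimal sketch of the tracker shape under discussion, assuming a generation map guarded by a plain sync.Mutex (field and method names are illustrative, not this PR's code):

```go
package endpointslice

import (
	"sync"

	discovery "k8s.io/api/discovery/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// generationsForService maps each EndpointSlice's UID to the generation the
// controller expects to observe for it in the informer cache.
type generationsForService map[types.UID]int64

// endpointSliceTracker uses a plain sync.Mutex: the critical sections are
// tiny, so the cheaper Mutex wins over RWMutex, per the discussion above.
type endpointSliceTracker struct {
	lock sync.Mutex
	// generationsByService is keyed by the owning Service's namespace/name.
	generationsByService map[string]generationsForService
}

func newEndpointSliceTracker() *endpointSliceTracker {
	return &endpointSliceTracker{
		generationsByService: map[string]generationsForService{},
	}
}

// expectGeneration records the generation a just-written slice should reach.
func (t *endpointSliceTracker) expectGeneration(service string, slice *discovery.EndpointSlice) {
	t.lock.Lock()
	defer t.lock.Unlock()
	if t.generationsByService[service] == nil {
		t.generationsByService[service] = generationsForService{}
	}
	t.generationsByService[service][slice.UID] = slice.Generation
}

// upToDate reports whether the cached slice is at least as new as expected.
func (t *endpointSliceTracker) upToDate(service string, slice *discovery.EndpointSlice) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	gens, ok := t.generationsByService[service]
	if !ok {
		return true // nothing expected for this service
	}
	expected, ok := gens[slice.UID]
	return !ok || slice.Generation >= expected
}
```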

@wojtek-t (Member) commented Mar 2, 2021

/lgtm

Thanks!

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Mar 2, 2021
@k8s-ci-robot (Contributor)

@robscott: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| pull-kubernetes-integration | e154260 | /test pull-kubernetes-integration |
| pull-kubernetes-e2e-gce-ubuntu-containerd | e154260 | /test pull-kubernetes-e2e-gce-ubuntu-containerd |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aojea (Member) commented Mar 2, 2021

/retest
Some of these are known flakes, but this test failure worries me:

Kubernetes e2e suite: [sig-network] Networking Granular Checks: Services should function for service endpoints using hostNetwork

@k8s-ci-robot merged commit ee90db5 into kubernetes:master on Mar 2, 2021
@k8s-ci-robot added this to the v1.21 milestone on Mar 2, 2021
@BenTheElder (Member)

robscott added a commit to robscott/kubernetes that referenced this pull request Mar 4, 2021
…dated

This matches the recent updates to the EndpointSliceTracker for the
EndpointSlice controller in kubernetes#99345 that accomplished the same thing.
robscott added a commit to robscott/kubernetes that referenced this pull request Mar 11, 2021
…dated

This matches the recent updates to the EndpointSliceTracker for the
EndpointSlice controller in kubernetes#99345 that accomplished the same thing.
robscott added a commit to robscott/kubernetes that referenced this pull request Mar 11, 2021
…dated

This matches the recent updates to the EndpointSliceTracker for the
EndpointSlice controller in kubernetes#99345 that accomplished the same thing.
k8s-ci-robot added a commit that referenced this pull request Mar 12, 2021
…345-release-1.18

Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be
k8s-ci-robot added a commit that referenced this pull request Mar 12, 2021
…345-release-1.20

Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be
k8s-ci-robot added a commit that referenced this pull request Mar 12, 2021
…345-release-1.19

Automated cherry pick of #99345: Updating EndpointSlice controller to wait for cache to be
Labels
approved, cncf-cla: yes, kind/bug, kind/cleanup, lgtm, priority/important-soon, release-note, sig/apps, sig/network, size/XL, triage/accepted
Development
Successfully merging this pull request may close: "FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service" (#92928)
7 participants