DaemonSet controller respects MaxSurge during update #96441
Conversation
/sig apps
Force-pushed from c6422a6 to 6c69080
Added
/assign @kubernetes/sig-apps-pr-reviews
Force-pushed from 6c69080 to ccf4697
Force-pushed from ccf4697 to 2811d52
Force-pushed from 2811d52 to 63ae01c
Force-pushed from 91fb5f6 to d3cfd7f
/test pull-kubernetes-e2e-gce-alpha-features
Ok, this is ready for review (all known issues are addressed, the test is stable, logging should be accurate and at the right levels). I'm fairly confident that this is in alpha shape at least.
/remove-sig api-machinery |
/triage accepted |
This lgtm, leaving final tag to @kow3ns
It is too easy to omit checking the return value of the syncAndValidateDaemonSet test helper in large suites. Switch the method to a test helper that fatals/errors directly. Also rename a method that referenced the old name 'Rollback' instead of 'RollingUpdate'.
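A minimal sketch of the helper pattern described above, with hypothetical internals (the real helper drives the controller's sync loop); the point is that taking *testing.T and calling t.Helper()/t.Fatalf makes an unchecked return value impossible:

```go
package daemonset

import "testing"

// Before (hypothetical shape): callers could silently drop the returned error.
//
//	func syncAndValidateDaemonSet(...) error { ... }
//
// After: the helper fails the test itself, so a forgotten check is impossible.
func syncAndValidateDaemonSet(t *testing.T, expectedCreates, expectedDeletes int) {
	t.Helper() // report failures at the caller's line, not here

	creates, deletes := runFakeSync() // hypothetical stand-in for the real sync
	if creates != expectedCreates {
		t.Fatalf("expected %d creates, got %d", expectedCreates, creates)
	}
	if deletes != expectedDeletes {
		t.Fatalf("expected %d deletes, got %d", expectedDeletes, deletes)
	}
}

func runFakeSync() (int, int) { return 0, 0 } // placeholder for illustration
```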
While this is correct in order of operations, it is harder to read and masks the intent of the user without the parentheses.
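The expression under review isn't shown in this excerpt, but a hypothetical example illustrates the point: both forms compute the same value under Go's precedence rules, and only one states the intent.

```go
package main

import "fmt"

func main() {
	desired, maxSurgePercent := 10, 25

	// Correct, but the reader must recall operator precedence:
	a := desired * maxSurgePercent / 100
	// Same result; the parentheses make multiply-then-divide explicit:
	b := (desired * maxSurgePercent) / 100

	fmt.Println(a, b) // 2 2
}
```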
If MaxSurge is set, the controller will attempt to double up nodes with a new pod, up to the allowed limit, and then trigger deletion of the old pod once the most recent (by hash) pod is ready. If the old pod goes unready before the new pod is ready, the old pod is deleted immediately. If an old pod goes unready before a new pod has been placed on that node, a new pod is added for that node immediately, even past the MaxSurge limit. The backoff clock is used consistently throughout the DaemonSet controller as an injectable clock for testing purposes.
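A simplified sketch of the per-node decision described above, with invented types and helper names (the actual controller works over pod lists and node maps); surgeBudget stands in for however many more nodes may still be doubled up:

```go
package main

import "fmt"

type nodeState struct {
	hasOldPod, oldPodReady bool
	hasNewPod, newPodReady bool
}

// decide returns whether to create an updated pod and/or delete the old pod
// on one node during a MaxSurge rolling update.
func decide(s nodeState, surgeBudget int) (create, deleteOld bool) {
	switch {
	case s.hasOldPod && !s.oldPodReady && !s.hasNewPod:
		// Old pod already unready: replace it immediately, even past MaxSurge.
		return true, true
	case s.hasOldPod && !s.hasNewPod && surgeBudget > 0:
		// Double up this node within the surge budget.
		return true, false
	case s.hasNewPod && (s.newPodReady || !s.oldPodReady):
		// New pod ready (or old pod went unready first): drop the old pod.
		return false, s.hasOldPod
	}
	return false, false
}

func main() {
	healthy := nodeState{hasOldPod: true, oldPodReady: true}
	fmt.Println(decide(healthy, 1)) // true false: surge a new pod onto the node

	surged := nodeState{hasOldPod: true, oldPodReady: true, hasNewPod: true, newPodReady: true}
	fmt.Println(decide(surged, 0)) // false true: new pod ready, delete the old one
}
```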
In order to maintain the correct invariants, the existing maxUnavailable logic calculated the same data several times in different ways. Leverage the simpler structure from maxSurge and calculate pod availability only once, performing a single pass over all the pods in the daemonset. This changes no behavior of the current controller, and the resulting structure is almost identical to maxSurge.
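A minimal sketch of the single-pass classification this describes, with hypothetical types (the real code works on *v1.Pod and the template hash label): every pod is bucketed exactly once, and both the surge and unavailability decisions read from the resulting counts.

```go
package daemonset

import "time"

type pod struct {
	updatedHash bool // pod was created from the updated template
	ready       bool
	readySince  time.Time
}

type counts struct {
	oldAvailable, oldUnavailable int
	newAvailable, newUnavailable int
}

// classify walks the daemonset's pods exactly once, computing availability
// a single time per pod instead of re-deriving it for each decision.
func classify(pods []pod, minReady time.Duration, now time.Time) counts {
	var c counts
	for _, p := range pods {
		available := p.ready && now.Sub(p.readySince) >= minReady
		switch {
		case p.updatedHash && available:
			c.newAvailable++
		case p.updatedHash:
			c.newUnavailable++
		case available:
			c.oldAvailable++
		default:
			c.oldUnavailable++
		}
	}
	return c
}
```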
The nodeShouldRunDaemonPod method does not need to return an error because there are no scenarios under which it can fail. Remove the error return path from its direct callers as well.
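An illustrative before/after of that signature cleanup, with placeholder types; the exact parameters and return values in the real controller may differ:

```go
package daemonset

type (
	node      struct{}
	daemonSet struct{}
)

// Before (illustrative): an error that no code path could produce forced
// every caller to handle it.
//
//	func nodeShouldRunDaemonPod(n *node, ds *daemonSet) (shouldRun, shouldContinueRunning bool, err error)
//
// After: the impossible error return is dropped and direct callers simplify.
func nodeShouldRunDaemonPod(n *node, ds *daemonSet) (shouldRun, shouldContinueRunning bool) {
	// ...predicate checks that cannot fail...
	return true, true
}
```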
Force-pushed from d3cfd7f to 8d8884a
/refresh
Having heard no objections for the past few weeks, I'm tagging this as-is.
/lgtm
/retest
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smarterclayton, soltysh.
/retest
If MaxSurge is set, the controller will attempt to launch updated pods on up to MaxSurge nodes and wait for them to be ready before deleting the old pods. If any old pod goes unready during update, a new pod is added to the node (regardless of MaxSurge) and the old pod is deleted.
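For context, a sketch of how a DaemonSet opts into this behavior using the k8s.io/api types; in the release this PR targets, maxSurge on DaemonSets is alpha and sits behind a feature gate, and validation restricts how it combines with maxUnavailable (assumed here to be 0):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	surge := intstr.FromString("25%") // may also be an absolute count, e.g. intstr.FromInt(1)
	zero := intstr.FromInt(0)         // leave no unavailability budget when surging

	strategy := appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxSurge:       &surge,
			MaxUnavailable: &zero,
		},
	}
	fmt.Printf("%+v\n", strategy)
}
```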
This implements the logic behind #96375
/kind feature