Indexed job implementation #98812

Merged
k8s-ci-robot merged 1 commit into kubernetes:master on Mar 3, 2021

Conversation

alculquicondor
Member

@alculquicondor alculquicondor commented Feb 5, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

/sig apps
/area batch
/area workload-api/job

Implements support for Indexed Job.

For an Indexed Job, the job controller creates Pods that each have an associated index from 0 to (.spec.completions-1). Each Pod's index is recorded as an annotation on the Pod.
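
A minimal sketch of how a consumer might read that index back from a Pod's annotations. The annotation key is the alpha name quoted in the API comment reviewed later in this PR; the helper name `completionIndex` and its signature are hypothetical and not taken from the controller code.

```go
package main

import "strconv"

// Alpha annotation key as quoted in the API comment under review in this PR.
const completionIndexAnnotation = "batch.alpha.kubernetes.io/job-completion-index"

// completionIndex is a hypothetical helper. It returns the completion index
// stored in a Pod's annotations, and false if the annotation is missing or
// is not a non-negative integer.
func completionIndex(annotations map[string]string) (int, bool) {
	v, ok := annotations[completionIndexAnnotation]
	if !ok {
		return 0, false
	}
	ix, err := strconv.Atoi(v)
	if err != nil || ix < 0 {
		return 0, false
	}
	return ix, true
}
```

Returning a boolean instead of an error keeps the "Pod without a usable index" case explicit, which is the case the implementation notes below treat specially.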

Other implementation details:

  • When the IndexedJob feature gate is disabled, the job controller doesn't process indexed jobs.
  • The job controller creates Pods for the lowest indexes that don't have active or succeeded pods.
  • If there is more than one pod for an index, the controller removes all but one, using the same strategy it applies when the number of active pods exceeds parallelism.
  • Active pods that don't have an index are removed.
  • Finished pods that don't have an index don't count towards failures or successes, but they are not removed.
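
The "lowest indexes" rule in the list above can be sketched as follows. This is an illustration under simplifying assumptions, not the controller's code: `firstPendingIndexes` is a hypothetical name, and the sets of active and succeeded indexes are modeled as plain maps.

```go
package main

// firstPendingIndexes is a hypothetical helper. It returns up to count of the
// lowest indexes in [0, completions) that have neither an active nor a
// succeeded Pod, in ascending order.
func firstPendingIndexes(active, succeeded map[int]bool, count, completions int) []int {
	if count <= 0 {
		return nil
	}
	pending := make([]int, 0, count)
	for ix := 0; ix < completions && len(pending) < count; ix++ {
		// An index needs a new Pod only if nothing is running for it
		// and nothing has already succeeded for it.
		if !active[ix] && !succeeded[ix] {
			pending = append(pending, ix)
		}
	}
	return pending
}
```

For example, with completions of 6, index 1 active, and indexes 0 and 3 succeeded, asking for two indexes yields 2 and 4.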

Which issue(s) this PR fixes:

Fixes #97169 #14188

Ref kubernetes/enhancements#2214

Special notes for your reviewer:

Builds on top of #98441
Integration test to follow

Does this PR introduce a user-facing change?:

Support for Indexed Job: a Job that is considered completed when Pods associated with indexes from 0 to (.spec.completions-1) have succeeded.
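
As a rough illustration of that completion rule, assuming the succeeded indexes are tracked in a map (the helper name `indexedJobComplete` is hypothetical):

```go
package main

// indexedJobComplete is a hypothetical helper. It reports whether every index
// in [0, completions) has at least one succeeded Pod.
func indexedJobComplete(succeeded map[int]bool, completions int) bool {
	for ix := 0; ix < completions; ix++ {
		if !succeeded[ix] {
			return false
		}
	}
	return true
}
```

In this sketch, extra succeeded entries outside the range are ignored, and a single missing index keeps the Job incomplete.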

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job
- [Usage]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#indexed-job

@k8s-ci-robot k8s-ci-robot added the following labels Feb 5, 2021:

  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • kind/feature (Categorizes issue or PR as related to a new feature.)
  • size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
  • sig/apps (Categorizes an issue or PR as relevant to SIG Apps.)
  • area/batch
  • area/workload-api/job
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
  • area/kubectl
  • kind/api-change (Categorizes issue or PR as related to adding, removing, or otherwise changing an API)
  • sig/cli (Categorizes an issue or PR as relevant to SIG CLI.)

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@alculquicondor
Member Author

cc @ahg-g @adtac @soltysh

// The Job is considered complete when there is one successfully completed Pod
// for each index.
// When value is `Indexed`, .spec.completions must be specified and
// `.spec.parallelism` must be less than or equal to 10^5.
Member

It helps to explain why there's a limit

Member Author

That's a design decision. I don't think it's good to make it part of user documentation (user docs are built from this file).

// Job get an associated completion index from 0 to (.spec.completions - 1),
// available in the annotation batch.alpha.kubernetes.io/job-completion-index.
// The Job is considered complete when there is one successfully completed Pod
// for each index.
Member

is it possible for a pod to successfully complete more than once? if so, should job authors design their pods around this (idempotency)?

Member Author

There is a separate PR for API changes #98441

Idempotency requirement is already documented for regular Jobs.

// This field is alpha-level and is only honored by servers that enable the
// IndexedJob feature gate. More completion modes can be added in the future.
// If the Job controller observes a mode that it doesn't recognize, the
// controller skips updates for the Job.
Member

@adtac adtac Feb 8, 2021

hmm, not sure if silently failing is the right thing to do.

edit: the validation code seems to raise an error? please update the doc to reflect that

Member Author

This is about the controller, not the apiserver. The apiserver would reject the unknown value.

This is a safeguard to add more completion modes in the future. The idea is that if the controller is one version behind during an upgrade, it has some way of handling a value it doesn't know.
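
A minimal sketch of that safeguard, combined with the feature-gate behavior from the PR description. The type and constant names below are local stand-ins declared for this example, the function `shouldSync` is hypothetical, and treating an empty mode like the non-indexed one is an assumption.

```go
package main

// CompletionMode is a local stand-in for the API type, declared here so the
// example is self-contained.
type CompletionMode string

const (
	NonIndexedCompletion CompletionMode = "NonIndexed"
	IndexedCompletion    CompletionMode = "Indexed"
)

// shouldSync is a hypothetical helper. It reports whether a controller that
// only knows the two modes above should process a Job with the given mode.
func shouldSync(mode CompletionMode, indexedJobEnabled bool) bool {
	switch mode {
	case "", NonIndexedCompletion:
		// Assumption: an unset mode behaves like the non-indexed mode.
		return true
	case IndexedCompletion:
		// Indexed Jobs are only processed while the feature gate is on.
		return indexedJobEnabled
	default:
		// A mode added by a newer API version: skip the Job rather than
		// handle it with semantics this controller does not understand.
		return false
	}
}
```

The default branch is the part that matters during a skewed upgrade: an older controller leaves the Job untouched until a controller that understands the mode takes over.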

pkg/controller/job/indexed_job_utils.go (outdated, resolved)
	phase v1.PodPhase
}

func hasFailingPods(status []indexPhase) bool {
Member

unused

Member Author

It's used in job_controller_test.go
moved

pkg/controller/job/indexed_job_utils.go (resolved)
pkg/controller/job/job_controller.go (outdated, resolved)
func capIndexesList(indexes string, softLimit int) string {
	ix := softLimit
	for ; ix < len(indexes); ix++ {
		if indexes[ix] == ',' {
Member

this will panic if len(indexes) < softLimit

Member Author

line 2192 doesn't let that happen
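
For readers following this thread, here is a self-contained sketch of the truncation being discussed. The signature, the loop header, and the tail come from the fragments quoted in this review; the loop body (stop at the first comma at or after softLimit) is an assumption, since the diff elides it. The sketch also assumes softLimit is non-negative.

```go
package main

// capIndexesList truncates a comma-separated list of indexes at the first
// comma found at or after softLimit and appends "..." to mark the cut.
// The loop body is reconstructed and may differ from the merged code.
func capIndexesList(indexes string, softLimit int) string {
	ix := softLimit
	for ; ix < len(indexes); ix++ {
		if indexes[ix] == ',' {
			break // assumed: cut only at an element boundary
		}
	}
	if ix >= len(indexes) {
		// No comma at or after softLimit, or the input is shorter than
		// softLimit: return the list unchanged.
		return indexes
	}
	return indexes[:ix+1] + "..."
}
```

In this sketch a list shorter than softLimit never enters the loop and is returned as is, so the slice expression is only reached when ix is a valid position in the string.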

staging/src/k8s.io/kubectl/pkg/describe/describe.go (outdated, resolved)
	if ix >= len(indexes) {
		return indexes
	}
	return indexes[:ix+1] + "..."
Member

@adtac adtac Feb 8, 2021

do we expect indexes to be completed in order? i.e. is index N likely to complete before index N+K for large K?

if so, since we're cutting the string off anyway, the last 50 chars might be more informational to know how much is complete (although I'll concede that it'll be a little weird to read at first)

WDYT?

Member Author

if parallelism is smaller than completions, yes, smaller indexes will complete earlier.

What should happen is that the first interval should contain most of the values, like 0-123456. The rest of the values would be more sporadic. I think the first interval actually gives you the most information: all indexes up to 123456 have completed, and there are some other completions here and there.
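
To make the interval notation concrete, here is a sketch of how a sorted list of completed indexes could be rendered as intervals such as 0-3,5,7-8. This is an illustration only: `compressIndexes` is a hypothetical name, the input is assumed to be sorted and free of duplicates, and the exact format used by the PR may differ.

```go
package main

import (
	"strconv"
	"strings"
)

// compressIndexes is a hypothetical helper. It renders a sorted,
// duplicate-free list of indexes as comma-separated intervals, writing a run
// of consecutive values as "first-last" and a lone value as itself.
func compressIndexes(sorted []int) string {
	var b strings.Builder
	for i := 0; i < len(sorted); {
		// Extend j to the end of the run of consecutive values starting at i.
		j := i
		for j+1 < len(sorted) && sorted[j+1] == sorted[j]+1 {
			j++
		}
		if b.Len() > 0 {
			b.WriteByte(',')
		}
		b.WriteString(strconv.Itoa(sorted[i]))
		if j > i {
			b.WriteByte('-')
			b.WriteString(strconv.Itoa(sorted[j]))
		}
		i = j + 1
	}
	return b.String()
}
```

When low indexes finish first, most of the information lands in the leading interval, which is why truncating the tail of the string loses comparatively little.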

@alculquicondor
Member Author

/retest

@alculquicondor
Member Author

/retest

@alculquicondor
Member Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 10, 2021
@ahg-g
Member

ahg-g commented Feb 11, 2021

/assign

just so I can see it on my backlog

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2021
@alculquicondor alculquicondor force-pushed the indexed-job branch 2 times, most recently from 90f2524 to e850ccb Compare February 11, 2021 20:47
@alculquicondor
Member Author

/retest

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2021
@alculquicondor
Member Author

/retest

Contributor

@soltysh soltysh left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2021
@soltysh
Contributor

soltysh commented Feb 23, 2021

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2021
@soltysh
Contributor

soltysh commented Feb 23, 2021

/priority backlog

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 23, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2021
@alculquicondor
Member Author

/retest

When .spec.completionMode="Indexed"
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 3, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2021
@ahg-g
Member

ahg-g commented Mar 3, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2021
@k8s-ci-robot k8s-ci-robot merged commit afb1ee3 into kubernetes:master Mar 3, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 3, 2021
Labels
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
  • area/batch
  • area/kubectl
  • area/workload-api/job
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/api-change (Categorizes issue or PR as related to adding, removing, or otherwise changing an API)
  • kind/feature (Categorizes issue or PR as related to a new feature.)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
  • priority/backlog (Higher priority than priority/awaiting-more-evidence.)
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • sig/apps (Categorizes an issue or PR as relevant to SIG Apps.)
  • sig/cli (Categorizes an issue or PR as relevant to SIG CLI.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Development

Successfully merging this pull request may close these issues.

Support ArrayJob semantics in the job api
6 participants