Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cronjob_controller: add metrics for job creation skew duration #99341

Merged
merged 1 commit into from Mar 2, 2021

Conversation

alaypatel07
Copy link
Contributor

@alaypatel07 alaypatel07 commented Feb 23, 2021

What type of PR is this?

kind feature

What this PR does / why we need it:

It adds the following metrics to the new cronjob controller:

  • histogram for cronjob_job_creation_skew_duration_seconds

Which issue(s) this PR fixes:

Part of kubernetes/enhancements#19

Special notes for your reviewer:

/assign @soltysh

Addresses

Skew (actualJobCreationTime-expectedJobCreationTime) - Histogram

In order to avoid merge conflicts, this PR depends on #97098

Does this PR introduce a user-facing change?

Adds two new metrics to cronjobs, a  histogram to track the time difference when a job is created and the expected time when it should be created, and a gauge for the missed schedules of a cronjob

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP] - https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/19-Graduate-CronJob-to-Stable/README.md

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @alaypatel07. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 23, 2021
Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test
/kind feature
/priority important-soon
/triage accepted
/assign @dashpole
for instrumentation

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 24, 2021
CronjobCreationSkew = metrics.NewHistogram(
&metrics.HistogramOpts{
Subsystem: CronjobControllerV2Subsystem,
Name: "cronjob_creation_skew",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should end in duration_seconds to indicate the unit, and match other duration metrics. See https://prometheus.io/docs/practices/naming/#metric-names for where this originally comes from.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about cronjob_job_creation_skew_duration_seconds so that it better express what skew we're talking about. Alternatively cronjob_job_creation_duration_seconds, but I'm leaning towards the former.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like cronjob_job_creation_skew_duration_seconds, will update accordingly

CronjobCreationSkew = metrics.NewHistogram(
&metrics.HistogramOpts{
Subsystem: CronjobControllerV2Subsystem,
Name: "cronjob_creation_skew",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should end in duration_seconds to indicate the unit, and match other duration metrics. See https://prometheus.io/docs/practices/naming/#metric-names for where this originally comes from.

&metrics.HistogramOpts{
Subsystem: CronjobControllerV2Subsystem,
Name: "cronjob_creation_skew",
Help: "Time skew (in seconds) for each cronjob from when it was expected and when it actually did",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't need to describe the units (which are captured in the name), and the dimensions (which are labels) in the description.

Perhaps Time between when a cronjob is scheduled to be run, and when the corresponding job is created

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please.

case err != nil:
// default error handling
jm.recorder.Eventf(cj, corev1.EventTypeWarning, "FailedCreate", "Error creating job: %v", err)
return cj, nil, err
}

metrics.CronjobCreationSkew.Observe(time.Since(*scheduledTime).Seconds())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to measure from the scheduled time to the creationTimestamp of the job object. That way the metric matches exactly what is observed by users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashpole I am not sure. The thing that trips me is, if the user wants to look at creation timestamp, why is it obvious for the user to look at the job creationTimestamp and not at the pod/s that is running the actual code for the job?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh well, the description you suggested makes it clear. if the user wants to see the metrics for pods, they should be looking at jobs metrics, not cronjob metrics, +1 I think I answered my own questions, will update, +1

Copy link
Contributor

@dashpole dashpole Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in general, I just want to be able to easily tell exactly what the metric is measuring from the description, and make sure that it actually is measuring what you say it is measuring. :)

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also one other metric that might be reasonable to expose it's missed start times, specifically right after calculating how many schedules we missed https://github.com/kubernetes/kubernetes/pull/97098/files#diff-3ad904bf8e9924ef15174726c413fd6d845bd0b9ec4fb344763c2caba15985f0R184

"k8s.io/component-base/metrics/legacyregistry"
)

const CronjobControllerV2Subsystem = "cronjob_controller_v2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit CronJob to match resource name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call the sub-system just cronjob_controller I don't want to expose to user V2 vs V1, especially that only V2 has metrics, and the other is on his way to be removed.

const CronjobControllerV2Subsystem = "cronjob_controller_v2"

var (
CronjobCreationSkew = metrics.NewHistogram(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

CronjobCreationSkew = metrics.NewHistogram(
&metrics.HistogramOpts{
Subsystem: CronjobControllerV2Subsystem,
Name: "cronjob_creation_skew",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about cronjob_job_creation_skew_duration_seconds so that it better express what skew we're talking about. Alternatively cronjob_job_creation_duration_seconds, but I'm leaning towards the former.

&metrics.HistogramOpts{
Subsystem: CronjobControllerV2Subsystem,
Name: "cronjob_creation_skew",
Help: "Time skew (in seconds) for each cronjob from when it was expected and when it actually did",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2021
@alaypatel07 alaypatel07 changed the title add metrics to cronjob controller v2 cronjob_controller: add metrics for job creation and missed schedules Feb 26, 2021
CronJobMissedSchedulesTotal = metrics.NewGaugeVec(
&metrics.GaugeOpts{
Subsystem: CronJobControllerSubsystem,
Name: "cronjob_missed_schedules",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the _total suffix to this (from https://prometheus.io/docs/practices/naming/#metric-names)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it looks like this is the current number of missed schedules, so the total suffix isn't required, so you can ignore the previous comment. I'd suggest cronjob_current_missed_schedules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the naming conventions and found this node_memory_usage_bytes. IMO, schedules sounds like the right unit for this metric. Considering this if you still feel otherwise, please let me know I can append _total to it. @dashpole

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. schedules is the correct unit. You can optionally add _current_ to the name if you want to make it clearer that it is an instantaneous, rather than cumulative metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is going to be instantaneous. I think schedules will be missed if the controller is overloaded or dead (not running for some reason). The controller will only learn about missed schedules after the fact, in retrospect, once it has recovered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By instantaneous, I just mean not-cumulative. A "point-in-time" measurement, rather than a "sum-over-time" measurement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alaypatel07 I'm assuming you're taking that other metrics as a followup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soltysh yes, I plan to follow up.

Help: "Number of schedules missed for a cronjob before a job successfully was created",
StabilityLevel: metrics.ALPHA,
},
[]string{"namespace", "name"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are useful labels, but will create a memory leak in the controller, since label combinations are only ever added (with .Set( below), and aren't ever removed. You can either:

  1. Remove all unbounded dimensions (namespace + name), or
  2. Implement the StableCollector interface, which allows you to determine at scrape time which metric streams to omit. This would allow you to "drop" unwanted streams by not emitting them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the namespace and name are important metadata for the missed schedules metric to make sense, because from the namespace and name a user can look at what the spec.schedule value is and understand how severe the missed schedules are. For example, 5 missed schedules for a corn job scheduled every hour and 5 missed schedules for a cron scheduled every week has a very different severity level.

Since there is non-trivial work involved in adding this metric, do you mind if I take adding this metric as a follow-up? @dashpole

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just here to make sure you don't add memory leaks :). Its up to the reviewers/approvers of the controller code which metrics are required for the PR to be merged. I'm happy to review separate PRs for each metric.

@alaypatel07 alaypatel07 changed the title cronjob_controller: add metrics for job creation and missed schedules cronjob_controller: add metrics for job creation skew duration Mar 1, 2021
@alaypatel07
Copy link
Contributor Author

/retest

@alaypatel07
Copy link
Contributor Author

Flake

/test pull-kubernetes-unit

Name: "cronjob_job_creation_skew_duration_seconds",
Help: "Time between when a cronjob is scheduled to be run, and when the corresponding job is created",
StabilityLevel: metrics.ALPHA,
Buckets: []float64{1, 2.5, 5, 10, 15, 25, 50, 120, 300, 600, 1200},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about metrics.ExponentialBuckets(1, 2, 10) that will give us buckets: 1s, 2s, 4s, 8s, 16s, 32s, 1m, 2m, 4m, 8m+ which I think should be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's start simple with this one metric, the other one is debatable so it'll happen in a separate PR
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 2, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2021
@soltysh
Copy link
Contributor

soltysh commented Mar 2, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2021
@dashpole
Copy link
Contributor

dashpole commented Mar 2, 2021

/approve
for sig-instrumentation

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alaypatel07, dashpole, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alaypatel07
Copy link
Contributor Author

/test pull-kubernetes-dependencies

@alaypatel07
Copy link
Contributor Author

/test pull-kubernetes-verify

/test pull-kubernetes-unit

@k8s-ci-robot k8s-ci-robot merged commit 94d5019 into kubernetes:master Mar 2, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants