cronjob_controller: add metrics for job creation skew duration #99341

alaypatel07 · 2021-02-23T04:08:44Z

What type of PR is this?

kind feature

What this PR does / why we need it:

It adds the following metrics to the new cronjob controller:

histogram for cronjob_job_creation_skew_duration_seconds

Which issue(s) this PR fixes:

Part of kubernetes/enhancements#19

Special notes for your reviewer:

/assign @soltysh

Addresses

Skew (actualJobCreationTime-expectedJobCreationTime) - Histogram

In order to avoid merge conflicts, this PR depends on #97098

Does this PR introduce a user-facing change?

Adds two new metrics to cronjobs: a histogram tracking the difference between the time a job is created and the time it was expected to be created, and a gauge for the missed schedules of a cronjob

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP] - https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/19-Graduate-CronJob-to-Stable/README.md

k8s-ci-robot · 2021-02-23T04:08:52Z

Hi @alaypatel07. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ehashman

/ok-to-test
/kind feature
/priority important-soon
/triage accepted
/assign @dashpole
for instrumentation

dashpole · 2021-02-24T17:18:15Z

pkg/controller/cronjob/metrics/metrics.go

+	CronjobCreationSkew = metrics.NewHistogram(
+		&metrics.HistogramOpts{
+			Subsystem:      CronjobControllerV2Subsystem,
+			Name:           "cronjob_creation_skew",


This should end in duration_seconds to indicate the unit, and match other duration metrics. See https://prometheus.io/docs/practices/naming/#metric-names for where this originally comes from.

How about cronjob_job_creation_skew_duration_seconds so that it better expresses what skew we're talking about. Alternatively cronjob_job_creation_duration_seconds, but I'm leaning towards the former.

I like cronjob_job_creation_skew_duration_seconds, will update accordingly

dashpole · 2021-02-24T17:29:09Z

pkg/controller/cronjob/metrics/metrics.go

+		&metrics.HistogramOpts{
+			Subsystem:      CronjobControllerV2Subsystem,
+			Name:           "cronjob_creation_skew",
+			Help:           "Time skew (in seconds) for each cronjob from when it was expected and when it actually did",


nit: you don't need to describe the units (which are captured in the name), and the dimensions (which are labels) in the description.

Perhaps Time between when a cronjob is scheduled to be run, and when the corresponding job is created

yes, please.

dashpole · 2021-02-24T17:32:28Z

pkg/controller/cronjob/cronjob_controllerv2.go

 	case err != nil:
 		// default error handling
 		jm.recorder.Eventf(cj, corev1.EventTypeWarning, "FailedCreate", "Error creating job: %v", err)
 		return cj, nil, err
 	}
+
+	metrics.CronjobCreationSkew.Observe(time.Since(*scheduledTime).Seconds())


It might be better to measure from the scheduled time to the creationTimestamp of the job object. That way the metric matches exactly what is observed by users.

@dashpole I am not sure. What trips me up is: if the user wants to look at a creation timestamp, why is it obvious that they should look at the job's creationTimestamp and not at that of the pods running the actual code for the job?

Ahh well, the description you suggested makes it clear. If the user wants to see metrics for pods, they should be looking at job metrics, not cronjob metrics. I think I answered my own question, will update. +1

Yeah, in general, I just want to be able to easily tell exactly what the metric is measuring from the description, and make sure that it actually is measuring what you say it is measuring. :)

soltysh

There's also one other metric that might be reasonable to expose: missed start times, specifically right after calculating how many schedules we missed https://github.com/kubernetes/kubernetes/pull/97098/files#diff-3ad904bf8e9924ef15174726c413fd6d845bd0b9ec4fb344763c2caba15985f0R184

soltysh · 2021-02-25T19:22:14Z

pkg/controller/cronjob/metrics/metrics.go

+	"k8s.io/component-base/metrics/legacyregistry"
+)
+
+const CronjobControllerV2Subsystem = "cronjob_controller_v2"


Nit CronJob to match resource name.

Call the subsystem just cronjob_controller. I don't want to expose V2 vs V1 to users, especially since only V2 has metrics and the other is on its way to being removed.

soltysh · 2021-02-25T19:22:19Z

pkg/controller/cronjob/metrics/metrics.go

+const CronjobControllerV2Subsystem = "cronjob_controller_v2"
+
+var (
+	CronjobCreationSkew = metrics.NewHistogram(


soltysh · 2021-02-25T19:27:40Z

pkg/controller/cronjob/metrics/metrics.go

+	CronjobCreationSkew = metrics.NewHistogram(
+		&metrics.HistogramOpts{
+			Subsystem:      CronjobControllerV2Subsystem,
+			Name:           "cronjob_creation_skew",


How about cronjob_job_creation_skew_duration_seconds so that it better expresses what skew we're talking about. Alternatively cronjob_job_creation_duration_seconds, but I'm leaning towards the former.

soltysh · 2021-02-25T19:28:14Z

pkg/controller/cronjob/metrics/metrics.go

+		&metrics.HistogramOpts{
+			Subsystem:      CronjobControllerV2Subsystem,
+			Name:           "cronjob_creation_skew",
+			Help:           "Time skew (in seconds) for each cronjob from when it was expected and when it actually did",


yes, please.

dashpole · 2021-03-01T15:27:47Z

pkg/controller/cronjob/metrics/metrics.go

+	CronJobMissedSchedulesTotal = metrics.NewGaugeVec(
+		&metrics.GaugeOpts{
+			Subsystem:      CronJobControllerSubsystem,
+			Name:           "cronjob_missed_schedules",


please add the _total suffix to this (from https://prometheus.io/docs/practices/naming/#metric-names)

Ah, it looks like this is the current number of missed schedules, so the total suffix isn't required, so you can ignore the previous comment. I'd suggest cronjob_current_missed_schedules.

I was looking at the naming conventions and found node_memory_usage_bytes as an example. IMO, schedules sounds like the right unit for this metric. If you still feel otherwise, please let me know and I can append _total to it. @dashpole

You are correct. schedules is the correct unit. You can optionally add _current_ to the name if you want to make it clearer that it is an instantaneous, rather than cumulative metric.

I don't think it is going to be instantaneous. I think schedules will be missed if the controller is overloaded or dead (not running for some reason). The controller will only learn about missed schedules after the fact, in retrospect, once it has recovered.

By instantaneous, I just mean not-cumulative. A "point-in-time" measurement, rather than a "sum-over-time" measurement.

@alaypatel07 I'm assuming you're taking that other metric as a follow-up?

@soltysh yes, I plan to follow up.

dashpole · 2021-03-01T15:40:10Z

pkg/controller/cronjob/metrics/metrics.go

+			Help:           "Number of schedules missed for a cronjob before a job successfully was created",
+			StabilityLevel: metrics.ALPHA,
+		},
+		[]string{"namespace", "name"},


These are useful labels, but will create a memory leak in the controller, since label combinations are only ever added (with .Set() below) and are never removed. You can either:

Remove all unbounded dimensions (namespace + name), or

Implement the StableCollector interface, which allows you to determine at scrape time which metric streams to omit. This would allow you to "drop" unwanted streams by not emitting them.

I think the namespace and name are important metadata for the missed schedules metric to make sense, because from the namespace and name a user can look at what the spec.schedule value is and understand how severe the missed schedules are. For example, 5 missed schedules for a cron job scheduled every hour and 5 missed schedules for a cron job scheduled every week have very different severity levels.

Since there is non-trivial work involved in adding this metric, do you mind if I take adding this metric as a follow-up? @dashpole

I'm just here to make sure you don't add memory leaks :). It's up to the reviewers/approvers of the controller code which metrics are required for the PR to be merged. I'm happy to review separate PRs for each metric.

alaypatel07 · 2021-03-01T23:11:08Z

/retest

alaypatel07 · 2021-03-02T02:14:48Z

Flake

/test pull-kubernetes-unit

soltysh · 2021-03-02T13:00:44Z

pkg/controller/cronjob/metrics/metrics.go

+			Name:           "cronjob_job_creation_skew_duration_seconds",
+			Help:           "Time between when a cronjob is scheduled to be run, and when the corresponding job is created",
+			StabilityLevel: metrics.ALPHA,
+			Buckets:        []float64{1, 2.5, 5, 10, 15, 25, 50, 120, 300, 600, 1200},


How about metrics.ExponentialBuckets(1, 2, 10) that will give us buckets: 1s, 2s, 4s, 8s, 16s, 32s, 1m, 2m, 4m, 8m+ which I think should be sufficient.

soltysh

let's start simple with this one metric, the other one is debatable so it'll happen in a separate PR
/lgtm
/approve

soltysh · 2021-03-02T15:42:24Z

/lgtm

dashpole · 2021-03-02T15:46:40Z

/approve
for sig-instrumentation

k8s-ci-robot · 2021-03-02T15:47:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alaypatel07, dashpole, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/cronjob/OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alaypatel07 · 2021-03-02T16:59:22Z

/test pull-kubernetes-dependencies

alaypatel07 · 2021-03-02T17:04:18Z

/test pull-kubernetes-verify

/test pull-kubernetes-unit

k8s-ci-robot assigned soltysh Feb 23, 2021

k8s-ci-robot requested review from kow3ns and mortent February 23, 2021 04:09

alaypatel07 force-pushed the metrics branch from eca783b to 216566b Compare February 23, 2021 06:05

ehashman reviewed Feb 24, 2021

k8s-ci-robot assigned dashpole Feb 24, 2021

dashpole reviewed Feb 24, 2021

kendallroden mentioned this pull request Feb 24, 2021

CronJobs (previously ScheduledJobs) kubernetes/enhancements#19

Closed

16 tasks

soltysh requested changes Feb 25, 2021

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2021

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2021

alaypatel07 force-pushed the metrics branch from a7ed169 to 9d6c4af Compare February 26, 2021 22:48

alaypatel07 changed the title ~~add metrics to cronjob controller v2~~ cronjob_controller: add metrics for job creation and missed schedules Feb 26, 2021

alaypatel07 force-pushed the metrics branch 3 times, most recently from 1416ac1 to 1ffe78c Compare February 26, 2021 23:15

alaypatel07 mentioned this pull request Feb 27, 2021

cronjob: update deps #99515

Closed

dashpole reviewed Mar 1, 2021

alaypatel07 force-pushed the metrics branch from 1ffe78c to 8f30877 Compare March 1, 2021 18:50

alaypatel07 changed the title ~~cronjob_controller: add metrics for job creation and missed schedules~~ cronjob_controller: add metrics for job creation skew duration Mar 1, 2021

soltysh reviewed Mar 2, 2021

alaypatel07 force-pushed the metrics branch from 8f30877 to 69176c7 Compare March 2, 2021 13:38

soltysh approved these changes Mar 2, 2021

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 2, 2021

cronjob_controller: add metrics for job creation skew duration

08bc827

alaypatel07 force-pushed the metrics branch from 69176c7 to 08bc827 Compare March 2, 2021 15:39

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2021

k8s-ci-robot merged commit 94d5019 into kubernetes:master Mar 2, 2021

k8s-ci-robot added this to the v1.21 milestone Mar 2, 2021

alaypatel07 mentioned this pull request Apr 4, 2021

REQUEST: New membership for alaypatel07 kubernetes/org#2615

Closed

6 tasks
