apiserver add lease object count metric #97480

lingsamuel · 2020-12-23T04:10:13Z

Signed-off-by: Ling Samuel <lingsamuelgrace@gmail.com>

What type of PR is this?

/kind feature

What this PR does / why we need it:

Details: #96836

Which issue(s) this PR fixes:

Part of #96836

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add metric etcd_lease_object_counts to kube-apiserver to observe the maximum number of objects attached to a single etcd lease.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2020-12-23T04:10:21Z

Hi @lingsamuel. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lingsamuel · 2020-12-23T04:10:47Z

/sig scalability
/cc @mborsz @wojtek-t

fedebongio · 2021-01-05T21:13:40Z

/assign @wojtek-t @jingyih
/cc @jpbetz
/triage accepted

wojtek-t · 2021-01-07T19:02:01Z

staging/src/k8s.io/apiserver/pkg/server/options/etcd.go

 		"The time in seconds that each lease is reused. A lower value could avoid large number of objects reusing the same lease. Notice that a too small value may cause performance problems at storage layer.")
+
+	fs.Int64Var(&s.StorageConfig.LeaseManagerConfig.MaxObjectSize, "lease-max-object-size", s.StorageConfig.LeaseManagerConfig.MaxObjectSize,
+		"The max object size that each lease can attach. This option is under tuning, the default value is not a recommended value, just a placeholder currently. A lower value could avoid large number of objects reusing the same lease. Notice that a too small value may cause performance problems at storage layer.")


Once we add a flag, we shouldn't change its default if possible. I think we can come up with a reasonable default at this point.

I have never tested etcd lease performance before. What value would you recommend here?

I'm thinking maybe 5k? @mborsz - WDYT?

no idea :D

I suggest splitting this PR into two:

1. no behavioral change, only metrics + the log below to understand the current values (it should be easily accessible in our 5k tests)

2. actually implementing lease limiting with a value based on the above

Done. Once this gets merged and we have found a reasonable default value, update #96836 and I will submit a PR with --lease-max-object-count.

wojtek-t · 2021-01-11T11:36:35Z

staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go

+			Buckets:        []float64{10, 50, 100, 500, 1000, 2500, 5000},
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"ttl"},


I'm wondering about this one - if at some point we decide to use TTL in a smarter way, we may end up with a huge histogram.
Do we really need this?

Assume we have many small-TTL lease managers and one large-TTL lease manager. The small-TTL leases put a lot of data into buckets like 10 or 50, so without the label we can't tell how much the large-TTL lease manager contributes to the 10/50 buckets.
But I think if we have many lease managers with different TTLs, they may need different names, and the name would be a better label at that point. Currently we don't have a better field to use.

We could leave this problem here until we need to use multiple TTL lease managers.
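To make the trade-off in this thread concrete, here is a small sketch of a cumulative histogram keyed by a single label, using the bucket bounds from the diff. It is a standard-library stand-in, not the component-base metrics API; the type `labeledHistogram` and the label values "60" and "3600" are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// buckets mirrors the upper bounds shown in the diff.
var buckets = []float64{10, 50, 100, 500, 1000, 2500, 5000}

// labeledHistogram is a hypothetical stand-in for a histogram vector with one label.
type labeledHistogram struct {
	// label value -> cumulative count per bucket; the last slot is +Inf.
	counts map[string][]uint64
}

func newLabeledHistogram() *labeledHistogram {
	return &labeledHistogram{counts: map[string][]uint64{}}
}

// Observe increments every bucket whose upper bound is >= v (cumulative buckets).
func (h *labeledHistogram) Observe(label string, v float64) {
	c, ok := h.counts[label]
	if !ok {
		c = make([]uint64, len(buckets)+1)
		h.counts[label] = c
	}
	first := sort.SearchFloat64s(buckets, v) // first index with buckets[i] >= v
	for i := first; i < len(c); i++ {
		c[i]++
	}
}

// SeriesCount returns the number of bucket series: one per bucket
// (including +Inf) for every distinct label value seen so far.
func (h *labeledHistogram) SeriesCount() int {
	return len(h.counts) * (len(buckets) + 1)
}

func main() {
	h := newLabeledHistogram()
	h.Observe("60", 5)
	h.Observe("60", 700)
	h.Observe("3600", 69802)
	fmt.Println("bucket series:", h.SeriesCount())
	fmt.Println("ttl=3600 counts:", h.counts["3600"])
}
```

Each new label value adds a full set of buckets, which is the growth the reviewer is worried about; keeping the label separates the small-TTL and large-TTL observations, which is the author's argument for it.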

wojtek-t · 2021-01-11T11:38:38Z

/ok-to-test

wojtek-t · 2021-01-11T13:36:59Z

/lgtm
/approve

Thanks!

k8s-ci-robot · 2021-01-11T13:37:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lingsamuel, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-apiserver/OWNERS~~ [wojtek-t]
~~staging/src/k8s.io/apiserver/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mborsz · 2021-01-19T14:43:39Z

I took a look at one of the recent 5k runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1351213015185231872 to be exact):

kube-apiserver.log-20210118-1610997908.gz:I0118 17:09:47.651903      11 lease_manager.go:111] The object count for lease 39cd771677946de5 is large: 69802
kube-apiserver.log-20210118-1610997908.gz:I0118 17:10:47.657706      11 lease_manager.go:111] The object count for lease 39cd771677955670 is large: 52340
kube-apiserver.log-20210118-1610997908.gz:I0118 17:11:47.664978      11 lease_manager.go:111] The object count for lease 39cd77167795bb81 is large: 19864
kube-apiserver.log-20210118-1610997908.gz:I0118 17:12:47.683576      11 lease_manager.go:111] The object count for lease 39cd771677961958 is large: 18631
kube-apiserver.log-20210118-1610997908.gz:I0118 17:13:47.704202      11 lease_manager.go:111] The object count for lease 39cd771677964def is large: 10401
kube-apiserver.log-20210118-1610997908.gz:I0118 17:31:58.161651      11 lease_manager.go:111] The object count for lease 39cd77167796a0ba is large: 18819
kube-apiserver.log-20210118-1610997908.gz:I0118 17:32:58.171436      11 lease_manager.go:111] The object count for lease 39cd77167796db2d is large: 14581
kube-apiserver.log-20210118-1610997908.gz:I0118 17:33:58.173026      11 lease_manager.go:111] The object count for lease 39cd771677970f77 is large: 13124
kube-apiserver.log-20210118-1610997908.gz:I0118 17:34:58.175166      11 lease_manager.go:111] The object count for lease 39cd771677974985 is large: 14612
kube-apiserver.log-20210118-1610997908.gz:I0118 17:35:58.180260      11 lease_manager.go:111] The object count for lease 39cd7716779785c3 is large: 15209
(...)

So overall, the first two leases are quite large (50k-70k objects); the others are around 10k-15k.

The clusterloader (which creates the workload on the machine) started at '17:15:53.168001', which means that the first two large leases were created by cluster initialization logic and not by the tested workload.

In my opinion, we should first try tuning '--lease-reuse-duration-seconds' to spread objects across more leases within one minute (e.g. assuming an even distribution of events, setting it to 10s should split the first large lease 6x, down to ~10k, which should be reasonable).
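The estimate above is simple division, sketched here for the first few counts from the log. The helper `perLeaseEstimate` is hypothetical, the 60-second window is taken from the one-minute spacing of the log lines, and the even-distribution assumption is the commenter's own.

```go
package main

import "fmt"

// perLeaseEstimate returns the expected object count per lease if writes
// that previously shared one lease for oldWindowSec seconds are instead
// spread evenly over leases that are each reused for newWindowSec seconds.
func perLeaseEstimate(objects, oldWindowSec, newWindowSec int64) int64 {
	if oldWindowSec <= 0 || newWindowSec <= 0 {
		return objects
	}
	splits := oldWindowSec / newWindowSec
	if splits < 1 {
		splits = 1
	}
	return objects / splits
}

func main() {
	// Object counts copied from the first three log lines.
	for _, n := range []int64{69802, 52340, 19864} {
		fmt.Printf("%d objects over 60s -> about %d per 10s lease\n",
			n, perLeaseEstimate(n, 60, 10))
	}
}
```

For the largest lease this gives roughly 11.6k objects per lease, in line with the "~10k" figure, though bursty writes during cluster initialization could make the real split less even.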

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 23, 2020

k8s-ci-robot requested review from mborsz and wojtek-t December 23, 2020 04:10

k8s-ci-robot requested review from hongchaodeng and ingvagabund December 23, 2020 04:11

lingsamuel force-pushed the etcd-lease-max-size branch 3 times, most recently from b73e3e9 to 61f3ca8 Compare December 23, 2020 04:22

ingvagabund removed their request for review January 5, 2021 11:27

k8s-ci-robot assigned jingyih and wojtek-t Jan 5, 2021

k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 5, 2021

k8s-ci-robot requested a review from jpbetz January 5, 2021 21:13

k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 5, 2021

wojtek-t reviewed Jan 7, 2021

lingsamuel requested a review from mborsz January 11, 2021 08:53

lingsamuel changed the title ~~apiserver add --lease-max-object-size and metric~~ apiserver add --lease-max-object-count and metric Jan 11, 2021

lingsamuel force-pushed the etcd-lease-max-size branch 4 times, most recently from ddaf690 to 9e8e0bd Compare January 11, 2021 09:17

lingsamuel changed the title ~~apiserver add --lease-max-object-count and metric~~ apiserver add lease object count metric Jan 11, 2021

lingsamuel force-pushed the etcd-lease-max-size branch 2 times, most recently from 567a53e to 05d0f85 Compare January 11, 2021 09:44

wojtek-t reviewed Jan 11, 2021

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 11, 2021

apiserver add metric etcd_lease_object_counts

7e9fe39

Signed-off-by: Ling Samuel <lingsamuelgrace@gmail.com>

lingsamuel force-pushed the etcd-lease-max-size branch from 05d0f85 to 7e9fe39 Compare January 11, 2021 13:22

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2021

k8s-ci-robot merged commit e054aa2 into kubernetes:master Jan 11, 2021

k8s-ci-robot added this to the v1.21 milestone Jan 11, 2021

github-actions bot mentioned this pull request Jan 12, 2021

Week Ending January 10, 2021 dev-obs/actus#319

Open

lingsamuel mentioned this pull request Jan 21, 2021

lease manager limit max objects attached to a lease #98257

Merged

mborsz mentioned this pull request Mar 10, 2021

[1.20] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836) #100084

Merged

This was referenced Mar 22, 2021

[1.19] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836) #100450

Merged

[1.18] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836) #100452

Merged

k8s-ci-robot added a commit that referenced this pull request Apr 8, 2021

Merge pull request #100450 from mborsz/automated-cherry-pick-of-#97009-…

75aca05

…#97480-#98257-upstream-release-1.19 [1.19] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836)

k8s-ci-robot added a commit that referenced this pull request Apr 8, 2021

Merge pull request #100084 from mborsz/automated-cherry-pick-of-#97009-…

ca5eb11

…#97480-#98257-upstream-release-1.20 [1.20] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836)

k8s-ci-robot added a commit that referenced this pull request Apr 8, 2021

Merge pull request #100452 from mborsz/automated-cherry-pick-of-#97009-…

8e67804

…#97480-#98257-upstream-release-1.18 [1.18] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836)
