Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apiserver add lease object count metric #97480

Merged
merged 1 commit into from Jan 11, 2021

Conversation

lingsamuel
Copy link
Contributor

@lingsamuel lingsamuel commented Dec 23, 2020

Signed-off-by: Ling Samuel lingsamuelgrace@gmail.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

Details: #96836

Which issue(s) this PR fixes:

Part of #96836

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add metric etcd_lease_object_counts for kube-apiserver to observe max objects attached to a single etcd lease.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 23, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @lingsamuel. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 23, 2020
@lingsamuel
Copy link
Contributor Author

/sig scalability
/cc @mborsz @wojtek-t

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 23, 2020
@lingsamuel lingsamuel force-pushed the etcd-lease-max-size branch 3 times, most recently from b73e3e9 to 61f3ca8 Compare December 23, 2020 04:22
@ingvagabund ingvagabund removed their request for review January 5, 2021 11:27
@fedebongio
Copy link
Contributor

/assign @wojtek-t @jingyih
/cc @jpbetz
/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jan 5, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 5, 2021
staging/src/k8s.io/apiserver/pkg/server/options/etcd.go Outdated Show resolved Hide resolved
"The time in seconds that each lease is reused. A lower value could avoid large number of objects reusing the same lease. Notice that a too small value may cause performance problems at storage layer.")

fs.Int64Var(&s.StorageConfig.LeaseManagerConfig.MaxObjectSize, "lease-max-object-size", s.StorageConfig.LeaseManagerConfig.MaxObjectSize,
"The max object size that each lease can attach. This option is under tuning, the default value is not a recommended value, just a placeholder currently. A lower value could avoid large number of objects reusing the same lease. Notice that a too small value may cause performance problems at storage layer.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we make a flag, we shouldn't change the default if possible. I think we will can come up with a reasonable default at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never tested etcd lease performance before. What's your recommended value here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm think maybe 5k? @mborsz - WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no idea :D

I suggest splitting this PR into two:

  • no behavioral change, only metrics + log below to understand the current values (it should be easily accessible in our 5k tests)
  • actual implementing lease limiting with value based on the above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Once this get merged and we found a reasonable default value, update #96836 and I will submit a PR with --lease-max-object-count

@lingsamuel lingsamuel changed the title apiserver add --lease-max-object-size and metric apiserver add --lease-max-object-count and metric Jan 11, 2021
@lingsamuel lingsamuel force-pushed the etcd-lease-max-size branch 4 times, most recently from ddaf690 to 9e8e0bd Compare January 11, 2021 09:17
@lingsamuel lingsamuel changed the title apiserver add --lease-max-object-count and metric apiserver add lease object count metric Jan 11, 2021
@lingsamuel lingsamuel force-pushed the etcd-lease-max-size branch 2 times, most recently from 567a53e to 05d0f85 Compare January 11, 2021 09:44
Buckets: []float64{10, 50, 100, 500, 1000, 2500, 5000},
StabilityLevel: compbasemetrics.ALPHA,
},
[]string{"ttl"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about this one - if at some point we decide to use TTL in a smarter way, we may end-up with huge histogram.
Do we really need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume we have many small TTL lease managers and one large TTL lease manager. Small TTL leases make buckets like 10 or 50 contains large data, then we can't know how the large TTL lease manager contributes to the 10/50 buckets.
But I think if we have many lease managers with different TTLs, they may need to have a different name. The name would be a better label at that time. Currently we don't have better field to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could leave this problem here until we need to use multiple TTL lease managers.

@wojtek-t
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 11, 2021
Signed-off-by: Ling Samuel <lingsamuelgrace@gmail.com>
@wojtek-t
Copy link
Member

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lingsamuel, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2021
@k8s-ci-robot k8s-ci-robot merged commit e054aa2 into kubernetes:master Jan 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 11, 2021
@mborsz
Copy link
Member

mborsz commented Jan 19, 2021

I took a look at one of recent 5k runs (https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1351213015185231872 to be exact)

kube-apiserver.log-20210118-1610997908.gz:I0118 17:09:47.651903      11 lease_manager.go:111] The object count for lease 39cd771677946de5 is large: 69802
kube-apiserver.log-20210118-1610997908.gz:I0118 17:10:47.657706      11 lease_manager.go:111] The object count for lease 39cd771677955670 is large: 52340
kube-apiserver.log-20210118-1610997908.gz:I0118 17:11:47.664978      11 lease_manager.go:111] The object count for lease 39cd77167795bb81 is large: 19864
kube-apiserver.log-20210118-1610997908.gz:I0118 17:12:47.683576      11 lease_manager.go:111] The object count for lease 39cd771677961958 is large: 18631
kube-apiserver.log-20210118-1610997908.gz:I0118 17:13:47.704202      11 lease_manager.go:111] The object count for lease 39cd771677964def is large: 10401
kube-apiserver.log-20210118-1610997908.gz:I0118 17:31:58.161651      11 lease_manager.go:111] The object count for lease 39cd77167796a0ba is large: 18819
kube-apiserver.log-20210118-1610997908.gz:I0118 17:32:58.171436      11 lease_manager.go:111] The object count for lease 39cd77167796db2d is large: 14581
kube-apiserver.log-20210118-1610997908.gz:I0118 17:33:58.173026      11 lease_manager.go:111] The object count for lease 39cd771677970f77 is large: 13124
kube-apiserver.log-20210118-1610997908.gz:I0118 17:34:58.175166      11 lease_manager.go:111] The object count for lease 39cd771677974985 is large: 14612
kube-apiserver.log-20210118-1610997908.gz:I0118 17:35:58.180260      11 lease_manager.go:111] The object count for lease 39cd7716779785c3 is large: 15209
(...)

So overall, first two leases are quite large (50k-70k objects), other are around 10k-15k.

The clusterloader (the creating workload on the machine) has started at '17:15:53.168001', so this means that two first large leases have been created by cluster initialization logic and not tested workload.

In my opinion we should try playing with '--lease-reuse-duration-seconds' first to try to spread leases among one minute (e.g. assuming even distribution of events, setting 10s should split first large lease 6x, so to ~10k which should be reasonable).

k8s-ci-robot added a commit that referenced this pull request Apr 8, 2021
…#97480-#98257-upstream-release-1.19

[1.19] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836)
k8s-ci-robot added a commit that referenced this pull request Apr 8, 2021
…#97480-#98257-upstream-release-1.20

[1.20] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836)
k8s-ci-robot added a commit that referenced this pull request Apr 8, 2021
…#97480-#98257-upstream-release-1.18

[1.18] Automated cherry pick of fixes for "large leases overload event etcd" issue (96836)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants