
Try yet again to add metrics about LIST handling #104983

Merged
merged 1 commit on Sep 22, 2021

Conversation

@MikeSpreitzer (Member)

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds metrics describing how LIST requests are handled.
Some LIST requests consume far more CPU per unit of execution time than most requests, which makes concurrency limiters (the max-in-flight filter and the API Priority and Fairness (APF) filter) not very effective at managing apiserver CPU usage.
We are modifying the APF filter to recognize that some LIST requests are exceptionally costly, but:

  1. we need data to tune this, and
  2. we want to be able to get useful data from customers in the field.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is an alternative to #104723 and is built on #104981. That is, review of #104723 suggested doing the last bit of plumbing in a separate PR, which I just created (#104981). The PR at hand includes #104981 and will be rebased onto master once #104981 merges.

Does this PR introduce a user-facing change?

The kube-apiserver's Prometheus metrics have been extended with new metrics that describe the costs of handling LIST requests. They are listed below; a sketch of how one such counter might be declared follows the list.
- *apiserver_cache_list_total*: Counter of LIST requests served from watch cache, broken down by resource_prefix and index_name
- *apiserver_cache_list_fetched_objects_total*: Counter of objects read from watch cache in the course of serving a LIST request, broken down by resource_prefix and index_name
- *apiserver_cache_list_evaluated_objects_total*: Counter of objects tested in the course of serving a LIST request from watch cache, broken down by resource_prefix
- *apiserver_cache_list_returned_objects_total*: Counter of objects returned for a LIST request from watch cache, broken down by resource_prefix
- *apiserver_storage_list_total*: Counter of LIST requests served from etcd, broken down by resource
- *apiserver_storage_list_fetched_objects_total*: Counter of objects read from etcd in the course of serving a LIST request, broken down by resource
- *apiserver_storage_list_evaluated_objects_total*: Counter of objects tested in the course of serving a LIST request from etcd, broken down by resource
- *apiserver_storage_list_returned_objects_total*: Counter of objects returned for a LIST request from etcd, broken down by resource
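
For illustration, a counter like apiserver_storage_list_fetched_objects_total is declared with the component-base metrics wrappers in roughly the following shape. This is a minimal sketch modeled on the cache-side declarations quoted later in this review; the package, variable, and helper-function names are placeholders, not the PR's exact code.

package storagemetrics // illustrative package name

import (
    compbasemetrics "k8s.io/component-base/metrics"
    "k8s.io/component-base/metrics/legacyregistry"
)

// Counter of objects read from etcd while serving LIST requests, labeled by resource.
var etcdListFetched = compbasemetrics.NewCounterVec(
    &compbasemetrics.CounterOpts{
        Name:           "apiserver_storage_list_fetched_objects_total",
        Help:           "Number of objects read from etcd in the course of serving a LIST request",
        StabilityLevel: compbasemetrics.ALPHA,
    },
    []string{"resource"},
)

func init() {
    legacyregistry.MustRegister(etcdListFetched)
}

// recordListFetch (hypothetical helper) would be called from the etcd storage
// layer after a range read, adding the number of objects fetched for the request.
func recordListFetch(resource string, numFetched int) {
    etcdListFetched.WithLabelValues(resource).Add(float64(numFetched))
}

Dividing the per-interval increase of such a counter by that of apiserver_storage_list_total gives the average number of objects fetched from etcd per LIST request for a given resource.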

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 13, 2021
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 13, 2021
@MikeSpreitzer (Member Author)

@kubernetes/sig-api-machinery-misc
/sig instrumentation
/cc @wojtek-t
/cc @logicalhan

@MikeSpreitzer (Member Author)

/cc @dgrisonnet

@MikeSpreitzer (Member Author)

To quantify the volume of metrics added, I used local-up-cluster and took a sample.

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ kubectl get --raw /metrics > /tmp/m.txt

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ wc /tmp/m.txt 
  12284   26453 1770029 /tmp/m.txt

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_storage_list_ /tmp/m.txt
204

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_cache_list_ /tmp/m.txt
198

@wojtek-t (Member)

/assign

@dgrisonnet
/assign @jan--f @ehashman

@k8s-ci-robot (Contributor)

@wojtek-t: GitHub didn't allow me to assign the following users: jan--f.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign

@dgrisonnet
/assign @jan--f @ehashman

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -716,9 +720,10 @@ func (c *Cacher) List(ctx context.Context, key string, opts storage.ListOptions,
if listVal.Kind() != reflect.Slice {
return fmt.Errorf("need a pointer to slice, got %v", listVal.Kind())
}
var numEvals int
Member:

I'm still not convinced we want to have that for cacher - this will always be equal to the number fetched.

So I would personally remove that metric.

@@ -724,13 +727,22 @@ func (s *store) List(ctx context.Context, key string, opts storage.ListOptions,
var lastKey []byte
var hasMore bool
var getResp *clientv3.GetResponse
var numRangeQueries int
Member:

Please remove - it's not used anywhere

@MikeSpreitzer (Member Author)

The force-push to 5759a2fd672 fixes the oversights noted above and adds the intended caching of the string form of the schema.GroupResource in the store struct.
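
(The idea of that caching, as a minimal sketch; the package, field, and constructor names here are illustrative, not the exact code in the PR:)

package etcd3sketch // illustrative package name

import "k8s.io/apimachinery/pkg/runtime/schema"

// store is a simplified stand-in for the etcd3 store struct touched by this PR.
type store struct {
    groupResource       schema.GroupResource
    groupResourceString string // cached once so metrics recording does not call String() on every request
}

func newStore(gr schema.GroupResource) *store {
    return &store{
        groupResource:       gr,
        groupResourceString: gr.String(),
    }
}

// When recording a LIST metric, the cached string would be used as the "resource" label value.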

@fedebongio (Contributor)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 14, 2021
@wojtek-t (Member) left a comment:

I'm fine with this PR modulo one minor comment that I added.

But I would like to hear from sig-instrumentation folks before approving.
@dgrisonnet @logicalhan @ehashman - thoughts?

@dgrisonnet (Member) left a comment:

One concern from my side regarding the index label.

Comment on lines +35 to +49
listCacheCount = compbasemetrics.NewCounterVec(
&compbasemetrics.CounterOpts{
Name: "apiserver_cache_list_total",
Help: "Number of LIST requests served from watch cache",
StabilityLevel: compbasemetrics.ALPHA,
},
[]string{"resource_prefix", "index"},
)
listCacheNumFetched = compbasemetrics.NewCounterVec(
&compbasemetrics.CounterOpts{
Name: "apiserver_cache_list_fetched_objects_total",
Help: "Number of objects read from watch cache in the course of serving a LIST request",
StabilityLevel: compbasemetrics.ALPHA,
},
[]string{"resource_prefix", "index"},
Member:

What would be the use case for the index label? I can see how these metrics can be useful per resource, but I don't think that having the index used to filter the list is useful to the consumers of these metrics. Also, without the actual value used when matching the index, it is difficult to come up with actual conclusions since based on the value, different objects will be fetched and returned.

Member Author:

The use case is to understand the impact of indexing on the costs of handling LIST requests. Just knowing the index name lets us know which indices (if any) are proving helpful.

Member Author:

I would love to have histograms instead of counters, but that was rejected.

Member Author:

I have added a commit that dares to add one histogram. Perhaps that is not too much to ask?

Member:

Considering the insights you are trying to get from the index label, it makes sense to me to keep it.

Add metrics that illuminate the costs of handling LIST requests.
@MikeSpreitzer (Member Author)

The force-push to bf42429 adds the comment suggested by @wojtek-t above.

@MikeSpreitzer (Member Author)

BTW, I just noticed that storage/cacher already had some metrics. I missed them because they are defined directly in that package rather than in a distinct metrics package.

@MikeSpreitzer (Member Author)

With the latest revision and hack/local-up-cluster.sh:

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ kubectl get --raw /metrics > /tmp/m.txt
mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ wc /tmp/m.txt
  13659   29160 2000336 /tmp/m.txt
mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_cache_list_ /tmp/m.txt
531
mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_storage_list_ /tmp/m.txt 
204

* the metric stability policy.
*/
var (
listCacheNumFetched = compbasemetrics.NewHistogramVec(
Member:

This looks a bit strange to me. I mean - not the fact that you want a histogram, but the fact that we have 1 histogram and N counters. It just looks very inconsistent to me.

If you really believe we need a histogram (I'm personally not convinced - I think the per-time-interval delta is what we need here, which we get from a counter), then my question is why counters are enough in the other cases.

I think that "either all or nothing" is the way to proceed.

Member Author:

I break usage scenarios down into two broad categories: profiling by developers, and experience by real users in the wild. For the former, it is relatively easy to get the information we want from controlled experiments and count metrics. For metrics collected in the wild, there was only one traffic mix in play when a problem happened, so it is more difficult to draw conclusions. Not completely impossible - one can compare with metrics from more normal situations - but more difficult. So I do not see it as a black-and-white issue, rather a matter of degree. And there is the contrary consideration of the volume of metrics, which is also a matter of degree. So I see our task here as balancing two matters of degree.

Member:

I am also not convinced that having histograms is meaningful here, since they lack precision compared to what we had before with counters. To me, with these metrics you are interested in knowing exactly how many objects are fetched from the cache over a certain time frame, per resource and index, in order to measure the efficiency of indexing - and that is something you won't know from the histograms.

Member Author:

The histogram gives strictly more information than the two counters it replaced. Those two counters have the same content as the _sum and _count time series in the histogram, and the buckets add a rough breakdown.
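
(For concreteness, a minimal sketch of the histogram variant with the component-base metrics wrappers, reusing the listCacheNumFetched name from the quoted diff; the metric name, bucket boundaries, and package name here are illustrative assumptions, not the exact code in the commit.)

package cachermetrics // illustrative package name

import (
    compbasemetrics "k8s.io/component-base/metrics"
    "k8s.io/component-base/metrics/legacyregistry"
)

// Buckets are illustrative; the commit's actual boundaries may differ.
var listCacheNumFetched = compbasemetrics.NewHistogramVec(
    &compbasemetrics.HistogramOpts{
        Name:           "apiserver_cache_list_fetched_objects",
        Help:           "Number of objects read from watch cache in the course of serving a LIST request",
        Buckets:        []float64{10, 100, 1000, 10000, 100000},
        StabilityLevel: compbasemetrics.ALPHA,
    },
    []string{"resource_prefix", "index"},
)

func init() {
    legacyregistry.MustRegister(listCacheNumFetched)
}

// One Observe call per LIST served from the watch cache:
//   listCacheNumFetched.WithLabelValues(prefix, indexName).Observe(float64(numFetched))
// yields three families of series:
//   ..._count  - one increment per request (what apiserver_cache_list_total counted)
//   ..._sum    - total objects fetched (what apiserver_cache_list_fetched_objects_total counted)
//   ..._bucket - the additional rough breakdown of per-request fetch sizes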

Member:

> For metrics collected in the wild, there was only one traffic mix in play when a problem happened, so it is more difficult to draw conclusions. Not completely impossible - one can compare with metrics from more normal situations - but more difficult. So I do not see it as a black-and-white issue, rather a matter of degree.

Thinking more about this...
My question is: how would one use that additional information from the histogram? Users can quickly see whether there were fewer larger lists or more smaller ones. And what do they get from it? How would they use that information?
So I don't think it's just a matter of degree - it's a matter of how to use those metrics.
So unless you have a clear case in mind for how exactly to use that additional information (note that we already have metrics for the number of particular calls, so users would already know "the average size of a request") - I'm actually against this histogram and I would revert to the previous version.

Member:

> The information in the buckets gives a rough idea of how effective the indices are.

This isn't particularly useful information, because we have all the data to answer this question already.
There are exactly 4 indices:

  • pods by spec.NodeName [we know how many pods are on a given node]
  • secrets by metadata.Name [always exactly 1]
  • configmaps by metadata.Name [always exactly 1]
  • nodes by metadata.Name [always exactly 1]

Users have no way to influence that - you need to update k8s code and recompile to change it. So if someone is doing that, they can also add metrics if they really need them.
Also - the index is a label in your metric.

> The buckets also give rough information on how the load is divided among resources.

resource_prefix is a label - so we have that information independently of whether this is a histogram or not.

So I actually don't buy the above argument.

Member Author:

Oh, interesting, I did not know those specifics about the actually existing indices.

On the second part, sorry, I botched what I was trying to say. The idea is that for a given resource, there can be a variety of observations depending on other factors (such as populations in namespaces). The buckets give some rough information about the distribution of that variety.

Member:

> On the second part, sorry, I botched what I was trying to say. The idea is that for a given resource, there can be a variety of observations depending on other factors (such as populations in namespaces). The buckets give some rough information about the distribution of that variety.

Sure - but the question is how we can use that information. Let's say that instead of knowing that there were 100,000 object fetches, I now know that there were 100 of size 100 and 9 of size 10,000. What can I use that information for?

Member Author:

It can get useful in combination with other information. For one example, if I am investigating a situation with a few spikes in LIST latency for a particular resource, that information helps me narrow down my search for the cause(s). For another example, if I am trying to understand the distribution of LIST latencies for a given resource for the sake of tuning our coefficients in API Priority and Fairness, knowing the distribution of numbers fetched helps with that.

Member:

Those are somewhat hypothetical examples to me. Metrics indeed help pinpoint the approximate area of a problem, but it's really hard to debug something purely from metrics. And if we pinpoint the latency of LIST requests for a particular resource at a particular point in time, we have logs (and traces in the logs) that would allow us to actually debug it.

For the tuning example - what you described sounds a bit like an attempt to reverse-engineer, from the metrics, the expected characteristics introduced by the code. Sure - we can do that, but is that really what we want?
Also - can we really do it if we have only partial data for it (only the number of fetched objects, and only for lists served from the cache)? I would be much closer to buying that if all 4 things were histograms [though I'm still not convinced we need that].

@MikeSpreitzer (Member Author)

BTW, I am sure the name of the histogram metric is wrong; I am waiting for someone to tell me what the conventions imply the name should be (or remind me where to find the relevant conventions).

@MikeSpreitzer (Member Author)

The force-push to bf42429 undoes the histogram addition.

@wojtek-t (Member)

OK - based on @dgrisonnet's comment above that his only concern was about the index label, which was clarified in the comment thread, I'm going to approve, but I am adding a hold to give other people a chance to take another look.
I'm going to unhold on Wednesday morning (my time, so before PST Wednesday).
@logicalhan @ehashman ^^

/lgtm
/approve

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 20, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2021
@dgrisonnet (Member)

Looks good from my side 👍

/lgtm

@wojtek-t (Member)

With the above lgtm and timeout, I'm unholding.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 22, 2021
@MikeSpreitzer (Member Author)

/retest
