
Try yet again to add metrics about LIST handling #104983

Merged
merged 1 commit on Sep 22, 2021

Conversation

@MikeSpreitzer (Member)

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds metrics describing how LIST requests are handled.
Some LIST requests consume far more CPU per unit of execution time than most requests, which makes concurrency limiters (the max-in-flight filter and the API Priority and Fairness (APF) filter) not very effective at managing apiserver CPU usage.
We are modifying the APF filter to recognize that some LIST requests are exceptionally costly, but:

  1. we need data to tune this, and
  2. we want to be able to get useful data from customers in the field.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is an alternative to #104723 and is built on #104981. That is, review of #104723 suggested doing the last bit of plumbing in a separate PR, which I just created (#104981). The PR at hand includes #104981 and will be rebased onto master once #104981 merges.

Does this PR introduce a user-facing change?

The kube-apiserver's Prometheus metrics have been extended with new metrics that describe the costs of handling LIST requests. They are listed below; a sketch of how one such counter might be declared follows the list.
- *apiserver_cache_list_total*: Counter of LIST requests served from watch cache, broken down by resource_prefix and index_name
- *apiserver_cache_list_fetched_objects_total*: Counter of objects read from watch cache in the course of serving a LIST request, broken down by resource_prefix and index_name
- *apiserver_cache_list_evaluated_objects_total*: Counter of objects tested in the course of serving a LIST request from watch cache, broken down by resource_prefix
- *apiserver_cache_list_returned_objects_total*: Counter of objects returned for a LIST request from watch cache, broken down by resource_prefix
- *apiserver_storage_list_total*: Counter of LIST requests served from etcd, broken down by resource
- *apiserver_storage_list_fetched_objects_total*: Counter of objects read from etcd in the course of serving a LIST request, broken down by resource
- *apiserver_storage_list_evaluated_objects_total*: Counter of objects tested in the course of serving a LIST request from etcd, broken down by resource
- *apiserver_storage_list_returned_objects_total*: Counter of objects returned for a LIST request from etcd, broken down by resource
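
For illustration, a counter like apiserver_storage_list_fetched_objects_total is declared with the component-base metrics wrappers in roughly the following shape. This is a minimal sketch modeled on the cache-side declarations quoted later in this review; the package, variable, and helper-function names are placeholders, not the PR's exact code.

package storagemetrics // illustrative package name

import (
    compbasemetrics "k8s.io/component-base/metrics"
    "k8s.io/component-base/metrics/legacyregistry"
)

// Counter of objects read from etcd while serving LIST requests, labeled by resource.
var etcdListFetched = compbasemetrics.NewCounterVec(
    &compbasemetrics.CounterOpts{
        Name:           "apiserver_storage_list_fetched_objects_total",
        Help:           "Number of objects read from etcd in the course of serving a LIST request",
        StabilityLevel: compbasemetrics.ALPHA,
    },
    []string{"resource"},
)

func init() {
    legacyregistry.MustRegister(etcdListFetched)
}

// recordListFetch (hypothetical helper) would be called from the etcd storage
// layer after a range read, adding the number of objects fetched for the request.
func recordListFetch(resource string, numFetched int) {
    etcdListFetched.WithLabelValues(resource).Add(float64(numFetched))
}

Dividing the per-interval increase of such a counter by that of apiserver_storage_list_total gives the average number of objects fetched from etcd per LIST request for a given resource.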

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 13, 2021
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 13, 2021
@MikeSpreitzer (Member Author)

@kubernetes/sig-api-machinery-misc
/sig instrumentation
/cc @wojtek-t
/cc @logicalhan

@MikeSpreitzer (Member Author)

/cc @dgrisonnet

@MikeSpreitzer (Member Author)

To quantify the volume of metrics added, I used local-up-cluster and took a sample.

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ kubectl get --raw /metrics > /tmp/m.txt

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ wc /tmp/m.txt 
  12284   26453 1770029 /tmp/m.txt

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_storage_list_ /tmp/m.txt
204

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_cache_list_ /tmp/m.txt
198

@wojtek-t (Member)

/assign

@dgrisonnet
/assign @jan--f @ehashman

@k8s-ci-robot (Contributor)

@wojtek-t: GitHub didn't allow me to assign the following users: jan--f.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign

@dgrisonnet
/assign @jan--f @ehashman

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -716,9 +720,10 @@ func (c *Cacher) List(ctx context.Context, key string, opts storage.ListOptions,
if listVal.Kind() != reflect.Slice {
return fmt.Errorf("need a pointer to slice, got %v", listVal.Kind())
}
var numEvals int
Member:

I'm still not convinced we want to have that for cacher - this will always be equal to the number fetched.

So I would personally remove that metric.

@@ -724,13 +727,22 @@ func (s *store) List(ctx context.Context, key string, opts storage.ListOptions,
var lastKey []byte
var hasMore bool
var getResp *clientv3.GetResponse
var numRangeQueries int
Member:

Please remove - it's not used anywhere

@MikeSpreitzer (Member Author)

The force-push to 5759a2fd672 fixes the oversights noted above and adds the intended caching of the string form of the schema.GroupResource in the store struct.
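
(The idea of that caching, as a minimal sketch; the package, field, and constructor names here are illustrative, not the exact code in the PR:)

package etcd3sketch // illustrative package name

import "k8s.io/apimachinery/pkg/runtime/schema"

// store is a simplified stand-in for the etcd3 store struct touched by this PR.
type store struct {
    groupResource       schema.GroupResource
    groupResourceString string // cached once so metrics recording does not call String() on every request
}

func newStore(gr schema.GroupResource) *store {
    return &store{
        groupResource:       gr,
        groupResourceString: gr.String(),
    }
}

// When recording a LIST metric, the cached string would be used as the "resource" label value.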

@fedebongio (Contributor)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 14, 2021
@wojtek-t (Member) left a comment:

I'm fine with this PR modulo one minor comment that I added.

But I would like to hear from sig-instrumentation folks before approving.
@dgrisonnet @logicalhan @ehashman - thoughts?

@dgrisonnet (Member) left a comment:

One concern from my side regarding the index label.

Comment on lines +35 to +49
listCacheCount = compbasemetrics.NewCounterVec(
&compbasemetrics.CounterOpts{
Name: "apiserver_cache_list_total",
Help: "Number of LIST requests served from watch cache",
StabilityLevel: compbasemetrics.ALPHA,
},
[]string{"resource_prefix", "index"},
)
listCacheNumFetched = compbasemetrics.NewCounterVec(
&compbasemetrics.CounterOpts{
Name: "apiserver_cache_list_fetched_objects_total",
Help: "Number of objects read from watch cache in the course of serving a LIST request",
StabilityLevel: compbasemetrics.ALPHA,
},
[]string{"resource_prefix", "index"},
Member:

What would be the use case for the index label? I can see how these metrics can be useful per resource, but I don't think that having the index used to filter the list is useful to the consumers of these metrics. Also, without the actual value used when matching the index, it is difficult to come up with actual conclusions since based on the value, different objects will be fetched and returned.

Member Author:

The use case is to understand the impact of indexing on the costs of handling LIST requests. Just knowing the index name lets us know which indices (if any) are proving helpful.

Member Author:

I would love to have histograms instead of counters, but that was rejected.

Member Author:

I have added a commit that dares to add one histogram. Perhaps that is not too much to ask?

Member:

Considering the insights you are trying to get from the index label, it makes sense to me to keep it.

Add metrics that illuminate the costs of handling LIST requests.
@MikeSpreitzer (Member Author)

The force-push to bf42429 adds the comment suggested by @wojtek-t above.

@MikeSpreitzer (Member Author)

BTW, I just noticed that storage/cacher already had some metrics. I missed them because they are defined directly in that package rather than in a distinct metrics package.

@MikeSpreitzer (Member Author)

With the latest revision and hack/local-up-cluster.sh:

mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ kubectl get --raw /metrics > /tmp/m.txt
mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ wc /tmp/m.txt
  13659   29160 2000336 /tmp/m.txt
mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_cache_list_ /tmp/m.txt
531
mspreitz@ubu20:~/go/src/k8s.io/kubernetes$ grep -c apiserver_storage_list_ /tmp/m.txt 
204

* the metric stability policy.
*/
var (
listCacheNumFetched = compbasemetrics.NewHistogramVec(
Member:

This looks a bit strange to me. I mean - not the fact that you want a histogram, but the fact that we have 1 histogram and N counters. It just looks very inconsistent to me.

If you really believe we need a histogram (I'm personally not convinced - I think the per-time-interval delta is what we need here, which we get from a counter), then my question is why counters are enough in the other cases.

I think that "either all or nothing" is the way to proceed.

Member Author:

I break usage scenarios down into two broad categories: profiling by developers, and experience by real users in the wild. For the former, it is relatively easy to get the information we want from controlled experiments and count metrics. For metrics collected in the wild, there was only one traffic mix in play when a problem happened, so it is more difficult to draw conclusions. Not completely impossible - one can compare with metrics from more normal situations - but more difficult. So I do not see it as a black-and-white issue, rather a matter of degree. And there is the contrary consideration of the volume of metrics, which is also a matter of degree. So I see our task here as balancing two matters of degree.

Member:

I am also not convinced that having histograms is meaningful here, since they lack precision compared to what we had before with counters. To me, with these metrics you are interested in knowing exactly how many objects are fetched from the cache over a certain time frame, per resource and index, in order to measure the efficiency of indexing - and that is something you won't know from the histograms.

Member Author:

The histogram gives strictly more information than the two counters it replaced. Those two counters have the same content as the _sum and _count time series in the histogram, and the buckets add a rough breakdown.
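
(For concreteness, a minimal sketch of the histogram variant with the component-base metrics wrappers, reusing the listCacheNumFetched name from the quoted diff; the metric name, bucket boundaries, and package name here are illustrative assumptions, not the exact code in the commit.)

package cachermetrics // illustrative package name

import (
    compbasemetrics "k8s.io/component-base/metrics"
    "k8s.io/component-base/metrics/legacyregistry"
)

// Buckets are illustrative; the commit's actual boundaries may differ.
var listCacheNumFetched = compbasemetrics.NewHistogramVec(
    &compbasemetrics.HistogramOpts{
        Name:           "apiserver_cache_list_fetched_objects",
        Help:           "Number of objects read from watch cache in the course of serving a LIST request",
        Buckets:        []float64{10, 100, 1000, 10000, 100000},
        StabilityLevel: compbasemetrics.ALPHA,
    },
    []string{"resource_prefix", "index"},
)

func init() {
    legacyregistry.MustRegister(listCacheNumFetched)
}

// One Observe call per LIST served from the watch cache:
//   listCacheNumFetched.WithLabelValues(prefix, indexName).Observe(float64(numFetched))
// yields three families of series:
//   ..._count  - one increment per request (what apiserver_cache_list_total counted)
//   ..._sum    - total objects fetched (what apiserver_cache_list_fetched_objects_total counted)
//   ..._bucket - the additional rough breakdown of per-request fetch sizes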

Member:

> For metrics collected in the wild, there was only one traffic mix in play when a problem happened, so it is more difficult to draw conclusions. Not completely impossible - one can compare with metrics from more normal situations - but more difficult. So I do not see it as a black-and-white issue, rather a matter of degree.

Thinking more about this...
My question is: how would one use that additional information from the histogram? Users can quickly see whether there were fewer larger lists or more smaller ones. And what do they get from it? How would they use that information?
So I don't think it's just a matter of degree - it's a matter of how to use those metrics.
So unless you have a clear case in mind for how exactly to use that additional information (note that we already have metrics for the number of particular calls, so users would already know "the average size of a request") - I'm actually against this histogram and I would revert to the previous version.

Member:

> The information in the buckets gives a rough idea of how effective the indices are.

This isn't particularly useful information, because we have all the data to answer this question already.
There are exactly 4 indices:

  • pods by spec.NodeName [we know how many pods are on a given node]
  • secrets by metadata.Name [always exactly 1]
  • configmaps by metadata.Name [always exactly 1]
  • nodes by metadata.Name [always exactly 1]

Users have no way to influence that - you need to update k8s code and recompile to change it. So if someone is doing that, they can also add metrics if they really need them.
Also - the index is a label in your metric.

> The buckets also give rough information on how the load is divided among resources.

resource_prefix is a label - so we have that information independently of whether this is a histogram or not.

So I actually don't buy the above argument.

Member Author:

Oh, interesting, I did not know those specifics about the actually existing indices.

On the second part, sorry, I botched what I was trying to say. The idea is that for a given resource, there can be a variety of observations depending on other factors (such as populations in namespaces). The buckets give some rough information about the distribution of that variety.

Member:

> On the second part, sorry, I botched what I was trying to say. The idea is that for a given resource, there can be a variety of observations depending on other factors (such as populations in namespaces). The buckets give some rough information about the distribution of that variety.

Sure - but the question is how we can use that information. Let's say that instead of knowing that there were 100,000 object fetches, I now know that there were 100 of size 100 and 9 of size 10,000. What can I use that information for?

Member Author:

It can get useful in combination with other information. For one example, if I am investigating a situation with a few spikes in LIST latency for a particular resource, that information helps me narrow down my search for the cause(s). For another example, if I am trying to understand the distribution of LIST latencies for a given resource for the sake of tuning our coefficients in API Priority and Fairness, knowing the distribution of numbers fetched helps with that.

Member:

Those are somewhat hypothetical examples to me. Metrics indeed help pinpoint the approximate area of a problem, but it's really hard to debug something purely from metrics. And if we pinpoint the latency of LIST requests for a particular resource at a particular point in time, we have logs (and traces in the logs) that would allow us to actually debug it.

For the tuning example - what you described sounds a bit like an attempt to reverse-engineer, from the metrics, the expected characteristics introduced by the code. Sure - we can do that, but is that really what we want?
Also - can we really do it if we have only partial data for it (only the number of fetched objects, and only for lists served from the cache)? I would be much closer to buying that if all 4 things were histograms [though I'm still not convinced we need that].

@MikeSpreitzer (Member Author)

BTW, I am sure the name of the histogram metric is wrong; I am waiting for someone to tell me what the conventions imply the name should be (or remind me where to find the relevant conventions).

@MikeSpreitzer (Member Author)

The force-push to bf42429 undoes the histogram addition.

@wojtek-t (Member)

OK - based on @dgrisonnet's comment above that his only concern was about the index label, which was clarified in the comment thread, I'm going to approve, but I am adding a hold to give other people a chance to take another look.
I'm going to unhold on Wednesday morning (my time, so before PST Wednesday).
@logicalhan @ehashman ^^

/lgtm
/approve

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 20, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2021
@dgrisonnet (Member)

Looks good from my side 👍

/lgtm

@wojtek-t (Member)

With the above lgtm and timeout, I'm unholding.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 22, 2021
@MikeSpreitzer (Member Author)

/retest
