
storage e2e: verify CSIStorageCapacity publishing #100537

Merged
2 commits merged into kubernetes:master on Mar 25, 2021

Conversation

pohly
Contributor

@pohly pohly commented Mar 24, 2021

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Drivers need to opt into the new test. Depending on how the driver
describes its behavior, the check can be more specific. Currently it
distinguishes between getting any kind of information about the
storage class (i.e. assuming that capacity is not exhausted) and
getting one object per node (for local storage). Discovering nodes
only works for CSI drivers.

The immediate usage of this new test is for csi-driver-host-path with
the deployment for distributed provisioning and storage capacity
tracking. Periodic kubernetes-csi Prow and pre-merge jobs can run this
test.

Special notes for your reviewer:

The alternative would have been to write a test that manages the
deployment of the csi-driver-host-path driver itself, i.e. use the E2E
manifests. But that would have implied duplicating the
deployments (in-tree and in csi-driver-host-path) and then changing
kubernetes-csi Prow jobs to somehow run the in-tree driver definition
with newer components, something that currently isn't done. The test
then also wouldn't be applicable to out-of-tree driver deployments.

Yet another alternative would be to create a separate E2E test suite
either in csi-driver-host-path or external-provisioner. The advantage
of such an approach is that the test can be written exactly for the
expected behavior of a deployment and thus be more precise than the
generic version of the test in k/k. But this again wouldn't be
reusable for other drivers, and it would also be a lot of work to set
up, as no such E2E test suite currently exists.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1472-storage-capacity-tracking

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 24, 2021
@pohly
Contributor Author

pohly commented Mar 24, 2021

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 24, 2021
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 24, 2021
@verult
Contributor

verult commented Mar 24, 2021

/assign

@pohly pohly force-pushed the storage-capacity-e2e-test branch from 3bff619 to b9b5d13 Compare March 24, 2021 17:18
@pohly
Contributor Author

pohly commented Mar 24, 2021

Fixed the spelling mistake...

We discussed this in the CSI standup meeting and agreed to consider including this in 1.21, so I removed the "WIP".

@pohly pohly changed the title WIP: storage e2e: verify CSIStorageCapacity publishing storage e2e: verify CSIStorageCapacity publishing Mar 24, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2021
err := client.StorageV1().StorageClasses().Delete(context.TODO(), computedStorageClass.Name, metav1.DeleteOptions{})
if err != nil && !apierrors.IsNotFound(err) {
	framework.ExpectNoError(err, "delete storage class")
}
Contributor Author

This change was necessary because the new capacity test calls the function twice, once as part of the test and then through defer. I think ignoring "NotFound" errors is a common pattern when deleting.
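
Roughly, this is the pattern when the cleanup runs twice (a sketch only; the wrapper closure is illustrative, not the exact test code):

// Sketch: a cleanup helper that tolerates a missing object, so it can
// safely run both via defer and again explicitly inside the test.
cleanup := func() {
	err := client.StorageV1().StorageClasses().Delete(context.TODO(), computedStorageClass.Name, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		framework.ExpectNoError(err, "delete storage class")
	}
}
defer cleanup() // second invocation finds nothing and becomes a no-op
// ... test body ...
cleanup() // explicit cleanup as part of the test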

Member

I understand this is making the cleanup idempotent, good idea

@pohly
Contributor Author

pohly commented Mar 24, 2021

The corresponding PR in csi-driver-host-path is kubernetes-csi/csi-driver-host-path#265.

Here's what I did to run the test:

  • bring up a KinD cluster with the csi-driver-host-path branch: CSI_PROW_USE_BAZEL=false CSI_PROW_DEPLOYMENT=kubernetes-distributed CSI_PROW_TESTS=serial-alpha CSI_PROW_KUBERNETES_VERSION=latest ./.prow.sh (beware, that modifies k/k under $GOPATH!) - any other Kubernetes cluster with the v1alpha1 CSIStorageCapacity API also works
  • to experiment with the driver deployment (it's already installed by prow.sh, but I wanted to test various changes to test-driver.yaml): deploy/kubernetes-distributed/deploy.sh
  • in the k/k branch of this PR here: KUBECONFIG=/home/pohly/.kube/config go test -v ./test/e2e -ginkgo.focus=hostpath.csi.k8s.io.*provides.storage.capacity.information -storage.testdriver /nvme/gopath/src/github.com/kubernetes-csi/csi-driver-host-path/deploy/kubernetes-distributed/test-driver.yaml

@mauriciopoppe
Member

/assign

@pohly
Contributor Author

pohly commented Mar 24, 2021

/retest

@pohly
Contributor Author

pohly commented Mar 24, 2021

It's perhaps worth pointing out that the new test gets skipped almost instantaneously for drivers which don't support it (https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/100537/pull-kubernetes-e2e-kind/1374772707308081152) because the checks are done in SkipUnsupportedTests.

Contributor
@verult verult left a comment

First-pass review on the overall concept. At first glance I don't think it's a risky test to merge today, since it will be skipped for all drivers except external csi-hostpath and the test duration should be fairly short.

Just some clarifying questions around the overall Storage Capacity testing plan and how this PR fits in:

  • The KEP mentions "The existing raw block volume tests then can be used to ensure that pod scheduling works".
    • Is this still true? So without this PR, we have some test coverage but only for the case when Storage Capacity allows scheduling to proceed, i.e. no regression. We can't actually tell whether Storage Capacity changed scheduling decisions, nor whether it prevented a pod from scheduling altogether.
  • The KEP then says "A new test can be written which checks for CSIStorageCapacity objects, asks for pod scheduling with a volume that is too large, and then checks for events that describe the problem."
    • This PR only checks for expected CSIStorageCapacity objects, which I think is a low-impact test we can get in now to give us some coverage. Is the plan to extend this test to provision a volume and schedule a pod as well in the next release? My understanding is that ultimately we care about whether pod scheduling is rejected under capacity constraints, and a test verifying CSIStorageCapacity objects might not be strictly needed because it's testing an implementation detail, but is still good to have to ease debugging.

Regarding the different alternative test setups: with the setup in this PR, isn't it still possible to run this test using the hostPath CSI driver in k/k testing-manifests if we update its deployment spec and test capabilities (i.e. your first proposed alternative if I understand correctly)? Not saying we should but just curious if it's possible to reuse this test if we do decide to go down that route.


func (p *capacityTestSuite) SkipUnsupportedTests(driver storageframework.TestDriver, pattern storageframework.TestPattern) {
	// Check preconditions.
	if pattern.VolType != storageframework.DynamicPV {
Contributor

Eventually do you plan to include inline volumes here too?

Contributor Author

* The KEP mentions "The existing raw block volume tests then can be used to ensure that pod scheduling works".
  
  * Is this still true? So without this PR, we have _some_ test coverage but only for the case when Storage Capacity allows scheduling to proceed, i.e. no regression. We can't actually tell whether Storage Capacity changed scheduling decisions, nor whether it prevented a pod from scheduling altogether.

Correct. If we want to verify in an E2E test that scheduling really takes capacity into account, then we have to write a test that schedules pods and then somehow determines that the scheduler considered storage capacity while making its choices. This is hard in the positive case of a pod starting to run (there's little to no information recorded), but in the negative case of a pod failing to start we might be able to use the (unreliable!) events to learn about the reason.

It's much easier to write unit tests for the scheduler. That's how the functionality is already getting tested.
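
A rough sketch of what such an event-based check for the negative case might look like (assuming a client-go clientset `client`, a pending `pod`, the apimachinery `fields` package, and the e2e `framework` helpers are in scope; FailedScheduling is the usual scheduler event reason, but events remain unreliable):

// Sketch: look for scheduler events explaining why a pending pod was not
// scheduled. Events are best-effort, so such a check is inherently racy.
selector := fields.Set{
	"involvedObject.kind":      "Pod",
	"involvedObject.name":      pod.Name,
	"involvedObject.namespace": pod.Namespace,
}.AsSelector().String()
events, err := client.CoreV1().Events(pod.Namespace).List(context.TODO(), metav1.ListOptions{FieldSelector: selector})
framework.ExpectNoError(err, "list events for pod")
for _, event := range events.Items {
	if event.Reason == "FailedScheduling" {
		framework.Logf("scheduler reported: %s", event.Message)
	}
}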

* The KEP then says "A new test can be written which checks for CSIStorageCapacity objects, asks for pod scheduling with a volume that is too large, and then checks for events that describe the problem."

Yes, that's the test I meant above. Adding such a test to this test suite seems doable and worthwhile. We just shouldn't run it against a real driver where we have to allocate hundreds of terabytes before it runs out of storage 😅

  * This PR only checks for expected `CSIStorageCapacity` objects, which I think is a low-impact test we can get in now to give us some coverage. Is the plan to extend this test to provision a volume and schedule a pod as well in the next release? 

Yes. I also want to add checking of maximum volume size.

> My understanding is that ultimately we care about whether pod scheduling is rejected under capacity constraints, and a test verifying CSIStorageCapacity objects might not be strictly needed because it's testing an implementation detail, but is still good to have to ease debugging.

This particular test was motivated by kubernetes-csi/external-provisioner#590 which fixes an issue where too many CSIStorageCapacity objects were created. The failure was apparently random and even though the fix hasn't been merged yet, this test here doesn't trigger it. But the failure showed that there is a need for E2E testing because unit testing would never have exercised that faulty code.

> Regarding the different alternative test setups: with the setup in this PR, isn't it still possible to run this test using the hostPath CSI driver in k/k testing-manifests if we update its deployment spec and test capabilities (i.e. your first proposed alternative if I understand correctly)?

Yes, that would be possible. But as the test in its current form doesn't exercise any functionality in Kubernetes, it's not worthwhile now. That will change once we add tests that involve the scheduler.

> Eventually do you plan to include inline volumes here too?

That will make sense once the test covers scheduling.

if len(dInfo.TopologyKeys) != 1 {
	framework.Failf("need exactly one topology key for local storage, DriverInfo.TopologyKeys has: %v", dInfo.TopologyKeys)
}
matcher = HaveCapacitiesForClassAndNodes(f.ClientSet, sc.Provisioner, sc.Name, dInfo.TopologyKeys[0])
Contributor
@verult verult Mar 24, 2021

Thinking about ways to make the capacity matchers more generic instead of special-casing local storage: instead of having CapCSILocalStorage, what if we pass len(dInfo.TopologyKeys) into HaveCapacitiesForClassAndNodes(), and inside that function set TopologyKey to that one entry if len() is 1, otherwise ignore?

Contributor

Ah I realized this alone wouldn't work because the local storage matcher implementation assumes a single topology key implies distinct topology values for each node. Looking deeper.

node.Name,
h.topologyKey)
}
h.topologyValues = append(h.topologyValues, value)
Contributor
@verult verult Mar 24, 2021

If topologyValues is changed to a map, would this matcher then work for arbitrary topologies with one topology key? Then you could remove the local storage capability.

Contributor Author

Yes, I think that could work. Let me give it a try.
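
Roughly what that could look like (a sketch only; variable and field names are illustrative, it assumes an already-fetched node list `nodes` and that the topology value is exposed as a node label for the single key):

// Sketch: collect the distinct topology values as a set so that several
// nodes may share the same value for the single topology key.
topologyValues := map[string]bool{}
for _, node := range nodes.Items {
	value, ok := node.Labels[h.topologyKey]
	if !ok {
		framework.Failf("node %s has no label for topology key %s", node.Name, h.topologyKey)
	}
	topologyValues[value] = true
}
// One CSIStorageCapacity object per distinct value is then expected,
// instead of one per node.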

Contributor Author

I think that works. I've pushed it as a separate commit, just in case we later find out that there was some logical flaw after all and want to revert to the original test.

Contributor Author

"I think" = "I tested it".

Contributor Author

The logical extension of that is to support multiple topology keys. But then the test code becomes as complex as the corresponding code in external-provisioner, and if I simply copy that code, then the same bug will be in both... let's consider this again once we actually have a storage driver with multiple keys.


return
}

// Find all nodes on which the driver runs.
Contributor

This looks similar to the actual algorithm for computing CSIStorageCapacity. Are we running into the risk of reimplementing the logic we are testing here?

Ideally we should compute the number of expected objects based on configured expectations but that might be hard (for example to configure the topology segment distribution over nodes) and the existing test framework may not give enough information today to do this computation.

So if it's indeed close enough to a reimplementation, which would be hard to maintain, does it make sense to require that all LocalStorage drivers are expected to be installed on every node, and just use the node count to compute the expected number of Storage Capacity objects? The tradeoff here is that some types of drivers will be excluded, but maybe this is necessary if (1) we don't want to reimplement the logic we are testing and (2) we can't configure expectations more easily.

Contributor Author

> This looks similar to the actual algorithm for computing CSIStorageCapacity. Are we running into the risk of reimplementing the logic we are testing here?

Saw that after writing my own comment in #100537 (comment) - yes, that's exactly the risk. The "let's do a special case for local storage" approach came from the desire to keep the test code simple and easy to review. I also expect that most drivers which use that will be for local storage, so the test would already be useful with that limitation.

> Ideally we should compute the number of expected objects based on configured expectations but that might be hard (for example to configure the topology segment distribution over nodes) and the existing test framework may not give enough information today to do this computation.

Correct. The driver deployment itself would have to compute and configure this, which would be complex.

> So if it's indeed close enough to a reimplementation, which would be hard to maintain, does it make sense to require that all LocalStorage drivers are expected to be installed on every node, and just use the node count to compute the expected number of Storage Capacity objects?

I find that too limiting. PMEM-CSI cannot necessarily run on all nodes. I also don't find checking of the CSINode objects that hard or complex.
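
For illustration, the CSINode-based node discovery is roughly this kind of lookup (a sketch, not the exact test code; `driverName` stands for the CSI driver's name):

// Sketch: determine on which nodes the driver is registered by checking
// the CSINode objects instead of assuming that it runs on every node.
csiNodes, err := client.StorageV1().CSINodes().List(context.TODO(), metav1.ListOptions{})
framework.ExpectNoError(err, "list CSINode objects")
var driverNodes []string
for _, csiNode := range csiNodes.Items {
	for _, driver := range csiNode.Spec.Drivers {
		if driver.Name == driverName {
			driverNodes = append(driverNodes, csiNode.Name)
			break
		}
	}
}
// len(driverNodes) then bounds how many CSIStorageCapacity objects to expect.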

We can support topology detection for all drivers with a single
topology key by allowing different nodes to have the same topology
value. This makes the test a bit more generic and its configuration
simpler.
@msau42
Member

msau42 commented Mar 24, 2021

/approve
@verult gets final lgtm

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2021
@verult
Contributor

verult commented Mar 24, 2021

/triage accepted
/lgtm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 24, 2021
@verult
Contributor

verult commented Mar 24, 2021

/test pull-kubernetes-e2e-kind-ipv6

@verult
Contributor

verult commented Mar 24, 2021

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 24, 2021
@msau42
Member

msau42 commented Mar 24, 2021

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit e34046c into kubernetes:master Mar 25, 2021
@verult
Contributor

verult commented Mar 25, 2021

I pinged Patrick offline, but for record keeping, here were some additional thoughts for this test:

(Per @msau42 's suggestion) I was thinking about whether this test is more for testing K8s integration logic or driver logic. AFAIU the test suites framework here is mostly to test various drivers. My current take is that this particular test has a mix of both. In the end it's testing how the provisioner creates storage capacity objects given what storageclasses and topology segments are available. The latter information is different depending on the driver, so at least in part it's testing drivers. If we opt to use the test framework to test k8s integration instead, using csi-mock or host-path or some combination of the two in the future, the topology patterns tested would be limited. So it makes sense to keep this test where it is now in the PR.

Based on @pohly 's comments I think the current form makes sense, given that (1) this test is necessary to catch a previously discovered error; (2) the logic is simple enough and not as complicated as the actual implementation because multi-topology-key test logic is (intentionally) missing; (3) it's too hard to use the existing framework to write an expectation-based test.

@pohly
Contributor Author

pohly commented Mar 25, 2021

> (Per @msau42 's suggestion) I was thinking about whether this test is more for testing K8s integration logic or driver logic. AFAIU the test suites framework here is mostly to test various drivers. My current take is that this particular test has a mix of both. In the end it's testing how the provisioner creates storage capacity objects given what storageclasses and topology segments are available. The latter information is different depending on the driver, so at least in part it's testing drivers.

Exactly. For third-party driver developers the test provides more assurance that their driver deployment actually produces correct CSIStorageCapacity objects, instead of relying on the indirect confirmation that pod scheduling with late binding isn't breaking.
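
For example, such a deployment check could list the published objects directly (sketch; assumes the v1alpha1 client that existed at the time, plus a `namespace`, a storage class `sc`, and the e2e `framework` helpers):

// Sketch: list the CSIStorageCapacity objects published for one storage class
// and log what the external-provisioner produced for it.
capacities, err := client.StorageV1alpha1().CSIStorageCapacities(namespace).List(context.TODO(), metav1.ListOptions{})
framework.ExpectNoError(err, "list CSIStorageCapacity objects")
for _, capacity := range capacities.Items {
	if capacity.StorageClassName != sc.Name {
		continue
	}
	framework.Logf("capacity %s: topology %v, capacity %v", capacity.Name, capacity.NodeTopology, capacity.Capacity)
}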

> (3) it's too hard to use the existing framework to write an expectation-based test

True. I was considering further fields in DriverInfo to describe the driver in more detail:

  • "linear storage" - creating a volume of size "x" for storage class "Y" reduces the available capacity by exactly that amount
  • "expected capacity per storage class" - this is what the reported value should be initially

I decided against both: even LVM drivers might not be perfectly linear, which would make the first only useful for the fake storage capacity in the hostpath driver, and parallel testing or volume leaks from other tests would break the expected capacity value checking.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 29, 2021