Add e2e test to validate performance metrics of volume lifecycle operations #94334
Conversation
Hi @RaunakShah. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
cc: @msau42 @xing-yang |
/ok-to-test |
Force-pushed from 3b22cb9 to 1ef5e22
/test pull-kubernetes-bazel-test |
/assign |
/assign @msau42 |
/retest |
@msau42, you mentioned several different types of stress tests. Can you take a look at this PR? |
/assign @pohly |
Force-pushed from 00cc483 to d8711c9
@msau42 yes! Do we want to merge this PR and then add some follow up? |
Here's a quick comparison between running tests via e2e.test (this PR) and ClusterLoader2 (CL2). For an example of using the latter, see https://github.com/kubernetes/perf-tests/blob/master/adhoc/log/distributed-provisioning.md Advantages of CL2:
Advantages of e2e.test:
My two cents regarding steps forward: I am worried that if we start adding more stress tests to e2e.test, then eventually we'll want more of the features that CL2 already has and will end up re-implementing CL2 here. I'd prefer to focus on e2e.test for functional tests that need little to no additional parameters besides the driver configuration and on CL2 for performance tests. |
Force-pushed from d8711c9 to 9c6aec5
Discussed the two options (clusterloader2 vs e2e.test) with @xing-yang and @pohly. We concluded the following:
|
Overall this looks fine. I've mostly looked at this from a user perspective, so my comments are mostly around documentation.
Count: 300,
ExpectedMetrics: &storageframework.Metrics{
	AvgLatency: 2 * time.Minute,
	Throughput: 0.5,
Where did these numbers come from? If they reflect what was seen in some specific system (for example, testing on Prow) then it would be worthwhile to document this here in a comment.
I'm a bit worried about random test flakes in Prow jobs. Anything timing based is problematic because performance is not very deterministic.
Opted for these numbers based on testing on my local setup (kind + hostpath CSI) and then gave it a lot of leeway. The intention (for now) is to set the expected metrics so high that if this test fails there's definitely a performance regression.
Can you put that into a comment in the source code?
done
@@ -205,6 +206,8 @@ type DriverInfo struct {
	StressTestOptions *StressTestOptions
	// [Optional] Scale parameters for volume snapshot stress tests.
	VolumeSnapshotStressTestOptions *VolumeSnapshotStressTestOptions
	// [Optional] Parameters for performance tests
	PerformanceTestOptions *PerformanceTestOptions
If someone wants to run stress tests with different parameters (for example, different storage classes or different sizes), then this can only be done by defining different test drivers.
I think that's okay, I just wanted to call this out. The alternative (multiple PerformanceTestOptions and then instantiating the test multiple times) would be complex, too. |
Force-pushed from e16e00e to 2001aa0
var _ storageframework.TestSuite = &volumePerformanceTestSuite{}

const testTimeout = 15 * time.Minute
How long does this test normally take? There is a push to move any long running tests to opt-in jobs.
It takes about 10 minutes on my local env and most of that time is for cleanup (we wait for all resources to be deleted).
if *sc.VolumeBindingMode == storagev1.VolumeBindingWaitForFirstConsumer {
	ginkgo.By("creating a pod referring to the claim")
	var pod *v1.Pod
	pod, err = e2epod.CreatePod(l.cs, pvc.Namespace, nil /* nodeSelector */, []*v1.PersistentVolumeClaim{pvc}, true /* isPrivileged */, "" /* command */)
The default max pods per node is 100, so running this with 300 pods is going to fail.
Do the prow jobs only run with a single node?
No, but the hostpath driver only works on a single node. I am not seeing this test case running in any of the pull jobs. I suspect it is because of "Serial" and "Slow" tags. Can you remove the Slow tag so we can see how long this test is taking?
Ok I guess the reason why this test is passing is because it's not running with "WaitForFirstConsumer". But I suspect that if you change the hostpath driver test to use it, then setting 300 pods is going to fail because you can only have a max of 100 pods per node.
Yes, I see what you mean. Do you want me to remove support for WFFC for now and revisit it once I've tested it locally?
Actually this test is currently skipping WFFC - https://github.com/kubernetes/kubernetes/pull/94334/files#diff-7f88229dff9847a80f3ff9c3fe105a83c19b0b0ef596186a84cfc29337fbaf88R174
I guess since the test is already skipping it then it's fine. It's just confusing because this code will never be executed, so removing it would avoid that confusion.
Removed!
/test pull-kubernetes-integration |
/retest |
@msau42 I see that the test passed in ~8m.
Do you want me to add the tag back, or does that not count as slow enough for [Slow]? |
Also these were the metrics from the test:
|
Looking at the other tests running here, I see most are taking 1-2 minutes, so I think it makes sense to add the [Slow] tag back. |
/retest |
Add e2e test to validate performance metrics of volume lifecycle operations. This test currently validates latency and throughput of volume provisioning against a high baseline.
Force-pushed from 8b5acd6 to 34e4a5f
/lgtm Thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msau42, RaunakShah The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment |
/test pull-kubernetes-e2e-gce-storage-snapshot |
/milestone v1.21 |
/test pull-kubernetes-e2e-gce-storage-snapshot |
/test pull-kubernetes-e2e-kind-ipv6 |
/test pull-kubernetes-e2e-kind-ipv6 |
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
What type of PR is this?
/kind feature
(Not sure if a test counts as a feature)
What this PR does / why we need it:
This test validates latency and throughput of volume
provisioning against a high baseline.
Test Steps:
Future work: Add more metrics and calculate performance metrics for other types of operations.
Which issue(s) this PR fixes:
Follow up to comments from @msau42 based on some performance improvements we've made to external sidecars.
Special notes for your reviewer:
Baselines are set to quite high values. Not sure what we should keep them as right now.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: