Add e2e test to validate performance metrics of volume lifecycle operations #94334
Conversation
Hi @RaunakShah. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
cc: @msau42 @xing-yang |
/ok-to-test |
Force-pushed from 3b22cb9 to 1ef5e22
/test pull-kubernetes-bazel-test |
/assign |
/assign @msau42 |
/retest |
@msau42, you mentioned several different types of stress tests. Can you take a look at this PR? |
/assign @pohly |
Force-pushed from 00cc483 to d8711c9
@msau42 yes! Do we want to merge this PR and then add some follow up? |
Here's a quick comparison between running tests via e2e.test (this PR) and ClusterLoader2 (CL2). For an example of using the latter, see https://github.com/kubernetes/perf-tests/blob/master/adhoc/log/distributed-provisioning.md Advantages of CL2:
Advantages of e2e.test:
My two cents regarding steps forward: I am worried that if we start adding more stress tests to e2e.test, then eventually we'll want more of the features that CL2 already has and will end up re-implementing CL2 here. I'd prefer to focus on e2e.test for functional tests that need little to no additional parameters besides the driver configuration and on CL2 for performance tests. |
Force-pushed from d8711c9 to 9c6aec5
Discussed the two options (clusterloader2 vs e2e.test) with @xing-yang and @pohly. We concluded the following:
|
Overall this looks fine. I've mostly looked at this from a user perspective, so my comments are mostly around documentation.
Count: 300,
ExpectedMetrics: &storageframework.Metrics{
	AvgLatency: 2 * time.Minute,
	Throughput: 0.5,
Where did these numbers come from? If they reflect what was seen in some specific system (for example, testing on Prow) then it would be worthwhile to document this here in a comment.
I'm a bit worried about random test flakes in Prow jobs. Anything timing based is problematic because performance is not very deterministic.
Opted for these numbers based on testing on my local setup (kind + hostpath CSI) and then gave it a lot of leeway. The intention (for now) is to set the expected metrics so high that if this test fails there's definitely a performance regression.
Can you put that into a comment in the source code?
done
@@ -205,6 +206,8 @@ type DriverInfo struct {
	StressTestOptions *StressTestOptions
	// [Optional] Scale parameters for volume snapshot stress tests.
	VolumeSnapshotStressTestOptions *VolumeSnapshotStressTestOptions
	// [Optional] Parameters for performance tests
	PerformanceTestOptions *PerformanceTestOptions
If someone wants to run stress tests with different parameters (for example, different storage classes or different sizes), then this can only be done by defining different test drivers.
I think that's okay, I just wanted to call this out. The alternative (multiple PerformanceTestOptions and then instantiating the test multiple times) would be complex, too. |
Force-pushed from e16e00e to 2001aa0
var _ storageframework.TestSuite = &volumePerformanceTestSuite{}

const testTimeout = 15 * time.Minute
How long does this test normally take? There is a push to move any long running tests to opt-in jobs.
It takes about 10 minutes on my local env and most of that time is for cleanup (we wait for all resources to be deleted).
if *sc.VolumeBindingMode == storagev1.VolumeBindingWaitForFirstConsumer {
	ginkgo.By("creating a pod referring to the claim")
	var pod *v1.Pod
	pod, err = e2epod.CreatePod(l.cs, pvc.Namespace, nil /* nodeSelector */, []*v1.PersistentVolumeClaim{pvc}, true /* isPrivileged */, "" /* command */)
The default max pods per node is 100, so running this with 300 pods is going to fail.
Do the prow jobs only run with a single node?
No, but the hostpath driver only works on a single node. I am not seeing this test case running in any of the pull jobs. I suspect it is because of "Serial" and "Slow" tags. Can you remove the Slow tag so we can see how long this test is taking?
Ok I guess the reason why this test is passing is because it's not running with "WaitForFirstConsumer". But I suspect that if you change the hostpath driver test to use it, then setting 300 pods is going to fail because you can only have a max of 100 pods per node.
Yes, I see what you mean. Do you want me to remove support for WFFC for now and revisit it once I've tested it locally?
Actually this test is currently skipping WFFC - https://github.com/kubernetes/kubernetes/pull/94334/files#diff-7f88229dff9847a80f3ff9c3fe105a83c19b0b0ef596186a84cfc29337fbaf88R174
I guess since the test is already skipping it then it's fine. It's just confusing because this code will never be executed, so removing it would avoid that confusion.
Removed!
/test pull-kubernetes-integration |
/retest |
@msau42 I see that the test passed in ~8m.
Do you want me to add the tag back, or does that not count as slow enough for [Slow]? |
Also these were the metrics from the test:
|
Looking at the other tests running here, I see most are taking 1-2 minutes, so I think it makes sense to add the [Slow] tag back. |
/retest |
Add e2e test to validate performance metrics of volume lifecycle operations. This test currently validates latency and throughput of volume provisioning against a high baseline.
Force-pushed from 8b5acd6 to 34e4a5f
/lgtm Thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msau42, RaunakShah The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment |
/test pull-kubernetes-e2e-gce-storage-snapshot |
/milestone v1.21 |
/test pull-kubernetes-e2e-gce-storage-snapshot |
/test pull-kubernetes-e2e-kind-ipv6 |
/test pull-kubernetes-e2e-kind-ipv6 |
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
What type of PR is this?
/kind feature
(Not sure if a test counts as a feature)
What this PR does / why we need it:
This test validates latency and throughput of volume
provisioning against a high baseline.
Test Steps:
Future work: Add more metrics and calculate performance metrics for other types of operations.
Which issue(s) this PR fixes:
Follow up to comments from @msau42 based on some performance improvements we've made to external sidecars.
Special notes for your reviewer:
Baselines are set to quite high values. Not sure what we should keep them as right now.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: