Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add e2e test to validate performance metrics of volume lifecycle operations #94334

Merged
merged 1 commit into from Mar 25, 2021

Conversation

RaunakShah
Copy link
Contributor

@RaunakShah RaunakShah commented Aug 29, 2020

What type of PR is this?
/kind feature
(Not sure if a test counts as a feature)

What this PR does / why we need it:
This test validates latency and throughput of volume
provisioning against a high baseline.

Test Steps:

  1. Create a Storage Class
  2. Set up an informer on PVCs
  3. Create 500 PVCs
  • Store create time for each PVC
  1. Wait for all PVCs to enter Bound state
  • Use informer to store time when every PVC enters Bound state.
  1. End wait if:
  • All PVCs enter Bound state (or)
  • Test times out
  1. Calculate performance metrics based on above times
  2. Validate performance metrics against baseline.

Future work: Add more metrics and calculate performance metrics for other types of operations.

Which issue(s) this PR fixes:

Follow up to comments from @msau42 based on some performance improvements we've made to external sidecars.

Special notes for your reviewer:
Baselines are set to quite high values. Not sure what we should keep it as right now..
Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 29, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @RaunakShah. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@RaunakShah
Copy link
Contributor Author

cc: @msau42 @xing-yang

@k8s-ci-robot k8s-ci-robot added area/test sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 29, 2020
@eddiezane
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 31, 2020
@RaunakShah
Copy link
Contributor Author

/test pull-kubernetes-bazel-test

@xing-yang
Copy link
Contributor

/assign

@xing-yang
Copy link
Contributor

/assign @msau42

@xing-yang
Copy link
Contributor

/retest

@xing-yang
Copy link
Contributor

@msau42, you mentioned several different types of stress tests. Can you take a look of this PR?

@xing-yang
Copy link
Contributor

/assign @pohly

@RaunakShah
Copy link
Contributor Author

@msau42 yes! Do we want to merge this PR and then add some follow up?

@pohly
Copy link
Contributor

pohly commented Feb 9, 2021

Here's a quick comparison between running tests via e2e.test (this PR) and Cluster Loader 2 = CL2. For an example of using the latter, see https://github.com/kubernetes/perf-tests/blob/master/adhoc/log/distributed-provisioning.md

Advantages of CL2:

  • tests are divided into steps, with timing and success recorded in a JUnit.xml file for each step
  • flexible configuration mechanism via overrides
  • templates for object creation
  • request rate configurable for each step
  • can gather metrics data

Advantages of e2e.test:

  • integration with existing tests
  • more familiar to developer (?) than CL2

My two cents regarding steps foward: I am worried that if we start adding more stress tests to e2e.test, then eventually we'll want more of the features that CL2 already has and will end up re-implementing CL2 here. I'd prefer to focus on e2e.tests for functional tests that need little to no additional parameters besides the driver configuration and on CL2 for performance tests.

@RaunakShah
Copy link
Contributor Author

Discussed the two options (clusterloader2 vs e2e.test) with @xing-yang and @pohly. We concluded on the following:

  1. Since the original intention of this PR was to run this performance test as part of the external sidecar prow jobs, it makes sense to keep this in e2e.test. @pohly since the other reviewers have already approved this PR, can you take a look as well and provide any comments?
  2. We'll dig deeper into clusterloader2 to leverage it for long term performance bench marking.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks fine. I've mostly looked at this from a user perspective, so my comments are mostly around documentation.

Count: 300,
ExpectedMetrics: &storageframework.Metrics{
AvgLatency: 2 * time.Minute,
Throughput: 0.5,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did these numbers come from? If they reflect what was seen in some specific system (for example, testing on Prow) then it would be worthwhile to document this here in a comment.

I'm a bit worried about random test flakes in Prow jobs. Anything timing based is problematic because performance is not very deterministic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted for these numbers based on testing on my local setup (kind + hostpath CSI) and then gave it a loooot of leeway. The intention (for now) is to set the expected metrics so high that if this test fails there's definitely a performance regression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put that into a comment in the source code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

test/e2e/storage/framework/testdriver.go Show resolved Hide resolved
@@ -205,6 +206,8 @@ type DriverInfo struct {
StressTestOptions *StressTestOptions
// [Optional] Scale parameters for volume snapshot stress tests.
VolumeSnapshotStressTestOptions *VolumeSnapshotStressTestOptions
// [Optional] Parameters for performance tests
PerformanceTestOptions *PerformanceTestOptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone wants to run stress tests with different parameters (for example, different storage classes or different sizes), then this can only be done by defining different test drivers.

I think that's okay, I just wanted to call this out. The alternative (multiple PerformanceTestOptions and then instantiating the test multiple times) would be complex, too.

test/e2e/storage/testsuites/volumeperf.go Outdated Show resolved Hide resolved

var _ storageframework.TestSuite = &volumePerformanceTestSuite{}

const testTimeout = 15 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this test normally take? There is a push to move any long running tests to opt-in jobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes about 10 minutes on my local env and most of that time is for cleanup (we wait for all resources to be deleted).

if *sc.VolumeBindingMode == storagev1.VolumeBindingWaitForFirstConsumer {
ginkgo.By("creating a pod referring to the claim")
var pod *v1.Pod
pod, err = e2epod.CreatePod(l.cs, pvc.Namespace, nil /* nodeSelector */, []*v1.PersistentVolumeClaim{pvc}, true /* isPrivileged */, "" /* command */)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default max pods per node is 100, so running this with 300 pods is going to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the prow jobs only run with a single node?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but the hostpath driver only works on a single node. I am not seeing this test case running in any of the pull jobs. I suspect it is because of "Serial" and "Slow" tags. Can you remove the Slow tag so we can see how long this test is taking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I guess the reason why this test is passing is because it's not running with "WaitForFirstConsumer". But I suspect that if you change the hostpath driver test to use it, then setting 300 pods is going to fail because you can only have a max of 100 pods per node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes i see what you mean.. Do you want me to remove support for WFFC for now and revisit it once i've tested it locally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess since the test is already skipping it then it's fine. It's just confusing because this code will never be executed, so removing it would avoid that confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

@RaunakShah
Copy link
Contributor Author

/test pull-kubernetes-integration

@RaunakShah
Copy link
Contributor Author

/retest

@RaunakShah
Copy link
Contributor Author

@msau42 i see that the test passed in ~8m

Kubernetes e2e suite: [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (filesystem volmode)] volume-lifecycle-performance should provision volumes at scale within performance constraints [Serial] | 8m15s
-- | --

Do you want me to add the tag back or does that not count as slow enough for [Slow]?

@RaunakShah
Copy link
Contributor Author

Also these were the metrics from the test:

I0309 18:29:48.623] Mar  9 18:29:48.620: INFO: Metrics to evaluate: (*framework.Metrics)(0xc003d47190)({
I0309 18:29:48.623]  AvgLatency: (time.Duration) 1m15.751037471s,
I0309 18:29:48.623]  Throughput: (float64) 2.4025551621793855
I0309 18:29:48.623] })

@msau42
Copy link
Member

msau42 commented Mar 17, 2021

Do you want me to add the tag back or does that not count as slow enough for [Slow]?

Looking at the other tests running here, I see most are taking 1-2 minutes, so I think it makes sense to add the [Slow] tag back.

@RaunakShah
Copy link
Contributor Author

/retest

…ations.

This test currently validates latency and throughput of volume
provisioning against a high baseline.
@msau42
Copy link
Member

msau42 commented Mar 24, 2021

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, RaunakShah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2021
@RaunakShah
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-storage-snapshot

@msau42
Copy link
Member

msau42 commented Mar 24, 2021

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 24, 2021
@RaunakShah
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-storage-snapshot

@xing-yang
Copy link
Contributor

/test pull-kubernetes-e2e-kind-ipv6

1 similar comment
@RaunakShah
Copy link
Contributor Author

/test pull-kubernetes-e2e-kind-ipv6

@RaunakShah
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@k8s-ci-robot k8s-ci-robot merged commit 2eb6911 into kubernetes:master Mar 25, 2021
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants