
Speed up PV provisioning for vsphere driver #100054

Merged

Conversation

gnufied
Member

@gnufied gnufied commented Mar 10, 2021

vSphere PV provisioning gets progressively slower as more nodes and hosts are added to the cluster. I think this is mainly because of the number of repeated API calls we make to vCenter during volume provisioning.

Typically we make the following calls for immediate-binding, non-zone-based provisioning:

  1. We get the list of nodes (let's call it N) from the node manager (which itself could trigger API calls).
  2. For each node, we fetch its host information from the node itself. This requires an additional API call for each node in the cluster (N API calls).
  3. Then, from the host ref obtained in the previous call, we fetch the full host information, including its datastores. This is done separately for each node.
  4. For all of the datastore refs obtained in the previous calls, we fetch the full datastore information.
  5. For each datastore obtained in the previous calls, we make a separate API call to construct the full Datastore object. This is especially problematic because any number of datastores can be attached to a host, so fetching them individually can be expensive.

So this results in quite a few calls to the vCenter API.

This PR simplifies that logic:

  1. We get the list of nodes from the node manager (which itself could trigger API calls).
  2. For all known nodes, we fetch the host information in a single one-shot API call (and hence no per-node calls).
  3. For all of the host refs obtained from the previous list, we make a single API call to fetch the full host information.
  4. Rather than checking which datastores are shared with these hosts, we check which hosts are attached to the selected "candidate" datastores (this saves a lot of API calls).
  5. In the last step, we just check whether the set of hosts to which a "candidate" datastore is attached is a superset of the host set we built in step 3.

This PR reduces the number of API calls to vCenter from potentially hundreds to 5 or 6.
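For context, here is a minimal sketch of the bulk-fetch idea, not the PR's actual code: govmomi's PropertyCollector can retrieve a property for a whole slice of managed object references in one round trip, which is what replaces the per-node and per-datastore calls. The package, function, and variable names below are illustrative.

package vspheresketch

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/property"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/mo"
	"github.com/vmware/govmomi/vim25/types"
)

// fetchHostSummaries retrieves the "summary" property of every host ref in a
// single PropertyCollector round trip instead of one API call per host.
func fetchHostSummaries(ctx context.Context, c *vim25.Client, hostRefs []types.ManagedObjectReference) ([]mo.HostSystem, error) {
	pc := property.DefaultCollector(c)
	var hosts []mo.HostSystem
	if err := pc.Retrieve(ctx, hostRefs, []string{"summary"}, &hosts); err != nil {
		return nil, fmt.Errorf("bulk host fetch failed: %w", err)
	}
	return hosts, nil
}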

Caveats:

  1. I have not fixed the code path that uses SPBM policies. That path still results in the same number of API calls (but we can fix it progressively).
  2. We rely on the host's UUID to check whether the set of node hosts is a subset of a datastore's "attached hosts". I would ask reviewers to review that part of the code carefully.

Fixes #99372

/sig storage
/kind bug

Improve speed of vSphere PV provisioning and reduce number of API calls

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 10, 2021
@k8s-ci-robot k8s-ci-robot added area/cloudprovider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Mar 10, 2021
@gnufied
Member Author

gnufied commented Mar 10, 2021

/assign @divyenpatel

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 10, 2021
@gnufied
Member Author

gnufied commented Mar 10, 2021

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 10, 2021
@gnufied
Member Author

gnufied commented Mar 10, 2021

cc @yuga711

@yangjunmyfm192085
Contributor

/retest

@gnufied
Member Author

gnufied commented Mar 10, 2021

/retest

@divyenpatel
Member

@gnufied Can you share the testing results for this PR? Has this change been validated by creating a few volumes?

}
for _, host := range hostInfos {
	hostDCName := fmt.Sprintf("%s/%s", host.datacenter, host.hostMOID)
	nodeHosts[hostDCName] = host
Member

I am not getting why we are using hostDCName as the key for nodeHosts. I see that hostDCName is created using fmt.Sprintf("%s/%s", host.datacenter, host.hostMOID).

Member Author

@gnufied gnufied Mar 10, 2021

I am doing this to eliminate duplicate hosts, which are likely because many nodes will run on the same host. I am also using a combination of host.datacenter and host.hostMOID because, if I remember correctly, the hostMOID (which will look like HostSystem:host-213) is only guaranteed to be unique within a single vCenter; so if more than one host somehow has the same hostMOID, they will still be treated as distinct hosts.

Member

Can you please add this knowledge as a comment, e.g. in the hostInfo struct?
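For illustration, a sketch of the kind of comment being asked for, attached to a hypothetical hostInfo struct; the field names are assumed from the snippet above, not copied from the PR.

package vspheresketch

import "fmt"

// hostInfo is a hypothetical stand-in for the PR's struct.
type hostInfo struct {
	hostUUID   string
	hostMOID   string // e.g. "HostSystem:host-213"; only guaranteed unique within one vCenter
	datacenter string
}

// dedupeNodeHosts keys hosts by "<datacenter>/<hostMOID>" so that duplicates
// (many nodes running on the same host) collapse into one entry, while hosts
// in different datacenters that happen to share an MOID remain distinct.
func dedupeNodeHosts(hostInfos []hostInfo) map[string]hostInfo {
	nodeHosts := make(map[string]hostInfo)
	for _, host := range hostInfos {
		hostDCName := fmt.Sprintf("%s/%s", host.datacenter, host.hostMOID)
		nodeHosts[hostDCName] = host
	}
	return nodeHosts
}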

	dcNodes[vcDC] = append(dcNodes[vcDC], nodeInfo)
}

for dcVC, nodes := range dcNodes {
Contributor

nit: dcNodes is populated with the key variable vcDC, but here the key is named dcVC. Can the naming be consistent?

Member Author

fixed

result := true
for _, host := range nodeHosts {
	hostFound := false
	for _, targetHost := range dataStoreHosts {
Contributor

Will it be very costly here to iterate through all dataStoreHosts for all nodeHosts for all candidateDatastores? Could host.hostUUID and host.hostMOID be stored in nodeHosts instead, for a simple lookup?

Member Author

What do you mean by "all candidateDatastores"? We just need the first one we find.
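As a side note, a sketch of the map-based lookup suggested above, continuing the hypothetical hostInfo sketch from earlier: build a set of the datastore's attached host UUIDs once, then check each node host against it instead of nesting two loops.

// hostsAreSubset reports whether every node host (by UUID) appears among the
// hosts attached to the candidate datastore.
func hostsAreSubset(nodeHosts map[string]hostInfo, dataStoreHosts []hostInfo) bool {
	attached := make(map[string]struct{}, len(dataStoreHosts))
	for _, h := range dataStoreHosts {
		attached[h.hostUUID] = struct{}{}
	}
	for _, h := range nodeHosts {
		if _, ok := attached[h.hostUUID]; !ok {
			return false // at least one node host cannot see this datastore
		}
	}
	return true
}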

@gnufied
Member Author

gnufied commented Mar 11, 2021

@divyenpatel Yes, I have tested both the success case (i.e. where the datastore was shared with the hosts) and the failure case (where the datastore was not shared with all hosts), and both work after this change.

For example:

Failed case

E0311 04:10:45.175143 3096210 vsphere.go:1397] Failed to get shared datastore: unable to find any shared datastores
I0311 04:10:45.175195 3096210 pv_controller.go:1580] failed to provision volume for claim "default/myclaim" with StorageClass "thin": unable to find any shared datastores
E0311 04:10:45.175240 3096210 goroutinemap.go:150] Operation for "provision-default/myclaim[e371dae5-89d9-45ae-9d39-f8fa91d10dc6]" failed. No retries permitted until 2021-03-11 04:10:46.175223777 +0000 UTC m=+65.627356168 (durationBeforeRetry 1s). Error: "unable to find any shared datastores"
I0311 04:10:45.175305 3096210 event.go:291] "Event occurred" object="default/myclaim" kind="PersistentVolumeClaim" apiVersion="v1" type="Warning" reason="ProvisioningFailed" message="Failed to provision volume with StorageClass \"thin\": unable to find any shared datastores"
^C

Success Case

I0311 04:15:56.836150 3105091 vsphere_volume.go:375] Provision with selectedNode:  and allowedTopologies : []
I0311 04:15:56.836191 3105091 vsphere.go:1249] Starting to create a vSphere volume with volumeOptions: &{CapacityKB:1048576 Tags:map[kubernetes.io/created-for/pv/name:pvc-233cc8d4-3a30-4d81-9aea-902ece03869e kubernetes.io/created-for/pvc/name:myclaim kubernetes.io/created-for/pvc/namespace:default] Name:kubernetes-dynamic-pvc-233cc8d4-3a30-4d81-9aea-902ece03869e DiskFormat:thin Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType: Zone:[] SelectedNode:nil}
I0311 04:15:56.842483 3105091 deployment_controller.go:490] "Error syncing deployment" deployment="default/sandbox" err="Operation cannot be fulfilled on deployments.apps \"sandbox\": the object has been modified; please apply your changes to the latest version and try again"
I0311 04:15:56.855739 3105091 vsphere.go:1301] Volume topology : []
W0311 04:15:57.037626 3105091 datacenter.go:269] QueryVirtualDiskUuid failed for diskPath: "[nfsdata] kubevols/kube-dummyDisk.vmdk". err: ServerFaultCode: File [nfsdata] kubevols/kube-dummyDisk.vmdk was not found
I0311 04:15:57.037680 3105091 vsphere_volume_util.go:171] Successfully created vsphere volume kubernetes-dynamic-pvc-233cc8d4-3a30-4d81-9aea-902ece03869e
I0311 04:15:57.037708 3105091 pv_controller.go:1585] volume "pvc-233cc8d4-3a30-4d81-9aea-902ece03869e" for claim "default/myclaim" created
I0311 04:15:57.043546 3105091 pv_controller.go:1610] volume "pvc-233cc8d4-3a30-4d81-9aea-902ece03869e" for claim "default/myclaim" saved
I0311 04:15:57.043607 3105091 pv_controller.go:1663] volume "pvc-233cc8d4-3a30-4d81-9aea-902ece03869e" provisioned for claim "default/myclaim"

@gnufied
Member Author

gnufied commented Mar 12, 2021

@divyenpatel @RaunakShah Are there any outstanding comments here? Can you please check? I am trying to get this in as a bugfix for 1.21.

@RaunakShah
Contributor

/approve

@divyenpatel
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 12, 2021
@gnufied
Member Author

gnufied commented Mar 15, 2021

/assign @andrewsykim

Speeds up PV provisioning for vSphere driver by using bulk fetching of hosts and reversing logic that fetches datastores
Add warning about fetching hosts individually
@gnufied gnufied force-pushed the make-pv-provisioning-faster-vsphere branch from 3c7637e to c3cc5a4 on March 16, 2021 16:21
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2021
@jsafrane
Member

This is a bugfix for #99372
/milestone 1.21
/lgtm

@k8s-ci-robot
Contributor

@jsafrane: The provided milestone is not valid for this repository. Milestones in this repository: [next-candidate, v1.16, v1.17, v1.18, v1.19, v1.20, v1.21, v1.22]

Use /milestone clear to clear the milestone.

In response to this:

This is a bugfix for #99372
/milestone 1.21
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2021
@jsafrane
Member

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 16, 2021
@jsafrane
Member

/retest

@jsafrane
Member

/approve
from sig-storage perspective

/assign @andrewsykim
for cloud provider approval

@andrewsykim
Member

/assign @cheftako

@cheftako I think this one is yours since it's vsphere-related

var vmoList []mo.VirtualMachine
err := pc.Retrieve(ctx, vmRefs, []string{nameProperty, runtimeHost}, &vmoList)
if err != nil {
	klog.Errorf("SharedHost.getNodeHosts: unable to fetch vms from datacenter %s: %w", nodeInfo.dataCenter.String(), err)
Member

Was this supposed to be %v for the error?

Member

NM unwrap, cool.

Member

this breaks go test:

go test k8s.io/legacy-cloud-providers/vsphere -mod=readonly
# k8s.io/legacy-cloud-providers/vsphere
staging/src/k8s.io/legacy-cloud-providers/vsphere/shared_datastore.go:156:3: Errorf call has error-wrapping directive %w
staging/src/k8s.io/legacy-cloud-providers/vsphere/shared_datastore.go:172:3: Errorf call has error-wrapping directive %w
FAIL	k8s.io/legacy-cloud-providers/vsphere [build failed]
FAIL

pc = property.DefaultCollector(nodeInfo.dataCenter.Client())
err = pc.Retrieve(ctx, hostRefs, []string{summary}, &hostMoList)
if err != nil {
	klog.Errorf("SharedHost.getNodeHosts: unable to fetch hosts from datacenter %s: %w", nodeInfo.dataCenter.String(), err)
Member

Same question here.

Member

this also breaks go test
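For reference, a minimal sketch of the distinction being discussed (a hypothetical helper, not the PR's code): klog.Errorf is not an error-wrapping function, so the printf vet check that go test runs rejects %w there; %v prints the error text, while %w belongs in fmt.Errorf when the error is returned to the caller.

package vspheresketch

import (
	"fmt"

	"k8s.io/klog/v2"
)

// logAndWrap logs with %v (accepted by the printf vet check) and returns a
// wrapped error built with %w so callers can still use errors.Is / errors.As.
func logAndWrap(dcName string, err error) error {
	klog.Errorf("unable to fetch hosts from datacenter %s: %v", dcName, err)
	return fmt.Errorf("unable to fetch hosts from datacenter %s: %w", dcName, err)
}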

@cheftako
Member

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, divyenpatel, gnufied, jsafrane, RaunakShah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2021
@gnufied
Member Author

gnufied commented Mar 22, 2021

/retest

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/cloudprovider
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider.
sig/storage Categorizes an issue or PR as relevant to SIG Storage.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vsphere disk provisioning can be slow in large clusters
10 participants