Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port deviceManager to windows container manager to enable GPU access #93285

Merged
merged 2 commits into from Dec 23, 2020

Conversation

aarnaud
Copy link
Contributor

@aarnaud aarnaud commented Jul 21, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:

Port the devicemanager to Windows node to allow device plugins.

An important device plugin (not part of this PR) is the directx device plugin for GPU access: https://github.com/aarnaud/k8s-directx-device-plugin

This is required for machine learning, high-performance computing and computer graphics under Windows using the GPU.

References:
#80917
#93263
#93262
#88554

Fix issue:
#88157

Does this PR introduce a user-facing change?:

Port the devicemanager to Windows node to allow device plugins like directx

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 21, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @aarnaud. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 21, 2020
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 21, 2020
@k8s-ci-robot
Copy link
Contributor

@aarnaud: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aarnaud
Copy link
Contributor Author

aarnaud commented Jul 21, 2020

Hi @matthyx and @michmike can you add ok-to-test label on this PR, with windows test fixed now.

@matthyx
Copy link
Contributor

matthyx commented Jul 21, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 21, 2020
@thomacos
Copy link

/approve
Moral support.

@thomacos
Copy link

/test pull-kubernetes-typecheck

@thomacos
Copy link

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@aarnaud
Copy link
Contributor Author

aarnaud commented Jul 21, 2020

Hi @jsturtevant @liggitt

I add again the constant file for windows to avoid this crash, and test it on windows server 2019 and all test passed.
Origin of this issue: #80917 (comment)

@aarnaud
Copy link
Contributor Author

aarnaud commented Jul 21, 2020

/assign @derekwaynecarr

@aarnaud
Copy link
Contributor Author

aarnaud commented Jul 21, 2020

/milestone v1.19

@michmike can you add this in the v1.19 like #80917

now the windows e2e test work and pass, we need to add test for device plugin on windows but I'm not comfortable with e2e test system on this project.

@k8s-ci-robot
Copy link
Contributor

@aarnaud: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v1.19

@michmike can you add this in the v1.19 like #80917

now the windows e2e test work and pass, we need to add test for device plugin on windows but I'm not comfortable with e2e test system on this project.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

// DevicePluginPath is the folder the Device Plugin is expecting sockets to be on
// Only privileged pods have access to this path
// Note: Placeholder until we find a "standard path"
DevicePluginPath = "c:\\var\\lib\\kubelet\\device-plugins\\"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OS-specific constants are pretty unusual ... it means code that references this constant but is not running on the target OS will get the wrong value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand but how can I change that. by using goruntime.GOOS != "windows" in

if socketPath == "" || !filepath.IsAbs(socketPath) {

if filepath.IsAbs(criSocket) && goruntime.GOOS != "windows" {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like DevicePluginPath is only referenced by KubeletSocket, and KubeletSocket is only referenced by NewManagerImpl in-tree. If we needed a different default path per-platform, something like this would be clearer to me:

diff --git a/pkg/kubelet/cm/devicemanager/manager.go b/pkg/kubelet/cm/devicemanager/manager.go
index 5d1925f9458..14dc3be1347 100644
--- a/pkg/kubelet/cm/devicemanager/manager.go
+++ b/pkg/kubelet/cm/devicemanager/manager.go
@@ -22,6 +22,7 @@ import (
 	"net"
 	"os"
 	"path/filepath"
+	"runtime"
 	"sort"
 	"sync"
 	"time"
@@ -125,7 +126,11 @@ func (s *sourcesReadyStub) AllReady() bool          { return true }
 
 // NewManagerImpl creates a new manager.
 func NewManagerImpl(numaNodeInfo cputopology.NUMANodeInfo, topologyAffinityStore topologymanager.Store) (*ManagerImpl, error) {
-	return newManagerImpl(pluginapi.KubeletSocket, numaNodeInfo, topologyAffinityStore)
+	socketPath := pluginapi.KubeletSocket
+	if runtime.GOOS == "windows" {
+		socketPath = pluginapi.KubeletSocketWindows
+	}
+	return newManagerImpl(socketPath, numaNodeInfo, topologyAffinityStore)
 }
 
 func newManagerImpl(socketPath string, numaNodeInfo cputopology.NUMANodeInfo, topologyAffinityStore topologymanager.Store) (*ManagerImpl, error) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liggitt, I changed my PR to add your feedback

Copy link
Member

@ddebroy ddebroy Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OS-specific constants are pretty unusual ... it means code that references this constant but is not running on the target OS will get the wrong value

My understanding here was: only the builds targeting Windows will reference the constants in constants_windows.go while Linux builds will reference the corresponding constants in constants_linux.go / constants_unix.go based on: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants_windows.go and https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/apis/kubeadm/v1beta1/defaults_windows.go as well as in Go libraries: https://github.com/golang/go/blob/master/src/os/path_windows.go#L8

@liggitt can you explain a bit why/how this does not work especially in the context of something node-specific like kubelet. I assume the compiler will fail the build if a targeted platform does not have all the constants defined that the platform agnostic code is referring to. Is the concern that a controller may potentially use these incorrectly across a cluster?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OS-specific constants are pretty unusual ... it means code that references this constant but is not running on the target OS will get the wrong value

@ddebroy I think @liggitt's concern here is roughly that because we use a single name across platforms for this constant, it may not be obvious when we wind up with code that depends on properties of a specific platform's constant value.

As opposed to having distinct constants for each platform.

If we already have platform specific implementations referencing the constant, why not reference a platform specific constant instead ...?

it looks like DevicePluginPath is only referenced by KubeletSocket, and KubeletSocket is only referenced by NewManagerImpl in-tree. If we needed a different default path per-platform, something like this would be clearer to me:

This comment outlines this before showing a specific proposed implementation ...

@thomacos
Copy link

Hi @derekwaynecarr,

Can you approve this PR for allowing GPU access under Windows?

This PR is the same as #80917, which was already fully approved and merged, except for the following change:
A Windows-specific fix of bug #93262 requires setting the DevicePluginPath constant to a different value under Windows than under Linux https://github.com/kubernetes/kubernetes/pull/93285/files#diff-9f69ebb59ab7424eb7662a8f6046467bR29.
All tests are passing.

@michmike
Copy link
Contributor

we need someone from sig-node to approve. Yuju, can you please approve.
@liggitt since you reviewed this in the past, can you also please approve?
/assign @yujuhong

@michmike
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 16, 2020
@michmike
Copy link
Contributor

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 16, 2020
@aarnaud
Copy link
Contributor Author

aarnaud commented Dec 16, 2020

/assign @derekwaynecarr

@thomacos
Copy link

Oh no room for merging this PR huh?
Fine!
We'll go build our own Kubernetes fork.
With blackjack.
And hookers.

🕯️ Happy holidays to everyone! 😀

@michmike
Copy link
Contributor

Oh no room for merging this PR huh?
Fine!
We'll go build our own Kubernetes fork.
With blackjack.
And hookers.

🕯️ Happy holidays to everyone! 😀

@thomacos with the holidays, a lot of folks are out. we are trying to get this merged ASAP

@thomacos
Copy link

thomacos commented Dec 23, 2020

Oh no room for merging this PR huh?
Fine!
We'll go build our own Kubernetes fork.
With blackjack.
And hookers.
🕯️ Happy holidays to everyone! 😀

@thomacos with the holidays, a lot of folks are out. we are trying to get this merged ASAP

No problem! Everything is fine. Was more meant as a light-hearted comment 😀.
(With a grain of truth, but definitely not directed at you sig-windows guys - thanks for all!)
Enjoy your earned free days, @michmike!

PS: Does nobody out there know Futurama? What a miss! 😉

@dchen1107
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aarnaud, dchen1107, michmike, thomacos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 23, 2020
@k8s-ci-robot k8s-ci-robot merged commit e20300b into kubernetes:master Dec 23, 2020
SIG-Windows automation moved this from In Review (v1.21) to Done (v1.21) Dec 23, 2020
@thomacos
Copy link

@dchen1107 thanks for approving!
@michmike thanks for your telekineticapproval abilities!

@michmike
Copy link
Contributor

@aarnaud and @thomacos this is an early xmas present! thank you to all that helped make this feature possible

@aarnaud
Copy link
Contributor Author

aarnaud commented Dec 24, 2020

thanks to all. Merry Christmas

@salanki
Copy link

salanki commented Dec 24, 2020

Great work @aarnaud

@TBBle
Copy link

TBBle commented Jan 29, 2021

If this makes device plugins work on Windows (which is my understanding) at least with dockershim, that seems like a user-visible change that should be in the v1.21 changelog, but is not.

Or this being deliberately sat-upon until #97739 is resolved so that non-deprecated container runtimes can support this?

@jsturtevant
Copy link
Contributor

Looks like there was no release note added. We should probably add one but I don't know the process for updating after the fact

@marosset
Copy link
Contributor

Looks like there was no release note added. We should probably add one but I don't know the process for updating after the fact

I think @aarnaud can just edit the initial PR comment to add a release note.
I agree this should be mentioned.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 29, 2021
@aarnaud
Copy link
Contributor Author

aarnaud commented Jan 29, 2021

I just added ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet