Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix pull image error from multiple ACRs using azure managed identity #96355

Merged
merged 1 commit into from Nov 12, 2020

Conversation

andyzhangx
Copy link
Member

@andyzhangx andyzhangx commented Nov 9, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
fix pull image error from multiple ACRs using azure managed identity

This PR continues with PR(#92330), fixed the pull image error with azure managed identity.

This PR is using per-registry cache (also with cache expiration), pls note that per-registry cache is only for managed identity, and for service principal(account name & password), it's not necessary since password is supposed to be the same in whole kubelet life time.

Which issue(s) this PR fixes:

Fixes #92326

Special notes for your reviewer:

  • impact of this issue, you may hit this issue when pulling images from multiple different ACRs, and after a few more retries, it should work:
Events:
  Type     Reason     Age                 From                                        Message
  ----     ------     ----                ----                                        -------
  Normal   Scheduled  104s                default-scheduler                           Successfully assigned default/nginx2 to aks-nodepool1-43715599-vmss000001
  Warning  Failed     55s (x3 over 97s)   kubelet, aks-nodepool1-43715599-vmss000001  Failed to pull image "andyacr3.azurecr.io/nginx:v1": rpc error: code = Unknown desc = Error response from daemon: Get https://andyacr3.azurecr.io/v2/nginx/manifests/v1: unauthorized: authentication required, visit https://aka.ms/acr/authorization for more information.
  Warning  Failed     55s (x3 over 97s)   kubelet, aks-nodepool1-43715599-vmss000001  Error: ErrImagePull
  Normal   BackOff    27s (x4 over 97s)   kubelet, aks-nodepool1-43715599-vmss000001  Back-off pulling image "andyacr3.azurecr.io/nginx:v1"
  Warning  Failed     27s (x4 over 97s)   kubelet, aks-nodepool1-43715599-vmss000001  Error: ImagePullBackOff
  Normal   Pulling    12s (x4 over 103s)  kubelet, aks-nodepool1-43715599-vmss000001  Pulling image "andyacr3.azurecr.io/nginx:v1"
  Normal   Pulled     12s                 kubelet, aks-nodepool1-43715599-vmss000001  Successfully pulled image "andyacr3.azurecr.io/nginx:v1" in 383.922005ms
  Normal   Created    12s                 kubelet, aks-nodepool1-43715599-vmss000001  Created container nginx
  Normal   Started    12s                 kubelet, aks-nodepool1-43715599-vmss000001  Started container nginx

With this fix, there is no authentication error when pulling images from two ACRs using managed identity:

Events:
  Type    Reason     Age   From                                        Message
  ----    ------     ----  ----                                        -------
  Normal  Scheduled  14s   default-scheduler                           Successfully assigned default/nginx to aks-nodepool1-43715599-vmss000000
  Normal  Pulling    13s   kubelet, aks-nodepool1-43715599-vmss000000  Pulling image "andyacr2.azurecr.io/nginx:v1"
  Normal  Pulled     6s    kubelet, aks-nodepool1-43715599-vmss000000  Successfully pulled image "andyacr2.azurecr.io/nginx:v1" in 7.513394748s
  Normal  Created    2s    kubelet, aks-nodepool1-43715599-vmss000000  Created container nginx
  Normal  Started    2s    kubelet, aks-nodepool1-43715599-vmss000000  Started container nginx

Events:
  Type    Reason     Age   From                                        Message
  ----    ------     ----  ----                                        -------
  Normal  Scheduled  13s   default-scheduler                           Successfully assigned default/nginx2 to aks-nodepool1-43715599-vmss000000
  Normal  Pulling    13s   kubelet, aks-nodepool1-43715599-vmss000000  Pulling image "andyacr3.azurecr.io/nginx:v1"
  Normal  Pulled     2s    kubelet, aks-nodepool1-43715599-vmss000000  Successfully pulled image "andyacr3.azurecr.io/nginx:v1" in 10.224016565s
  Normal  Created    2s    kubelet, aks-nodepool1-43715599-vmss000000  Created container nginx
  Normal  Started    2s    kubelet, aks-nodepool1-43715599-vmss000000  Started container nginx

Does this PR introduce a user-facing change?:

fix pull image error from multiple ACRs using azure managed identity

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

fix pull image error from multiple ACRs using azure managed identity

/priority important-soon
/sig cloud-provider
/area provider/azure
/triage accepted
/assign @feiskyer

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. triage/accepted Indicates an issue or PR is ready to be actively worked on. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 9, 2020
@andyzhangx
Copy link
Member Author

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 9, 2020
@andyzhangx
Copy link
Member Author

/test pull-kubernetes-bazel-test

@feiskyer
Copy link
Member

feiskyer commented Nov 9, 2020

/retest

@andyzhangx
Copy link
Member Author

/hold cancel
test pass: With this fix, there is no authentication error when pulling images from two ACRs using managed identity

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2020
@andyzhangx
Copy link
Member Author

/retest

@andyzhangx
Copy link
Member Author

/retest

2 similar comments
@andyzhangx
Copy link
Member Author

/retest

@andyzhangx
Copy link
Member Author

/retest

@andyzhangx
Copy link
Member Author

/test pull-kubernetes-e2e-aks-engine-azure

fix comments

fix comment

fix comments

fix comments

fix comments

fix comments

fix bazel
@andyzhangx
Copy link
Member Author

/retest

1 similar comment
@andyzhangx
Copy link
Member Author

/retest

@andyzhangx
Copy link
Member Author

/test pull-kubernetes-e2e-azure-disk-windows

Copy link
Member

@feiskyer feiskyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold (in case @liggitt may want to review again)

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 12, 2020
@feiskyer
Copy link
Member

/retest
(though azuredisk CSI test failures are not related to this PR's changes)

@andyzhangx
Copy link
Member Author

/retest

Comment on lines +301 to +307
// add ACR anonymous repo support: use empty username and password for anonymous access
defaultConfigEntry := credentialprovider.DockerConfigEntry{
Username: "",
Password: "",
Email: dummyRegistryEmail,
}
cfg["*.azurecr.*"] = defaultConfigEntry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case where we're returning a cache entry, this is modifying the shared cache object that was previously returned and can cause a data race.

Suggest making an addAnonymousCredentials(cfg) method and calling it from getFromACR when we're constructing a new cached credential set, and from the else block in Provide() when we're not using UseManagedIdentityExtension. That ensures we never modify a config after we've returned it from Provide()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @liggitt there is no shared cache object anymore, cfg is a newly constructed object every time in Provide() func, see line 253, I have removed the cacheDockerConfig field in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see, thanks.

Copy link
Member

@liggitt liggitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Comment on lines +301 to +307
// add ACR anonymous repo support: use empty username and password for anonymous access
defaultConfigEntry := credentialprovider.DockerConfigEntry{
Username: "",
Password: "",
Email: dummyRegistryEmail,
}
cfg["*.azurecr.*"] = defaultConfigEntry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see, thanks.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andyzhangx
Copy link
Member Author

/hold cancel
@liggitt thanks for the review. Really appreciate that!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@andyzhangx
Copy link
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot merged commit e21ce4b into kubernetes:master Nov 12, 2020
k8s-ci-robot added a commit that referenced this pull request Nov 17, 2020
…6355-upstream-release-1.19

Automated cherry pick of #96355: fix pull image error from multiple ACRs using azure managed
k8s-ci-robot added a commit that referenced this pull request Nov 17, 2020
…6355-upstream-release-1.17

Automated cherry pick of #96355: fix pull image error from multiple ACRs using azure managed
k8s-ci-robot added a commit that referenced this pull request Nov 17, 2020
…6355-upstream-release-1.18

Automated cherry pick of #96355: fix pull image error from multiple ACRs using azure managed
@porrascarlos802018
Copy link

Please update us on this
#92818
says will be addressed on #96355 however , this one at the end says on hold.

Please advise on ETA and actions required to be executed on existing clusters with this MSI problem.

@andyzhangx
Copy link
Member Author

andyzhangx commented Nov 20, 2020

Please update us on this
#92818
says will be addressed on #96355 however , this one at the end says on hold.

Please advise on ETA and actions required to be executed on existing clusters with this MSI problem.

@porrascarlos802018 this fix "pull image auth error from multiple(>1) ACRs using azure managed identity" has been merged in 1.17.14, 1.18.12, 1.19.4, for existing clusters with MSI, wait for a few more tries or using single ACR could workaround

@liggitt liggitt added kind/regression Categorizes issue or PR as related to a regression from a prior release. and removed kind/regression Categorizes issue or PR as related to a regression from a prior release. labels Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pulling Azure Container Registry image using Managed Service Identity may fail
6 participants