Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make podIPs order match node IP family preference #97979

Merged
merged 1 commit into from Mar 5, 2021

Conversation

danwinship
Copy link
Contributor

@danwinship danwinship commented Jan 12, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:
Kubelet currently sets pod.status.podIPs directly from the array of IPs returned from the runtime. But there are two problems with this:

  1. pod.status.podIPs is restricted to be at most one IPv4 and at most one IPv6 IP, but the list of IPs from CRI/CNI has no such restriction; if the runtime returned, eg, 3 IPv4 addresses, then kubelet would try to set podIPs to that and then fail.
  2. kubelet is assuming that the IPs are returned in preference order, but CRI/CNI does not actually specify that, and the consensus of the CNI maintainers in DualStack: Ip family of status.podIP can be wrong depending on CNI-plugin #95251 seems to be that we should not be assuming that; kubelet should treat the list as unordered, and impose its own logic on them. (As a concrete example of how this causes problems; Cilium always returns the IPv6 pod IP first if it is configured to do dual-stack, which currently means any Cilium dual-stack kubernetes cluster ends up having IPv6-primary pod IPs, whether the administrator wants that or not.) Although this technically doesn't break anything, it's considered ugly and confusing. (Maybe a bigger problem is that if the CNI plugin returns dual-stack pod IPs in a single-stack cluster, kubelet might end up grabbing a pod IP that is incompatible with the cluster configuration.)

This PR fixes kubelet to always pick an IP of the same family as its --node-ip for the first pod IP, and an IP of the opposite family (if any) for the second pod IP, and ignore any other IPs from the runtime.

Which issue(s) this PR fixes:
Fixes #95251

Does this PR introduce a user-facing change?:

When a CNI plugin returns dual-stack pod IPs, kubelet will now try to respect the
"primary IP family" of the cluster by picking a primary pod IP of the same family
as the (primary) node IP, rather than assuming that the CNI plugin returned the IPs
in the order the administrator wanted (since some CNI plugins don't allow
configuring this).

/sig network
/priority important-longterm
/cc @uablrek @aojea @khenidak
/auto-cc

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 12, 2021
@ehashman ehashman added this to Needs Reviewer in SIG Node PR Triage Jan 12, 2021
@krmayankk
Copy link

So this issue should only be observed in dual stack configurations ?

@uablrek
Copy link
Contributor

uablrek commented Jan 13, 2021

No, as @danwinship points out;

(Maybe a bigger problem is that if the CNI plugin returns dual-stack pod IPs in a single-stack cluster, kubelet might end up grabbing a pod IP that is incompatible with the cluster configuration.)

So in a single stack cluster with primary family IPv4 where the CNI-plugin returns two addresses but the IPv6 address first you get;

  09:02:35 Server Version: v1.20.1
  09:02:35 CNI-plugin; bridge
  09:02:36 Single-stack cluster. BaseFamily=IPv4, phase.3 dual-stack API
$ kubectl get pod kahttp-6wqm8 -o wide
NAME           READY   STATUS    RESTARTS   AGE     IP          NODE     NOMINATED NODE   READINESS GATES
kahttp-6wqm8   1/1     Running   0          3m15s   1100::402   vm-004   <none>           <none>

NOTE this is a single stack IPv4 cluster with; --feature-gates=IPv6DualStack=false,SCTPSupport=true.

I use bridge CNI-plugin since it allows control of the order of addresses.

This is not as bad for K8s as it looks since endpointslices saves the day;

$ kubectl get endpoints kahttp
NAME     ENDPOINTS                                                     AGE
kahttp   [1100::104]:443,[1100::202]:443,[1100::303]:443 + 5 more...   4m42s
$ kubectl get endpointslices | grep kahttp
kahttp-lt8hr               IPv4          443,80           11.0.2.2,11.0.3.3,11.0.4.2 + 1 more...            5m

It is probably a disaster for other programs that looks at podIP and endpoints, for instance a service mesh.

@uablrek
Copy link
Contributor

uablrek commented Jan 13, 2021

Same thing with this PR applied;

  09:23:36 Server Version: v1.21.0-alpha.0.786+a098b5b16c9a7f-dirty
  09:23:36 CNI-plugin; bridge
  09:23:36 Single-stack cluster. BaseFamily=IPv4, phase.3 dual-stack API
$ kubectl get pod kahttp-88cdx -o wide
NAME           READY   STATUS    RESTARTS   AGE    IP         NODE     NOMINATED NODE   READINESS GATES
kahttp-88cdx   1/1     Running   0          2m3s   11.0.2.3   vm-002   <none>           <none>
$ kubectl get endpoints kahttp
NAME     ENDPOINTS                                            AGE
kahttp   11.0.1.3:443,11.0.2.3:443,11.0.3.3:443 + 5 more...   2m18s
$ kubectl get endpointslices | grep kahttp
kahttp-48l5j               IPv4          443,80           11.0.2.3,11.0.4.2,11.0.3.3 + 1 more...            2m26s

Copy link
Contributor

@uablrek uablrek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aojea
Copy link
Member

aojea commented Jan 13, 2021

pull-kubernetes-verify — Job failed.

./hack/update-bazel.sh
./hack/update-gofmt.sh

it LGTM, should we document this somewhere else?

kubelet is assuming that the IPs are returned in preference order, but CRI/CNI does not actually specify that, and the consensus of the CNI maintainers in #95251 seems to be that we should not be assuming that; kubelet should treat the list as unordered, and impose its own logic on them.

@danwinship
Copy link
Contributor Author

pull-kubernetes-verify — Job failed.

sigh, I always forget that

it LGTM, should we document this somewhere else?

it's really a CNI thing, not a kubernetes thing...

@aojea
Copy link
Member

aojea commented Jan 14, 2021

it's really a CNI thing, not a kubernetes thing...

I mean the kubelet behaviour

In dual-stack clusters, kubelet will now try to respect the "primary IP family" of the
cluster when setting pod.status.podIPs, by picking a primary pod IP of the same family
as the node IP rather than assuming that the CNI plugin returned the IPs in the order
the administrator wanted.

@ehashman
Copy link
Member

/retest
flake

I'd be +1 to documenting that the order output here matters and why.

Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@@ -2613,3 +2614,137 @@ func TestGenerateAPIPodStatusHostNetworkPodIPs(t *testing.T) {
})
}
}

func TestGenerateAPIPodStatusPodIPs(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great tests \o/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 great coverage

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 19, 2021
@ehashman
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 19, 2021
@aojea
Copy link
Member

aojea commented Feb 6, 2021

IMO this must go into K8s v1.21 for dual-stack to go "beta". Enhancement freeze at Feb 9.

totally agree

@aojea
Copy link
Member

aojea commented Feb 13, 2021

/assign @thockin
for final approval

@derekwaynecarr
Copy link
Member

the kubelet changes look good to me for sig-node. ordering makes sense here.

/approve

@derekwaynecarr derekwaynecarr moved this from Needs Approver to Done in SIG Node PR Triage Mar 4, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aanm, danwinship, derekwaynecarr, uablrek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2021
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to rebase and nuke BUILD

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2021
@aojea
Copy link
Member

aojea commented Mar 4, 2021

/lgtm
it was already lgtmed before the rebase and there is a wide consensus

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9f451c0 into kubernetes:master Mar 5, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 5, 2021
@uablrek
Copy link
Contributor

uablrek commented Mar 5, 2021

Great 👍

dseomn added a commit to dseomn/system-configs that referenced this pull request Apr 9, 2021
This works around rook/rook#7522 by making
podIP use the IPv6 address instead of the IPv4 address. As of kubernetes
1.21 with kubernetes/kubernetes#97979, it looks
like the podIP family will match the node IP family. The node IP still
defaults to IPv4, but specifying it manually works.
@danwinship danwinship deleted the pod-ip-family branch May 3, 2021 22:08
@Jean-Daniel
Copy link

For the record, this fix is only a partial fix, as it misses pod creation case: #103263

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

DualStack: Ip family of status.podIP can be wrong depending on CNI-plugin
10 participants