
kube-proxy has to clear NodePort stale UDP entries #98305

Merged
merged 1 commit into kubernetes:master on Feb 11, 2021

Conversation

@aojea (Member) commented Jan 22, 2021

What type of PR is this?

/kind bug
/kind failing-test
/kind flake

What this PR does / why we need it:

kube-proxy needs to clean up the stale UDP conntrack entries for NodePorts. However, this was done before the iptables rules were programmed, and when the NodePort Service was added, without taking into account whether endpoints were available.

The conntrack entries have to be cleared after the iptables rules are programmed and when
the service has endpoints available.
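
A minimal sketch of that ordering (hypothetical helper and function names, not the actual kube-proxy code; kube-proxy shells out to the conntrack utility through its own helpers in a similar way):

```go
// Minimal sketch: flush stale UDP NodePort conntrack entries only AFTER
// iptables-restore has succeeded, and only for services with endpoints.
package proxysketch

import (
	"fmt"
	"os/exec"
)

// clearUDPEntriesForPort mimics `conntrack -D -p udp --dport <port>`.
// conntrack exits non-zero when nothing matched; kube-proxy tolerates that case.
func clearUDPEntriesForPort(port int) error {
	out, err := exec.Command("conntrack", "-D", "-p", "udp",
		"--dport", fmt.Sprint(port)).CombinedOutput()
	if err != nil {
		return fmt.Errorf("clearing UDP conntrack entries for dport %d: %v: %s", port, err, out)
	}
	return nil
}

// syncProxyRules stands in for the proxier sync loop. staleUDPNodePorts
// must contain only NodePorts of UDP services that have ready endpoints.
func syncProxyRules(restoreIPTables func() error, staleUDPNodePorts []int) error {
	// 1. Program the iptables rules first.
	if err := restoreIPTables(); err != nil {
		return err
	}
	// 2. Only now flush conntrack; flushing before the restore leaves a
	//    window in which an incoming packet can recreate the stale entry.
	for _, port := range staleUDPNodePorts {
		if err := clearUDPEntriesForPort(port); err != nil {
			fmt.Println(err) // best effort, mirroring kube-proxy's logging
		}
	}
	return nil
}
```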

Which issue(s) this PR fixes:
Fixes #91236

Special notes for your reviewer:

I've found that the flake failed constantly if the UDP conntrack entries for the NodePort were never cleared on the opened socket:
#98130 (comment)

We've modified the test to connect in parallel to the ClusterIP, so we can verify that the pods and the Services are working correctly. The connection to the ClusterIP works; its conntrack entry is as follows:

udp      17 115 src=10.131.0.73 dst=10.128.2.75 sport=80 dport=12345 src=10.128.2.75 dst=10.131.0.73 sport=12345 dport=80 [ASSURED] mark=0

There is no conntrack entry for the NodePort connection.

However, the NodePort connection doesn't work, and its conntrack entry is the following:

udp      17 28 src=10.128.2.74 dst=10.0.32.3 sport=12345 dport=30801 [UNREPLIED] src=10.0.32.3 dst=10.0.32.2 sport=30801 dport=12345 mark=0

The iptables rules are correctly installed, but the packet is not forwarded because it hits that stale conntrack entry: NAT rules are only evaluated for the first packet of a flow, so once this [UNREPLIED] entry exists, subsequent packets keep following it instead of the newly programmed DNAT rules.

Does this PR introduce a user-facing change?:

kube-proxy: fix a bug on UDP NodePort Services where stale conntrack entries may blackhole the traffic directed to the NodePort.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 22, 2021
@k8s-ci-robot (Contributor)

@aojea: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 22, 2021
@aojea (Member, Author) commented Jan 22, 2021

/assign @thockin @danwinship
/cc @JornShen
/sig network

@JornShen (Member)

Does ipvs mode also need this modification? @aojea

@aojea (Member, Author) commented Jan 22, 2021

Does ipvs mode also need this modification? @aojea

/hold
yep, thanks
In the meantime let's see what Dan and Tim think about this change.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 22, 2021
@thockin (Member) commented Jan 23, 2021

OK, let's have a real chat about this.

First, I thought that we would not add the nodeport rule if we couldn't open the local port, but that just doesn't seem to be true.

What I am left wondering is - why do we even bother with this? If we can't open it, we just log it and move on. In that case, the very thing we are trying to prevent (colliding with an app in the root netns) is not prevented. We know that apps explicitly using ports in this range are rare. So is this really even worth doing?

What if we just took it all out? Does anyone think this is really worth the effort any more? At the time I thought it was very clever, but I no longer believe it's worth its own weight.

@aojea - what do you think? Leave the conntrack flush, I think, but all the rest could go, including e2e.

I was going to make the same argument for hostPort - and maybe I will. Kubelet already doesn't re-open those if it restarts, so it's at best "sloppy".

@aojea (Member, Author) commented Jan 23, 2021

@aojea - what do you think? Leave the conntrack flush, I think, but all the rest could go, including e2e.

the conntrack flush has to be there; without it the test for UDP NodePorts always fails, at least that is what we observed in #98130 (comment)

First, I thought that we would not add the nodeport rule if we couldn't open the local port, but that just doesn't seem to be true.

indeed, this surprised me too, but thinking more about it: the apiserver can't know whether the port is free on the nodes, so it would be strange for a user to be able to create a NodePort Service and then have kube-proxy refuse to program it.

I was going to make the same argument for hostPort - and maybe I will. Kubelet already doesn't re-open those if it restarts, so it's at best "sloppy".

I had this discussion with @danwinship, and the conclusion I took from that conversation is that this is needed for security: iptables rules take precedence over the application using the port, so the traffic will be forwarded to the NodePort/HostPort and the application will be "hijacked" by the Service.
I think that is the security problem the port binding tries to mitigate, but as you say it doesn't actually hold, because we only log and install the rules anyway... I have the feeling that I'm missing something; maybe Dan can explain more.

@thockin (Member) commented Jan 23, 2021 via email

@JornShen (Member)

it would be strange for a user to be able to create a NodePort Service and then have kube-proxy refuse to program it

agree with @aojea

I think we can emit an event when opening the local port fails. @thockin
At least that lets the user know what happened on the node, and externalIPs seem to do the same:

```go
socket, err := proxier.portMapper.OpenLocalPort(&lp, isIPv6)
if err != nil {
	msg := fmt.Sprintf("can't open %s, skipping this externalIP: %v", lp.String(), err)
	proxier.recorder.Eventf(
		&v1.ObjectReference{
			Kind:      "Node",
			Name:      proxier.hostname,
			UID:       types.UID(proxier.hostname),
			Namespace: "",
		}, v1.EventTypeWarning, err.Error(), msg)
	klog.ErrorS(err, "can't open port, skipping externalIP", "port", lp.String())
	continue
}
replacementPortsMap[lp] = socket
```

And

a failure on programming the iptables rules still keeps the socket open, and also traffic to the nodeport is consumed

That seems not right; kube-proxy will close it again:

```go
klog.V(5).InfoS("Restoring iptables", "rules", proxier.iptablesData.Bytes())
err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)
if err != nil {
	klog.ErrorS(err, "Failed to execute iptables-restore")
	metrics.IptablesRestoreFailuresTotal.Inc()
	// Revert new local ports.
	klog.V(2).InfoS("Closing local ports after iptables-restore failure")
	utilproxy.RevertPorts(replacementPortsMap, proxier.portsMap)
	return
}
```

Look at line 1607. So adjusting the order in which we open the port seems to make little difference here.

If I understood it wrong, please point it out @aojea :)

@danwinship (Contributor)

We know that apps explicitly using ports in this range are rare. So is this really even worth doing?

Were nodeports always limited to a range? Or did that come later and so the port opener used to make more sense?

I was going to make the same argument for hostPort - and maybe I will.

Do we limit hostports to a specific range? We need to prevent people from using hostports to snipe nodeports (and system services) (and hostnetwork pods).

At some point you have to ask who is in charge. In a kube cluster, I think the kube API is in charge.

So that's interesting. In OpenShift 4.x, we have pretty much declared that you are not allowed to run your own non-containerized software on an OpenShift node, but I had assumed that people didn't feel that way about kubernetes clusters in general. Eg, kubernetes explicitly takes into account the fact that another piece of software on the system may accidentally delete its iptables rules out from under it; and there was some discussion in #92938 about whether it's reasonable for kube-proxy to change the value of host-wide kernel networking sysctls.

If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary.

(And maybe there could be some way to explicitly indicate to kubernetes that, eg, you're using port 22 for something, so stay away?)

@aojea (Member, Author) commented Jan 28, 2021

If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary.

but why hold the port open? do we want to prevent people from accidentally opening services behind forwarded ports?

@JornShen (Member)

If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary.

but why hold the port open? do we want to prevent people from accidentally opening services behind forwarded ports?

@aojea

look at the code comments:

```go
// If the "external" IP happens to be an IP that is local to this
// machine, hold the local port open so no other process can open it
// (because the socket might open but it would never work).
if (svcInfo.Protocol() != v1.ProtocolSCTP) && localAddrSet.Has(net.ParseIP(externalIP)) {
	lp := utilproxy.LocalPort{
		Description: "externalIP for " + svcNameString,
		IP:          externalIP,
		Port:        svcInfo.Port(),
		Protocol:    protocol,
	}
```

I think opening the port is a protective measure.

@aojea (Member, Author) commented Jan 28, 2021

I think opening the port is a protective measure.

yeah, you are right, but that is for ExternalIPs, which also have their own CVE 😄.

What about making the port holding configurable, something like SELinux (rough sketch after the list):

  • Enforce: keep the socket open (don't install the rule if it can't be opened)
  • Permissive: try to open a socket to check whether something is there and log it (current behavior), or just log and don't hold the socket open (I prefer this one and would set it as the default)
  • Disabled: don't try to hold the port at all
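
A rough sketch of what such a knob could look like (purely hypothetical names; none of this exists in kube-proxy):

```go
// Hypothetical sketch of the proposed SELinux-style knob; these names do
// not exist in kube-proxy and only illustrate the three modes above.
type PortHoldingMode string

const (
	// Enforce: keep the socket open; skip the rule if the bind fails.
	PortHoldingEnforce PortHoldingMode = "Enforce"
	// Permissive: probe the port, log a conflict, install the rule anyway.
	PortHoldingPermissive PortHoldingMode = "Permissive"
	// Disabled: never try to bind the port.
	PortHoldingDisabled PortHoldingMode = "Disabled"
)
```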

@aojea (Member, Author) commented Feb 1, 2021

ok, my hunch was right but I was wrong about the reason: it is not the socket thing, it is the conntrack cleaning, which has to be done after the rules are programmed. Otherwise it can happen that, in the window between cleaning the conntrack entries for the UDP NodePort and programming the rules, a new conntrack entry is inserted.
This entry goes stale and keeps being renewed by new connections, hence the test always fails to reach the first pod, because the traffic never gets forwarded by the iptables rules.
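
For illustration, a client loop like the following would keep renewing such an entry (addresses taken from the conntrack output above; a hypothetical sketch, not the actual e2e test):

```go
// Hypothetical reproducer: keep sending UDP datagrams to the NodePort.
// If the first datagram lands before the DNAT rules are programmed, the
// kernel records an [UNREPLIED] entry, and every later datagram on the
// same 5-tuple refreshes that stale entry instead of hitting the rules.
package main

import (
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "10.0.32.3:30801") // node IP and NodePort from above
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	for {
		conn.Write([]byte("hostname")) // same 5-tuple each time, renewing the entry
		time.Sleep(time.Second)
	}
}
```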

I just need to confirm that this actually fixes the problem, but we have a way to reproduce it now, so I will test it tomorrow

This PR moves the conntrack cleaning code to after the iptables rules are programmed, so it should solve the issue too.

cc: @smarterclayton

Clear conntrack entries for UDP NodePorts;
this has to be done AFTER the iptables rules are programmed.
It can happen that traffic to the NodePort hits the host before
the iptables rules are programmed; this will create a stale entry
in conntrack that will blackhole the traffic, so we need to
clear it ONLY when the service has endpoints.
@danwinship (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2021
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@aojea (Member, Author) commented Feb 10, 2021

/retest

@aojea (Member, Author) commented Feb 10, 2021

/retest
[sig-storage] PersistentVolumes-local [Volume type: dir-link

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@aojea (Member, Author) commented Feb 11, 2021

/retest

Many flakes lately ....

@k8s-ci-robot k8s-ci-robot merged commit 659b4dc into kubernetes:master Feb 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 11, 2021
@wojtek-t (Member)

@aojea - great finding - thanks a lot for debugging and fixing this!

@ialidzhikov (Contributor)

@aojea, does it make sense to have cherry-picks for this PR?

@aojea (Member, Author) commented Feb 11, 2021

@aojea, does it make sense to have cherry-picks for this PR?

yes it does, are you volunteering :) ?
Otherwise I will try to do it next week

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 11, 2021
@aojea (Member, Author) commented Feb 12, 2021

@ialidzhikov FYI #99017

@ialidzhikov (Contributor) commented Feb 12, 2021

@aojea, does it make sense to have cherry-picks for this PR?

yes it does, are you volunteering :) ?
Otherwise I will try to do it next week

Thanks for asking @aojea, but I don't think I have the required expertise in this area. I just wanted to make sure that we don't miss backporting a PR that deserves a backport. :)
Thank you for taking care!
