
kube-proxy has to clear NodePort stale UDP entries #98305

Merged
merged 1 commit into kubernetes:master on Feb 11, 2021

Conversation

@aojea (Member) commented Jan 22, 2021

What type of PR is this?

/kind bug
/kind failing-test
/kind flake

What this PR does / why we need it:

kube-proxy needs to clean up the stale UDP conntrack entries for NodePorts. However, this was done before the iptables rules were programmed, and when the NodePort Service was added, without taking into account whether endpoints were available.

The conntrack entries have to be cleared after the iptables rules are programmed and when
the service has endpoints available.
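
A minimal sketch of that ordering (hypothetical helper and function names, not the actual kube-proxy code; kube-proxy shells out to the conntrack utility through its own helpers in a similar way):

```go
// Minimal sketch: flush stale UDP NodePort conntrack entries only AFTER
// iptables-restore has succeeded, and only for services with endpoints.
package proxysketch

import (
	"fmt"
	"os/exec"
)

// clearUDPEntriesForPort mimics `conntrack -D -p udp --dport <port>`.
// conntrack exits non-zero when nothing matched; kube-proxy tolerates that case.
func clearUDPEntriesForPort(port int) error {
	out, err := exec.Command("conntrack", "-D", "-p", "udp",
		"--dport", fmt.Sprint(port)).CombinedOutput()
	if err != nil {
		return fmt.Errorf("clearing UDP conntrack entries for dport %d: %v: %s", port, err, out)
	}
	return nil
}

// syncProxyRules stands in for the proxier sync loop. staleUDPNodePorts
// must contain only NodePorts of UDP services that have ready endpoints.
func syncProxyRules(restoreIPTables func() error, staleUDPNodePorts []int) error {
	// 1. Program the iptables rules first.
	if err := restoreIPTables(); err != nil {
		return err
	}
	// 2. Only now flush conntrack; flushing before the restore leaves a
	//    window in which an incoming packet can recreate the stale entry.
	for _, port := range staleUDPNodePorts {
		if err := clearUDPEntriesForPort(port); err != nil {
			fmt.Println(err) // best effort, mirroring kube-proxy's logging
		}
	}
	return nil
}
```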

Which issue(s) this PR fixes:
Fixes #91236

Special notes for your reviewer:

I've found that the flake failed constantly if the UDP conntrack entries for the NodePort were never cleared on the opened socket:
#98130 (comment)

We've modified the test to connect in parallel to the ClusterIP, so we can verify that the pods and the Services are working correctly. The connection to the ClusterIP works; its conntrack entry is as follows:

udp      17 115 src=10.131.0.73 dst=10.128.2.75 sport=80 dport=12345 src=10.128.2.75 dst=10.131.0.73 sport=12345 dport=80 [ASSURED] mark=0

There is no conntrack entry for the NodePort connection.

However, the NodePort connection doesn't work, and its conntrack entry is the following:

udp      17 28 src=10.128.2.74 dst=10.0.32.3 sport=12345 dport=30801 [UNREPLIED] src=10.0.32.3 dst=10.0.32.2 sport=30801 dport=12345 mark=0

The iptables rules are correctly installed, but the packet is not forwarded because it hits that stale conntrack entry: NAT rules are only evaluated for the first packet of a flow, so once this [UNREPLIED] entry exists, subsequent packets keep following it instead of the newly programmed DNAT rules.

Does this PR introduce a user-facing change?:

kube-proxy: fix a bug on UDP NodePort Services where stale conntrack entries may blackhole the traffic directed to the NodePort.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 22, 2021
@k8s-ci-robot (Contributor)

@aojea: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 22, 2021
@aojea (Member, Author) commented Jan 22, 2021

/assign @thockin @danwinship
/cc @JornShen
/sig network

@JornShen (Member)

Does ipvs mode also need this modification? @aojea

@aojea (Member, Author) commented Jan 22, 2021

Does ipvs mode also need this modification? @aojea

/hold
yep, thanks
In the meantime let's see what Dan and Tim think about this change.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 22, 2021
@thockin (Member) commented Jan 23, 2021

OK, let's have a real chat about this.

First, I thought that we would not add the nodeport rule if we couldn't open the local port, but that just doesn't seem to be true.

What I am left wondering is - why do we even bother with this? If we can't open it, we just log it and move on. In that case, the very thing we are trying to prevent (colliding with an app in the root netns) is not prevented. We know that apps explicitly using ports in this range are rare. So is this really even worth doing?

What if we just took it all out? Does anyone think this is really worth the effort any more? At the time I thought it was very clever, but I no longer believe it's worth its own weight.

@aojea - what do you think? Leave the conntrack flush, I think, but all the rest could go, including e2e.

I was going to make the same argument for hostPort - and maybe I will. Kubelet already doesn't re-open those if it restarts, so it's at best "sloppy".

@aojea (Member, Author) commented Jan 23, 2021

@aojea - what do you think? Leave the conntrack flush, I think, but all the rest could go, including e2e.

the conntrack flush has to be there; without it the test for UDP NodePorts always fails, at least that is what we observed in #98130 (comment)

First, I thought that we would not add the nodeport rule if we couldn't open the local port, but that just doesn't seem to be true.

indeed, this surprised me too, but thinking more about it: the apiserver can't know whether the port is free on the nodes, so it would be strange for a user to be able to create a NodePort Service and then have kube-proxy refuse to program it.

I was going to make the same argument for hostPort - and maybe I will. Kubelet already doesn't re-open those if it restarts, so it's at best "sloppy".

I had this discussion with @danwinship, and the conclusion I took from that conversation is that this is needed for security: iptables rules take precedence over the application using the port, so the traffic will be forwarded to the NodePort/HostPort and the application will be "hijacked" by the Service.
I think that is the security problem the port binding tries to mitigate, but as you say it doesn't actually hold, because we only log and install the rules anyway... I have the feeling that I'm missing something; maybe Dan can explain more.

@thockin (Member) commented Jan 23, 2021 via email

@JornShen (Member)

it would be strange for a user to be able to create a NodePort Service and then have kube-proxy refuse to program it

agree with @aojea

I think we can emit an event when opening the local port fails. @thockin
At least that lets the user know what happened on the node, and externalIPs seem to do the same:

```go
socket, err := proxier.portMapper.OpenLocalPort(&lp, isIPv6)
if err != nil {
	msg := fmt.Sprintf("can't open %s, skipping this externalIP: %v", lp.String(), err)
	proxier.recorder.Eventf(
		&v1.ObjectReference{
			Kind:      "Node",
			Name:      proxier.hostname,
			UID:       types.UID(proxier.hostname),
			Namespace: "",
		}, v1.EventTypeWarning, err.Error(), msg)
	klog.ErrorS(err, "can't open port, skipping externalIP", "port", lp.String())
	continue
}
replacementPortsMap[lp] = socket
```

And

a failure on programming the iptables rules still keeps the socket open, and also traffic to the nodeport is consumed

That seems not right; kube-proxy will close it again:

```go
klog.V(5).InfoS("Restoring iptables", "rules", proxier.iptablesData.Bytes())
err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)
if err != nil {
	klog.ErrorS(err, "Failed to execute iptables-restore")
	metrics.IptablesRestoreFailuresTotal.Inc()
	// Revert new local ports.
	klog.V(2).InfoS("Closing local ports after iptables-restore failure")
	utilproxy.RevertPorts(replacementPortsMap, proxier.portsMap)
	return
}
```

Look at line 1607. So adjusting the order in which we open the port seems to make little difference here.

If I understood it wrong, please point it out @aojea :)

@danwinship (Contributor)

We know that apps explicitly using ports in this range are rare. So is this really even worth doing?

Were nodeports always limited to a range? Or did that come later and so the port opener used to make more sense?

I was going to make the same argument for hostPort - and maybe I will.

Do we limit hostports to a specific range? We need to prevent people from using hostports to snipe nodeports (and system services) (and hostnetwork pods).

At some point you have to ask who is in charge. In a kube cluster, I think the kube API is in charge.

So that's interesting. In OpenShift 4.x, we have pretty much declared that you are not allowed to run your own non-containerized software on an OpenShift node, but I had assumed that people didn't feel that way about kubernetes clusters in general. Eg, kubernetes explicitly takes into account the fact that another piece of software on the system may accidentally delete its iptables rules out from under it; and there was some discussion in #92938 about whether it's reasonable for kube-proxy to change the value of host-wide kernel networking sysctls.

If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary.

(And maybe there could be some way to explicitly indicate to kubernetes that, eg, you're using port 22 for something, so stay away?)

@aojea (Member, Author) commented Jan 28, 2021

If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary.

but why hold the port open? do we want to prevent people from accidentally opening services behind forwarded ports?

@JornShen (Member)

If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary.

but why hold the port open? do we want to prevent people from accidentally opening services behind forwarded ports?

@aojea

look at the code comments:

```go
// If the "external" IP happens to be an IP that is local to this
// machine, hold the local port open so no other process can open it
// (because the socket might open but it would never work).
if (svcInfo.Protocol() != v1.ProtocolSCTP) && localAddrSet.Has(net.ParseIP(externalIP)) {
	lp := utilproxy.LocalPort{
		Description: "externalIP for " + svcNameString,
		IP:          externalIP,
		Port:        svcInfo.Port(),
		Protocol:    protocol,
	}
```

I think opening the port is a protective measure.

@aojea (Member, Author) commented Jan 28, 2021

I think opening the port is a protective measure.

yeah, you are right, but that is for ExternalIPs, which also have their own CVE 😄.

What about making the port holding configurable, something like SELinux (rough sketch after the list):

  • Enforce: keep the socket open (don't install the rule if it can't be opened)
  • Permissive: try to open a socket to check whether something is there and log it (current behavior), or just log and don't hold the socket open (I prefer this one and would set it as the default)
  • Disabled: don't try to hold the port at all
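
A rough sketch of what such a knob could look like (purely hypothetical names; none of this exists in kube-proxy):

```go
// Hypothetical sketch of the proposed SELinux-style knob; these names do
// not exist in kube-proxy and only illustrate the three modes above.
type PortHoldingMode string

const (
	// Enforce: keep the socket open; skip the rule if the bind fails.
	PortHoldingEnforce PortHoldingMode = "Enforce"
	// Permissive: probe the port, log a conflict, install the rule anyway.
	PortHoldingPermissive PortHoldingMode = "Permissive"
	// Disabled: never try to bind the port.
	PortHoldingDisabled PortHoldingMode = "Disabled"
)
```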

@aojea (Member, Author) commented Feb 1, 2021

ok, my hunch was right but I was wrong about the reason: it is not the socket thing, it is the conntrack cleaning, which has to be done after the rules are programmed. Otherwise it can happen that, in the window between cleaning the conntrack entries for the UDP NodePort and programming the rules, a new conntrack entry is inserted.
This entry goes stale and keeps being renewed by new connections, hence the test always fails to reach the first pod, because the traffic never gets forwarded by the iptables rules.
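
For illustration, a client loop like the following would keep renewing such an entry (addresses taken from the conntrack output above; a hypothetical sketch, not the actual e2e test):

```go
// Hypothetical reproducer: keep sending UDP datagrams to the NodePort.
// If the first datagram lands before the DNAT rules are programmed, the
// kernel records an [UNREPLIED] entry, and every later datagram on the
// same 5-tuple refreshes that stale entry instead of hitting the rules.
package main

import (
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "10.0.32.3:30801") // node IP and NodePort from above
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	for {
		conn.Write([]byte("hostname")) // same 5-tuple each time, renewing the entry
		time.Sleep(time.Second)
	}
}
```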

I just need to confirm that this actually fixes the problem, but we have a way to reproduce it now, so I will test it tomorrow

This PR moves the conntrack cleaning code to after the iptables rules are programmed, so it should solve the issue too.

cc: @smarterclayton

Clear conntrack entries for UDP NodePorts;
this has to be done AFTER the iptables rules are programmed.
It can happen that traffic to the NodePort hits the host before
the iptables rules are programmed; this will create a stale entry
in conntrack that will blackhole the traffic, so we need to
clear it ONLY when the service has endpoints.
@danwinship (Contributor)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2021
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@aojea (Member, Author) commented Feb 10, 2021

/retest

@aojea (Member, Author) commented Feb 10, 2021

/retest
[sig-storage] PersistentVolumes-local [Volume type: dir-link

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@aojea (Member, Author) commented Feb 11, 2021

/retest

Many flakes lately ....

@k8s-ci-robot k8s-ci-robot merged commit 659b4dc into kubernetes:master Feb 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 11, 2021
@wojtek-t (Member)

@aojea - great finding - thanks a lot for debugging and fixing this!

@ialidzhikov (Contributor)

@aojea, does it make sense to have cherry-picks for this PR?

@aojea (Member, Author) commented Feb 11, 2021

@aojea, does it make sense to have cherry-picks for this PR?

yes it does, are you volunteering :) ?
Otherwise I will try to do it next week

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 11, 2021
@aojea (Member, Author) commented Feb 12, 2021

@ialidzhikov FYI #99017

@ialidzhikov (Contributor) commented Feb 12, 2021

@aojea, does it make sense to have cherry-picks for this PR?

yes it does, are you volunteering :) ?
Otherwise I will try to do it next week

Thanks for asking @aojea, but I don't think I have the required expertise in this area. I just wanted to make sure that we don't miss backporting a PR that deserves a backport. :)
Thank you for taking care!
