kube-proxy has to clear NodePort stale UDP entries #98305
Conversation
@aojea: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/assign @thockin @danwinship
Does ipvs mode also need to be modified? @aojea
/hold
OK, let's have a real chat about this. First, I thought that we would not add the NodePort rule if we couldn't open the local port, but that just doesn't seem to be true. What I am left wondering is: why do we even bother with this? If we can't open it, we just log it and move on. In that case, the very thing we are trying to prevent (colliding with an app in the root netns) is not prevented. We know that apps explicitly using ports in this range are rare. So is this really even worth doing? What if we just took it all out? Does anyone think this is really worth the effort any more? At the time I thought it was very clever, but I no longer believe it's worth its own weight.
@aojea - what do you think? Leave the conntrack flush, I think, but all the rest could go, including e2e.
I was going to make the same argument for hostPort - and maybe I will. Kubelet already doesn't re-open those if it restarts, so it's at best "sloppy".
the conntrack flush has to be there; without it the test for UDP NodePorts always fails, at least that is what we observed in #98130 (comment)
indeed, this surprised me too, but thinking more about it, you can't know in the apiserver that the port is free on the nodes, so it would be weird for a user to be able to create a NodePort and then have kube-proxy refuse to create it.
I had this discussion with @danwinship, and the conclusion I took from that conversation is that this is needed for security, but iptables rules take precedence over the application using the port, so the traffic will be forwarded to the NodePort/HostPort and the application will be "hijacked" by the Service.
I think that this is the security problem that the port binding tries to mitigate, but as you say that is not true, because we only log and install the rules anyway ... but I have the feeling that I'm missing something, maybe Dan can explain more.
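(For reference, the "port opener" under discussion does roughly the following. This is a minimal Go sketch, not the real proxier code; the port number is only an example.)

```go
package main

import (
	"log"
	"net"
)

// Hedged sketch of the NodePort "port opener": bind the port in the root
// netns and keep the socket open so nothing else can accidentally claim it.
// If the bind fails, kube-proxy today only logs and installs the rules anyway.
func main() {
	const nodePort = ":30080" // example NodePort, not taken from the issue

	held, err := net.ListenPacket("udp", nodePort)
	if err != nil {
		// This is the branch being debated: the rule is installed regardless,
		// so the best kube-proxy can do here is log (or emit an event).
		log.Printf("could not hold local port %s: %v", nodePort, err)
		return
	}
	defer held.Close() // in the proxier this stays open while the rule exists

	// While the socket is held, any other process (or a second bind here)
	// that tries to use the port gets an "address already in use" error.
	if _, err := net.ListenPacket("udp", nodePort); err != nil {
		log.Printf("port is protected: %v", err)
	}
}
```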
you can't know in the apiserver that the port is free on the nodes, so it would be weird for a user to be able to create a NodePort and then have kube-proxy refuse to create it
Exactly. At some point you have to ask who is in charge. In a kube cluster, I think the kube API is in charge. So really, the best we can do here is warn. Maybe we could warn more loudly (events?) but I don't think we're willing to NOT claim a nodeport if a service says to.
The initial objective was security (ish). "If there's something unexpected on that port, don't hijack it". Then it became "...and prevent anyone from trying to use it in the future". The assumption there is that we won't hijack it. But we do. And we know nobody really looks at the logs. So all this logic is effectively pointless (IMO). If the goal was simply to log something if a port is already held, all we'd really need to do is try to claim it, log if error, else close it.
If there was a simpler way to hold ports open, I might be more supportive of this, but really NodePorts in the 30K range seem like a user error. Pod hostports are the bigger liability and frankly, are almost as bad a feature as externalIPs.
So would it be net simpler, and just as effective, to do less work here?
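(A sketch of that reduced approach, under the assumption that the port check is purely advisory; the helper name and the port are made up for illustration.)

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// warnIfNodePortInUse probes the port once: try to claim it, log on error,
// otherwise close it straight away. Nothing is held open afterwards.
func warnIfNodePortInUse(port int) {
	l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
	if err != nil {
		log.Printf("node port %d is already in use; its traffic will still be redirected: %v", port, err)
		return
	}
	l.Close()
}

func main() {
	warnIfNodePortInUse(30080) // example port
}
```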
I agree with @aojea. I think we can emit an event when opening the local port fails. @thockin kubernetes/pkg/proxy/iptables/proxier.go Lines 1113 to 1127 in fb276ce
And this does not seem right: kube-proxy will re-close it. kubernetes/pkg/proxy/iptables/proxier.go Lines 1600 to 1609 in fb276ce
Look at line 1607. So adjusting the order in which we open the port doesn't seem to make much difference. If I've misunderstood, please point it out, @aojea :)
Were nodeports always limited to a range? Or did that come later and so the port opener used to make more sense?
Do we limit hostports to a specific range? We need to prevent people from using hostports to snipe nodeports (and system services) (and hostnetwork pods).
So that's interesting. In OpenShift 4.x, we have pretty much declared that you are not allowed to run your own non-containerized software on an OpenShift node, but I had assumed that people didn't feel that way about kubernetes clusters in general. E.g., kubernetes explicitly takes into account the fact that another piece of software on the system may accidentally delete its iptables rules out from under it; and there was some discussion in #92938 about whether it's reasonable for kube-proxy to change the value of host-wide kernel networking sysctls. If the correct response to "I installed software behind kubernetes's back and then kubernetes stole its port away from it" is "Don't do that then!" then yeah, the port opener seems a lot less necessary. (And maybe there could be some way to explicitly indicate to kubernetes that, e.g., you're using port 22 for something, so stay away?)
But why hold the port open? Do we want to prevent people from accidentally opening services behind forwarded ports?
Look at the notes in kubernetes/pkg/proxy/iptables/proxier.go Lines 1099 to 1108 in b3dd01d
I think opening the port is a protective measure.
Yeah, you are right, but that is for ExternalIPs, which also have their own CVE 😄. What about making the port holding configurable, something like SELinux:
OK, my hunch was right but I was wrong about the reason. It is not the socket thing; it is that the conntrack cleaning has to be done after the rules are programmed, otherwise it can happen that a conntrack entry is inserted in the window between clearing conntrack for the UDP NodePort and programming the rules. I just need to confirm that this actually fixes the problem, but we have a way to reproduce it now, so I will test it tomorrow. This PR moves the code that cleans conntrack to after the iptables rules are programmed, so it should solve the issue too. cc: @smarterclayton
Clear conntrack entries for UDP NodePorts; this has to be done AFTER the iptables rules are programmed. It can happen that traffic to the NodePort hits the host before the iptables rules are programmed; this will create a stale entry in conntrack that will blackhole the traffic, so we need to clear it ONLY when the service has endpoints.
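(A minimal sketch of the ordering the fix enforces. This is not the actual proxier sync loop; programIptablesRules is a placeholder, and the sketch shells out to the conntrack CLI, which is roughly what kube-proxy does for UDP cleanup.)

```go
package main

import (
	"log"
	"os/exec"
	"strconv"
)

// clearUDPNodePortConntrack deletes conntrack entries whose original
// destination port matches the given UDP NodePort.
func clearUDPNodePortConntrack(port int) {
	out, err := exec.Command("conntrack", "-D", "-p", "udp", "--dport", strconv.Itoa(port)).CombinedOutput()
	if err != nil {
		// conntrack exits non-zero when nothing matched; just log it.
		log.Printf("conntrack cleanup for port %d: %v (%s)", port, err, out)
	}
}

func main() {
	// Hypothetical list: UDP NodePorts whose Services now have endpoints.
	staleUDPNodePorts := []int{30080}

	// 1. Program/refresh the iptables rules first (placeholder).
	programIptablesRules()

	// 2. Only after the rules are in place, flush the stale UDP entries.
	//    Doing it earlier leaves a window in which a new blackhole entry
	//    can be created before the rules exist.
	for _, p := range staleUDPNodePorts {
		clearUDPNodePortConntrack(p)
	}
}

func programIptablesRules() {} // stands in for the real iptables sync
```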
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, danwinship.
/retest
/retest
/retest
/retest
/retest
/retest
Many flakes lately ...
@aojea - great finding - thanks a lot for debugging and fixing this!
@aojea, does it make sense to have cherry-picks for this PR?
yes it does, are you volunteering? :)
@ialidzhikov FYI #99017
Thanks for asking @aojea, but I don't think I have the required expertise in this area. I just wanted to make sure that we don't miss backporting a PR that deserves a backport. :)
…upstream-release-1.20 Automated cherry pick of #98305: kube-proxy: clear conntrack entries after rules are in place
What type of PR is this?
/kind bug
/kind failing-test
/kind flake
What this PR does / why we need it:
kube-proxy needs to clean the stale UDP conntrack entries for NodePorts; however,
this was done before the iptables rules were programmed and when the NodePort service was added,
without taking into account whether endpoints were available.
The conntrack entries have to be cleared after the iptables rules are programmed and only when
the service has endpoints available.
Which issue(s) this PR fixes:
Fixes #91236
Special notes for your reviewer:
I've found out that the flake failed constantly if the conntrack entries (UDP) for the NodePort were never cleared on the opened socket
#98130 (comment)
We've modified the test to connect in parallel to the ClusterIP, so we can verify that the pods and the services are working correctly. The connection to the ClusterIP works, and its conntrack entry is as follows
However, the NodePort connection doesn't work, and its conntrack entry is the following:
The iptables entries are correctly installed, but the packet is not forwarded because it is hitting that stale conntrack entry.
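(The parallel check can be approximated as below. This is a hedged sketch, not the real e2e test; the addresses, ports, and the "hostname" payload are placeholders.)

```go
package main

import (
	"log"
	"net"
	"time"
)

// probeUDP sends one datagram and waits briefly for a reply. With a stale
// conntrack entry in place, the NodePort probe times out while the
// ClusterIP probe still answers.
func probeUDP(name, addr string) {
	conn, err := net.DialTimeout("udp", addr, 2*time.Second)
	if err != nil {
		log.Printf("%s (%s): dial failed: %v", name, addr, err)
		return
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(2 * time.Second))

	if _, err := conn.Write([]byte("hostname")); err != nil {
		log.Printf("%s (%s): write failed: %v", name, addr, err)
		return
	}
	buf := make([]byte, 1024)
	n, err := conn.Read(buf)
	if err != nil {
		log.Printf("%s (%s): no reply, possibly a stale conntrack entry: %v", name, addr, err)
		return
	}
	log.Printf("%s (%s): reply %q", name, addr, buf[:n])
}

func main() {
	done := make(chan struct{}, 2)
	go func() { probeUDP("ClusterIP", "10.96.0.10:80"); done <- struct{}{} }()     // placeholder ClusterIP
	go func() { probeUDP("NodePort", "192.168.1.10:30080"); done <- struct{}{} }() // placeholder node IP
	<-done
	<-done
}
```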
Does this PR introduce a user-facing change?: