
Don't spawn a goroutine for every event recording #95664

Merged

Conversation

@DirectXMan12 (Contributor) commented Oct 16, 2020

This changes the event recorder to use the equivalent of a select
statement instead of a goroutine to record events.

Previously, we used a goroutine to make event recording non-blocking.
Unfortunately, this writes to a channel, and during shutdown we then
race to write to a closed channel, panicking (caught by the error
handler, but still) and making the race detector unhappy.

Instead, we now use the select statement to make event emitting
non-blocking, and if we'd block, we just drop the event. We already
drop events if a particular sink is overloaded, so this just moves the
incoming event queue to match that behavior (and makes the incoming
event queue much longer).

This means that, if the user uses Eventf and friends correctly (i.e.
ensures they've returned by the time they call Shutdown), it's
now safe to call Shutdown. This matches the conventional Go guidance on
channels: the writing goroutine should call close.
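
For readers skimming the PR, the core of the change is the classic non-blocking send: a select with a default case. A minimal, self-contained sketch of the pattern (the Event type and names below are illustrative, not the actual client-go code):

package main

import "fmt"

type Event struct {
	Type   string
	Object interface{}
}

// actionOrDrop makes the send non-blocking: if the buffered channel is
// full, the event is dropped and false is returned, instead of spawning
// a goroutine that might later write to a closed channel.
func actionOrDrop(incoming chan Event, ev Event) bool {
	select {
	case incoming <- ev:
		return true
	default:
		return false
	}
}

func main() {
	incoming := make(chan Event, 1)
	fmt.Println(actionOrDrop(incoming, Event{"Added", "pod-a"})) // true
	fmt.Println(actionOrDrop(incoming, Event{"Added", "pod-b"})) // false: buffer full, event dropped
}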

Fixes #94906

/kind bug
/sig api-machinery

Ensure that client-go's EventBroadcaster is safe (non-racy) during shutdown.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2020
@DirectXMan12 (Contributor Author)

Additional points:

I introduced a new method with a longer queue size in an attempt to be minimally disruptive to the other, much more important use of this code: watch broadcasting.

The longer queue size is to avoid dropping events on the floor when possible, and brings the incoming queue in line with the outgoing queues, which already had the "drop events on the floor if needed" behavior.
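
For illustration, this is roughly how the new constructor and ActionOrDrop fit together, using the signatures shown in this PR's diff (later releases may differ; the call site below is a sketch, not code from the PR):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	// A 1000-entry incoming queue matches the outgoing watcher queues, and
	// DropIfChannelFull gives it the same drop-on-overload behavior the
	// outgoing queues already had.
	b := watch.NewLongQueueBroadcaster(1000, watch.DropIfChannelFull)
	defer b.Shutdown()

	// Non-blocking: returns false (instead of blocking or panicking)
	// when the incoming queue is full.
	if sent := b.ActionOrDrop(watch.Added, &corev1.Event{}); !sent {
		fmt.Println("incoming queue full; event dropped")
	}
}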

@fedebongio (Contributor)

/assign @yliaog
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 20, 2020
@@ -198,6 +214,18 @@ func (m *Broadcaster) Action(action EventType, obj runtime.Object) {
m.incoming <- Event{action, obj}
}

// Action distributes the given event among all watchers, or drops it on the floor
// if too many incoming actions are queue up. Returns true if the action was sent,
Contributor

s/queue/queued/

func NewLongQueueBroadcaster(queueLength int, fullChannelBehavior FullChannelBehavior) *Broadcaster {
m := &Broadcaster{
watchers: map[int64]*broadcasterWatcher{},
incoming: make(chan Event, queueLength),
Contributor

So effectively the incoming queue length is increased from 25 to 1000. Any idea why it was only 25 in the first place? The comment below indicates overflow should rarely happen; did you see events dropped due to the incoming queue length?

// Buffer the incoming queue a little bit even though it should rarely ever accumulate
// anything, just in case a few events are received in such a short window that
// Broadcaster can't move them onto the watchers' queues fast enough.

Contributor Author

Since we're no longer blocking, I figured I'd extend it to match the outgoing queues, which already have a "drop on full" behavior. As much as possible, I wanted to avoid "breaking" something in k/k itself, so I figured matching the outgoing queues was a safe bet.

// NOTE: events should be a non-blocking operation, but we also need to not
// put this in a goroutine, otherwise we'll race to write to a closed channel
// when we go to shut down this broadcaster. Just drop events if we get overloaded,
// and log an error us if that happens (we've configured the broadcaster to drop
Contributor

s/log an error us if/log an error if/

@@ -101,6 +102,29 @@ func OnPatchFactory(testCache map[string]*v1.Event, patchEvent chan<- *v1.Event)
}
}

func TestNonRacyShutdown(t *testing.T) {
// Attempt to simulate previously racy conditions, and ensure that no race
Contributor

The test by itself does not assert that there's no race condition; it relies on being run with the race detector enabled.

Contributor Author

IIRC, that's not true -- the "race" is pretty easily detected as a write to a closed channel, which works even without the race detector (it just panics).
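
For anyone following along, the panic is easy to reproduce without -race, since sending on a closed channel is a runtime panic in Go:

package main

func main() {
	ch := make(chan int, 1)
	close(ch)
	// Panics with "send on closed channel" in any build;
	// no race detector required.
	ch <- 1
}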

// and log an error us if that happens (we've configured the broadcaster to drop
// outgoing events anyway).
if sent := recorder.ActionOrDrop(watch.Added, event); !sent {
klog.Errorf("unable to record event: too many queued events, dropped event %#v", event)
Contributor

Is dropping an event an error? Could it spam the logs when too many events are dropped?

Contributor Author

IIRC, we log when we drop from the other end. It's unlikely that this happens; you probably want to know when it does, and we log events anyway.

@yliaog (Contributor) commented Nov 14, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2020
@DirectXMan12 (Contributor Author)

/retest

@DirectXMan12 (Contributor Author)

/assign @caesarxuchao

(for approval)

@k8s-ci-robot k8s-ci-robot assigned caesarxuchao and unassigned deads2k Jan 15, 2021
@caesarxuchao (Member)

LGTM. Can you fix the typos Yu pointed out?

This changes the event recorder to use the equivalent of a select
statement instead of a goroutine to record events.

Previously, we used a goroutine to make event recording non-blocking.
Unfortunately, this writes to a channel, and during shutdown we then
race to write to a closed channel, panicking (caught by the error
handler, but still) and making the race detector unhappy.

Instead, we now use the select statement to make event emitting
non-blocking, and if we'd block, we just drop the event.  We already
drop events if a particular sink is overloaded, so this just moves the
incoming event queue to match that behavior (and makes the incoming
event queue much longer).

This means that, if the user uses `Eventf` and friends correctly (i.e.
ensures they've returned by the time we call `Shutdown`), it's
now safe to call Shutdown.  This matches the conventional Go guidance on
channels: the writer should call close.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2021
@yliaog (Contributor) commented Jan 15, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2021
@caesarxuchao (Member)

/lgtm

@DirectXMan12 (Contributor Author)

@caesarxuchao still needs the approved label, looks like

@caesarxuchao (Member)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, DirectXMan12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 20, 2021
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@k8s-ci-robot k8s-ci-robot merged commit 9d99dbc into kubernetes:master Jan 21, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 21, 2021
@vincepri (Member)

Can this change be backported to 1.20?

Successfully merging this pull request may close these issues.

It seems to be impossible to safely stop an Event Recorder without invoking a race condition