
Increase timeout for pod lifecycle test to reach pod status=ready #96691

Merged
1 commit merged into kubernetes:master on Jan 26, 2021

Conversation

@hh (Member) commented Nov 18, 2020

We had a single flake in a couple of weeks. It seems the Pod reached PodScheduled but just needed a bit more time to reach the Running state.

[It] should run through the lifecycle of Pods and PodStatus
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:887
STEP: creating a Pod with a static label
STEP: watching for Pod to be ready
Nov 18 13:28:04.312: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions []
Nov 18 13:28:04.360: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-18 13:28:04 +0000 UTC  }]
Nov 18 13:29:04.265: FAIL: failed to see Pod pod-test in namespace pods-5897 running
Unexpected error:
    <*errors.errorString | 0xc0002781f0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
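For context on the failing step: the test watches the Pod until it reaches phase Running and fails when the timeout fires first. Below is a minimal sketch of that pattern with client-go; it assumes a clientset and context and is an illustration, not the exact code in test/e2e/common/pods.go.

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodRunning is a hedged sketch of the watch-until-Running pattern
// the failing step performs; not the exact e2e test code.
func waitForPodRunning(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	w, err := c.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before pod %s/%s was running", ns, name)
			}
			// The log above records exactly these observations: Pending with
			// no conditions, then Pending with PodScheduled=True.
			if pod, ok := ev.Object.(*v1.Pod); ok && pod.Status.Phase == v1.PodRunning {
				return nil
			}
		case <-deadline:
			// With a one-minute timeout this is the "timed out waiting for
			// the condition" failure seen at 13:29:04 above.
			return fmt.Errorf("timed out waiting for pod %s/%s to be running", ns, name)
		}
	}
}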

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2020
@k8s-ci-robot (Contributor) commented:

@hh: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2020
@hh (Member Author) commented Nov 18, 2020

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 18, 2020
@hh (Member Author) commented Nov 18, 2020

/sig architecture
/area conformance

@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. area/conformance Issues or PRs related to kubernetes conformance tests and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2020
@hh (Member Author) commented Nov 18, 2020

/kind flake

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. kind/flake Categorizes issue or PR as related to a flaky test. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 18, 2020
@heyste (Member) commented Nov 18, 2020

/test pull-kubernetes-e2e-gce-100-performance
flake

error during /workspace/log-dump.sh /workspace/_artifacts gs://kubernetes-jenkins/pr-logs/pull/96691/pull-kubernetes-e2e-gce-100-performance/1329165785196662784/artifacts: exit status 1

@andrewsykim (Member) commented:

ref #96565

@@ -59,7 +59,7 @@ const (
	maxBackOffTolerance = time.Duration(1.3 * float64(kubelet.MaxContainerBackOff))
	podRetryPeriod      = 1 * time.Second
	podRetryTimeout     = 1 * time.Minute
	podReadyTimeout     = 1 * time.Minute
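As a hedged illustration of how a retry period and ready timeout like the constants above are commonly wired into apimachinery's polling helpers (the condition body is illustrative, not the exact check in pods.go):

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const (
	podRetryPeriod  = 1 * time.Second // from the hunk above
	podReadyTimeout = 1 * time.Minute // the value this PR increases
)

// waitForPodReady polls until the Pod reports Ready or podReadyTimeout
// elapses. A sketch only, under the constants quoted in the hunk above.
func waitForPodReady(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return wait.PollImmediate(podRetryPeriod, podReadyTimeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			// Ready implies Running plus passing readiness probes, so a pod
			// stuck in Pending (as in the flake) never satisfies this.
			if cond.Type == v1.PodReady && cond.Status == v1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}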
Member commented:

I think it's worthwhile figuring out why this started timing out.

Member Author replied:

I've only seen this once. We can see the Pod was scheduled quite quickly but never progressed to Running. It may have been a slow pull, but it's not clear we can see that in the logs.

Nov 18 13:28:04.312: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions []
Nov 18 13:28:04.360: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions
[{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-18 13:28:04 +0000 UTC }]
Nov 18 13:29:04.265: FAIL: failed to see Pod pod-test in namespace pods-5897 running
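One way a slow pull can be surfaced after the fact is through the Events API, since the kubelet records Pulling and Pulled events with timestamps. A hedged sketch, reusing the namespace and pod name from the log above and assuming the cluster is still reachable:

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpPodEvents lists events for the pod to look for slow Pulling/Pulled
// kubelet events. Illustrative only; CI artifacts may not retain these.
func dumpPodEvents(ctx context.Context, c kubernetes.Interface) error {
	events, err := c.CoreV1().Events("pods-5897").List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=pod-test",
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		// Reason is e.g. Scheduled, Pulling, Pulled, Created, Started; the
		// gap between Pulling and Pulled exposes a slow image pull.
		fmt.Printf("%v  %s  %s\n", e.LastTimestamp, e.Reason, e.Message)
	}
	return nil
}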

Member replied:

Checking the graph metrics for this e2e test, the run that flaked looks to be due to fluctuations in the cluster environment; three other tests also failed in the same run.

The current timeout is a lot more optimistic than the values set in wait.go, and waiting a bit longer will help the test ride out fluctuations in the cluster's state.
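For a sense of the gap described above, a side-by-side of the two figures that appear in this thread (the names are illustrative; the five-minute value comes from the watch snippet quoted below, not from wait.go):

import "time"

// Figures taken from this thread only; these names are not claimed to
// match wait.go.
const (
	podReadyTimeout  = 1 * time.Minute // per-test timeout before this PR
	watchEventWindow = 5 * time.Minute // timeout in the watch snippet quoted below
)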

Member replied:

Interestingly, the place where this test is failing is a watch timeout:

select {
case events, ok = <-ch:
	if !ok {
		continue
	}
	if len(events) < 2 {
		framework.Fail("only got a single event")
	}
case <-time.After(5 * time.Minute):
	framework.Failf("timed out waiting for watch events for %s", pod.Name)
}

And that has a timeout of 5 minutes.
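For context, a hedged sketch of how a channel like ch, carrying growing batches of watch events to satisfy the len(events) check, might be produced with client-go. The real test uses the e2e framework's watch helpers, so the batching below is illustrative only:

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPodEvents returns a channel delivering the growing batch of events
// observed for the pod, matching the `events, ok = <-ch` shape in the
// quoted select. Not the exact e2e helper.
func watchPodEvents(ctx context.Context, c kubernetes.Interface, pod *v1.Pod) (<-chan []watch.Event, error) {
	w, err := c.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + pod.Name,
	})
	if err != nil {
		return nil, err
	}
	ch := make(chan []watch.Event)
	go func() {
		defer close(ch)
		var batch []watch.Event
		for ev := range w.ResultChan() {
			batch = append(batch, ev)
			ch <- batch // consumer checks len(events) and times out after 5 minutes
		}
	}()
	return ch, nil
}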

Member replied:

Sorry, realizing now that this PR and #96565 actually refer to two different tests, but the root cause seems likely similar.

@heyste (Member) commented Nov 18, 2020

/test pull-kubernetes-e2e-gce-100-performance
flake

e2e.go: DumpClusterLogs
error during /workspace/log-dump.sh /workspace/_artifacts gs://kubernetes-jenkins/pr-logs/pull/96691/pull-kubernetes-e2e-gce-100-performance/1329178829016535040/artifacts: exit status 1
...
 W1118 22:33:54.281] ERROR: (gcloud.logging.read) INTERNAL: Internal error encountered. 

@harche (Contributor) commented Nov 19, 2020

/test pull-kubernetes-node-crio-e2e

@derekwaynecarr (Member) commented:

The timeout change is fine with me; we can always come back and tighten it if total execution time starts to grow.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2021
@derekwaynecarr (Member) commented:

/assign

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, hh

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2021
@k8s-ci-robot k8s-ci-robot merged commit a107769 into kubernetes:master Jan 26, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 26, 2021