
Increase timeout for pod lifecycle test to reach pod status=ready #96691

Merged
1 commit merged into kubernetes:master on Jan 26, 2021

Conversation

@hh (Member) commented Nov 18, 2020

We had a single flake in a couple of weeks. It seems the Pod reached PodScheduled but just needed a bit more time to reach the Running state.

[It] should run through the lifecycle of Pods and PodStatus
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:887
STEP: creating a Pod with a static label
STEP: watching for Pod to be ready
Nov 18 13:28:04.312: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions []
Nov 18 13:28:04.360: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-18 13:28:04 +0000 UTC  }]
Nov 18 13:29:04.265: FAIL: failed to see Pod pod-test in namespace pods-5897 running
Unexpected error:
    <*errors.errorString | 0xc0002781f0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
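For context on the failing step: the test watches the Pod until it reaches phase Running and fails when the timeout fires first. Below is a minimal sketch of that pattern with client-go; it assumes a clientset and context and is an illustration, not the exact code in test/e2e/common/pods.go.

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodRunning is a hedged sketch of the watch-until-Running pattern
// the failing step performs; not the exact e2e test code.
func waitForPodRunning(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	w, err := c.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before pod %s/%s was running", ns, name)
			}
			// The log above records exactly these observations: Pending with
			// no conditions, then Pending with PodScheduled=True.
			if pod, ok := ev.Object.(*v1.Pod); ok && pod.Status.Phase == v1.PodRunning {
				return nil
			}
		case <-deadline:
			// With a one-minute timeout this is the "timed out waiting for
			// the condition" failure seen at 13:29:04 above.
			return fmt.Errorf("timed out waiting for pod %s/%s to be running", ns, name)
		}
	}
}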

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2020
@k8s-ci-robot (Contributor) commented:

@hh: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 18, 2020
@hh (Member Author) commented Nov 18, 2020

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 18, 2020
@hh (Member Author) commented Nov 18, 2020

/sig architecture
/area conformance

@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. area/conformance Issues or PRs related to kubernetes conformance tests and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2020
@hh (Member Author) commented Nov 18, 2020

/kind flake

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. kind/flake Categorizes issue or PR as related to a flaky test. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 18, 2020
@heyste (Member) commented Nov 18, 2020

/test pull-kubernetes-e2e-gce-100-performance
flake

error during /workspace/log-dump.sh /workspace/_artifacts gs://kubernetes-jenkins/pr-logs/pull/96691/pull-kubernetes-e2e-gce-100-performance/1329165785196662784/artifacts: exit status 1

@andrewsykim (Member) commented:

ref #96565

@@ -59,7 +59,7 @@ const (
	maxBackOffTolerance = time.Duration(1.3 * float64(kubelet.MaxContainerBackOff))
	podRetryPeriod      = 1 * time.Second
	podRetryTimeout     = 1 * time.Minute
	podReadyTimeout     = 1 * time.Minute
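As a hedged illustration of how a retry period and ready timeout like the constants above are commonly wired into apimachinery's polling helpers (the condition body is illustrative, not the exact check in pods.go):

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const (
	podRetryPeriod  = 1 * time.Second // from the hunk above
	podReadyTimeout = 1 * time.Minute // the value this PR increases
)

// waitForPodReady polls until the Pod reports Ready or podReadyTimeout
// elapses. A sketch only, under the constants quoted in the hunk above.
func waitForPodReady(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return wait.PollImmediate(podRetryPeriod, podReadyTimeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			// Ready implies Running plus passing readiness probes, so a pod
			// stuck in Pending (as in the flake) never satisfies this.
			if cond.Type == v1.PodReady && cond.Status == v1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}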
Member commented:

I think it's worthwhile figuring out why this started timing out.

Member Author replied:

I've only seen this once. We can see the Pod was scheduled quite quickly but never progressed to Running. It may have been a slow pull, but it's not clear we can see that in the logs.

Nov 18 13:28:04.312: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions []
Nov 18 13:28:04.360: INFO: observed Pod pod-test in namespace pods-5897 in phase Pending conditions
[{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-11-18 13:28:04 +0000 UTC }]
Nov 18 13:29:04.265: FAIL: failed to see Pod pod-test in namespace pods-5897 running
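One way a slow pull can be surfaced after the fact is through the Events API, since the kubelet records Pulling and Pulled events with timestamps. A hedged sketch, reusing the namespace and pod name from the log above and assuming the cluster is still reachable:

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpPodEvents lists events for the pod to look for slow Pulling/Pulled
// kubelet events. Illustrative only; CI artifacts may not retain these.
func dumpPodEvents(ctx context.Context, c kubernetes.Interface) error {
	events, err := c.CoreV1().Events("pods-5897").List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=pod-test",
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		// Reason is e.g. Scheduled, Pulling, Pulled, Created, Started; the
		// gap between Pulling and Pulled exposes a slow image pull.
		fmt.Printf("%v  %s  %s\n", e.LastTimestamp, e.Reason, e.Message)
	}
	return nil
}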

Member replied:

Checking the graph metrics for this e2e test, the run that flaked looks to be due to fluctuations in the cluster environment; three other tests also failed in the same run.

The current timeout is a lot more optimistic than the values set in wait.go, and waiting a bit longer will help the test ride out fluctuations in the cluster's state.
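For a sense of the gap described above, a side-by-side of the two figures that appear in this thread (the names are illustrative; the five-minute value comes from the watch snippet quoted below, not from wait.go):

import "time"

// Figures taken from this thread only; these names are not claimed to
// match wait.go.
const (
	podReadyTimeout  = 1 * time.Minute // per-test timeout before this PR
	watchEventWindow = 5 * time.Minute // timeout in the watch snippet quoted below
)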

Member replied:

Interestingly, the place where this test is failing is a watch timeout:

select {
case events, ok = <-ch:
	if !ok {
		continue
	}
	if len(events) < 2 {
		framework.Fail("only got a single event")
	}
case <-time.After(5 * time.Minute):
	framework.Failf("timed out waiting for watch events for %s", pod.Name)
}

And that has a timeout of 5 minutes.
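For context, a hedged sketch of how a channel like ch, carrying growing batches of watch events to satisfy the len(events) check, might be produced with client-go. The real test uses the e2e framework's watch helpers, so the batching below is illustrative only:

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPodEvents returns a channel delivering the growing batch of events
// observed for the pod, matching the `events, ok = <-ch` shape in the
// quoted select. Not the exact e2e helper.
func watchPodEvents(ctx context.Context, c kubernetes.Interface, pod *v1.Pod) (<-chan []watch.Event, error) {
	w, err := c.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + pod.Name,
	})
	if err != nil {
		return nil, err
	}
	ch := make(chan []watch.Event)
	go func() {
		defer close(ch)
		var batch []watch.Event
		for ev := range w.ResultChan() {
			batch = append(batch, ev)
			ch <- batch // consumer checks len(events) and times out after 5 minutes
		}
	}()
	return ch, nil
}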

Member replied:

Sorry, realizing now that this PR and #96565 actually refer to two different tests, but the root cause seems likely similar.

@heyste (Member) commented Nov 18, 2020

/test pull-kubernetes-e2e-gce-100-performance
flake

e2e.go: DumpClusterLogs
error during /workspace/log-dump.sh /workspace/_artifacts gs://kubernetes-jenkins/pr-logs/pull/96691/pull-kubernetes-e2e-gce-100-performance/1329178829016535040/artifacts: exit status 1
...
 W1118 22:33:54.281] ERROR: (gcloud.logging.read) INTERNAL: Internal error encountered. 

@harche (Contributor) commented Nov 19, 2020

/test pull-kubernetes-node-crio-e2e

@derekwaynecarr (Member) commented:

The timeout change is fine with me; we can always come back and tighten it if total execution time starts to grow.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2021
@derekwaynecarr (Member) commented:

/assign

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, hh

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 26, 2021
@k8s-ci-robot k8s-ci-robot merged commit a107769 into kubernetes:master Jan 26, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 26, 2021