Merge pull request #5474 from hasheddan/fff

Add Flake Finder Fridays assets
This commit is contained in:
Kubernetes Prow Robot 2021-02-06 15:13:12 -08:00 committed by GitHub
commit 1bbde30312
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 147 additions and 0 deletions

View File

@ -0,0 +1,5 @@
# Flake Finder Fridays
Flake Finder Fridays is a monthly livestream show where we explore recent test
failures and flakes, walk through how they were resolved, and share tips and
tricks for troubleshooting Kubernetes CI issues.

View File

@ -0,0 +1,142 @@
# Flake Finder Fridays #0
February 5th 2021 ([Recording](https://youtu.be/Hqlm2h2AEvA))
Hosts: [Dan Mangum](https://github.com/hasheddan), [Rob
Kielty](https://github.com/RobertKielty)
## Introduction
This is the first episode of Flake Finder Fridays with Dan Mangum and Rob
Kielty.
On the first friday of every month we will go through an issue that was logged
for a failing or flaking test on the Kubernetes project.
We will review the triage, root cause analysis, and problem resolution for a
test related issue logged in the past four weeks.
We intend to demo how CI works on the Kubernetes project and also how we
collaborate across teams to resolve test maintenance issues.
## Issue This is the issue that we are going to look at today ...
[[Failing Test] ci-kubernetes-build-canary does not understand
"--platform"](https://github.com/kubernetes/kubernetes/issues/98646)
### Testgrid Dashboard
[build-master-canary](https://testgrid.k8s.io/sig-release-master-informing#build-master-canary)
### Breaking PRs
- [Use buildx in favor of `FROM --platform` syntax
](https://github.com/kubernetes/kubernetes/pull/98529)
- [Switch to `docker buildx` for conformance
image](https://github.com/kubernetes/kubernetes/pull/98569)
## Investigation
1. Desire to move from Google-owned infrastructure to Kubernetes community
infrastructure. Thus the introduction of a **canary** build job to test pushing
building and pushing artifacts with new infrastructure.
1. Desire to move off of `bootstrap.py` job (currently being used for canary
job) to `krel` tooling.
1. Separate job existed (`ci-kubernetes-build-no-bootstrap`) that was doing the
same thing as the canary job, but with `krel` tooling.
1. The `no-bootstrap` job was running smoothly, so [updated to use it for the
canary job](https://github.com/kubernetes/test-infra/pull/20663).
1. Right before the update, we [switched to using buildx for multi-arch
images](https://github.com/kubernetes/kubernetes/pull/98529).
1. Job started failing, which showed up in [some interesting
ways](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700).
1. Triage begins! Issue
[opened](https://github.com/kubernetes/kubernetes/issues/98646) and release
management team is pinged in Slack.
1. The `build-master`
[job](https://testgrid.k8s.io/sig-release-master-blocking#build-master) was
still passing though... interesting.
1. Both are eventually calling `make release`, so environment must be different.
1. Let's look inside!
``` docker run -it --entrypoint /bin/bash
gcr.io/k8s-testimages/bootstrap:v20210130-12516b2 ```
``` docker run -it
gcr.io/k8s-staging-releng/k8s-ci-builder:v20201128-v0.6.0-6-g6313f696-default
/bin/bash ```
1. A few directions we could go here:
1. Update the `k8s-ci-builder` image to you use newer version of Docker
1. Update the `k8s-ci-builder` image to ensure that
`DOCKER_CLI_EXPERIMENTAL=enabled` is set
1. Update the `release.sh` script to set `DOCKER_CLI_EXPERIMENTAL=enabled`
1. Making the `release.sh` script more flexible serves the community better
because it allows for building with more environments. Would also be good to
update the `k8s-ci-builder` image for this specific case as well.
1. And we get a new
[failure](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-build-canary/1356704759045689344/build-log.txt)!
1. Let's see what is going on in those images again...
1. Why would this cause an error in one but not the other if we have
`DOCKER_CLI_EXPERIMENTAL=enabled`?
([this](https://github.com/docker/buildx/pull/403) is why)
1. In the mean time we went ahead and [re-enabled the bootstrap
job](https://github.com/kubernetes/test-infra/pull/20712) (consumers of those
images need them!)
1. Decided to [increase logging
verbosity](https://github.com/kubernetes/kubernetes/pull/98568) on failures to
see if that would give us a clue into what was going wrong (and to remove those
annoying `quiet currently not implemented` warnings).
1. Job turns green! But how?
1. [Buildx](https://github.com/docker/buildx) is versioned separately than
Docker itself. Turns out that the `--quiet` flag warning was [actually an
error](https://github.com/docker/buildx/pull/403) until `v0.5.1` of Buildx.
1. The `build-master` job was running with buildx `v0.5.1` while the `krel` job
was running with `v0.4.2`. This meant the quiet flag was causing an error in the
`krel` job, and removing it alleviated the error.
1. Finished up by once again [removing the `bootstrap`
job](https://github.com/kubernetes/test-infra/pull/20731).
### Fixes
- [Set DOCKER_CLI_EXPERIMENTAL=enabled for images using
buildx](https://github.com/kubernetes/kubernetes/pull/98672)
- [Make image build logs verbose if
necessary](https://github.com/kubernetes/kubernetes/pull/98568)
### Test Infra
- [ci-kubernetes-build-canary: Migrate from bootstrap to
krel](https://github.com/kubernetes/test-infra/pull/20663)
- [releng: Re-enable a bootstrap build job for K8s
Infra](https://github.com/kubernetes/test-infra/pull/20712)
- [Revert "releng: Re-enable a bootstrap build job for K8s
Infra"](https://github.com/kubernetes/test-infra/pull/20731)
### Slack Threads
- [kubeadm failing with
ci/latest](https://kubernetes.slack.com/archives/C09QZ4DQB/p1612269558032700)
### Helpful Links
- [Docker Buildx
Documentation](https://docs.docker.com/buildx/working-with-buildx/)
- [What is Docker Buildkit and What can I use it
for?](https://brianchristner.io/what-is-docker-buildkit/)
- [Buildx --quiet error](https://github.com/docker/buildx/pull/403)
## Kubernetes Project Resources
Brand new to the project?
- Start here: https://www.kubernetes.dev/
Setup already and interested in maintaining tests?
- Check out [this video](https://www.youtube.com/watch?v=Ewp8LNY_qTg) from
Jordan Liggit who describes strategies and tactics to deflake flaking tests
([Jordan's show notes for that
talk](https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0))
Here's how the CI Signal Team actively monitors CI during a release cycle:
- [A Tour of CI on the Kubernetes
Project](https://www.youtube.com/watch?v=bttEcArAjUw)
- [Show notes](bit.ly/k8s-ci)