From eeb0f7c7d158dc60e83f9e4b810a82e3c1803ff8 Mon Sep 17 00:00:00 2001
From: "Madhusudan.C.S"
Date: Thu, 1 Jun 2017 16:04:46 -0700
Subject: [PATCH 1/2] First draft of federation buildcop guide/playbook.

Still needs:
1. Responsibilities section.
2. More entries in the playbook.
---
 .../devel/on-call-federation-build-cop.md     | 159 ++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 contributors/devel/on-call-federation-build-cop.md

diff --git a/contributors/devel/on-call-federation-build-cop.md b/contributors/devel/on-call-federation-build-cop.md
new file mode 100644
index 000000000..22e447670
--- /dev/null
+++ b/contributors/devel/on-call-federation-build-cop.md
@@ -0,0 +1,159 @@
+# Federation Buildcop Guide and Playbook
+
+Federation runs two classes of tests: CI and Presubmits.
+
+## CI
+
+* These tests run on the HEADs of master and release branches.
+* As a result, they run on code that's already merged.
+* As the name suggests, they run continuously. Currently, they are
+  configured to run at least once every 30 minutes.
+* Federation CI tests still run on Jenkins and this mode of testing is
+  now deprecated.
+* CI jobs always run sequentially. In other words, no single CI job
+  can have two instances of the job running at the same time.
+
+### Configuration
+
+Configuration steps are described in https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jenkins/README.md#how-to-work-with-jenkins-jobs
+
+The configuration of CI tests are stored in:
+
+* Jenkins config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jenkins/job-configs/kubernetes-jenkins/bootstrap-ci.yaml
+* Test job/bootstrap config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jobs/config.json
+* Test grid config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/testgrid/config/config.yaml
+* Job specific config: https://github.com/kubernetes/test-infra/tree/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jobs
+
+### Results
+
+Results of all the federation CI tests, including the soak tests, are
+listed in the corresponding tabs on the Cluster Federation page in the
+testgrid.
+https://k8s-testgrid.appspot.com/cluster-federation
+
+### Playbook
+
+#### Triggering a new run
+
+Please ping someone who has access to the Jenkins UI/dashboard and ask
+them to login and click the "Build Now" link on the Jenkins page
+corresponding to the CI job you want to manually start.
+
+#### Quota cleanup
+
+Please ping someone who has access to the GCP project and ask them to
+look at the quotas and clean up the leaked resources.
+
+
+## Presubmit
+
+* We only have one presubmit test, but it is configured very
+  differently than the CI tests.
+* The presubmit test is currently configured to run on the master
+  branch and any release branch that's 1.7 or newer.
+* Federation presubmit infrastructure is composed of two separate test
+  jobs:
+  * Deploy job: This job runs in the background and recycles federated
+    clusters every time it runs. Although this job supports federation
+    presubmit tests, it is configured as a CI/Soak job. More on
+    configuration later. Since recycling federated clusters is an
+    expensive operation, we do not want to run this often. Hence, this
+    job is configured to run once every 24 hours, around midnight
+    pacific time.
+  * Test job: This is the job that runs federation presubmit tests on
+    every PR in the core repository, i.e.
+    [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes).
+    These jobs can run in parallel on the PRs in the repository.
+
+### Two-job setup
+
+The deploy job runs once 24-hours roughly at around midnight pacific
+time. It is configured to turn up and tear down 3 federated clusters.
+It starts out by downloading the latest Kubernetes release built from
+[kubernetes/kubernetes](https://github.com/kubernetes/kubernetes)
+master. It then tears down the existing federated clusters and turns
+up new ones. As the clusters are created, their kubeconfigs are
+written to a local kubeconfig file where the job runs. Once all the
+clusters are successfully turned up, the local kubeconfig is then
+copied to a pre-configured GCS bucket. Any existing kubeconfig in the
+bucket will be overwritten.
+
+The test job on the other hand starts by copying the latest kubeconfig
+from the pre-configured GCS bucket. It uses this kubeconfig to deploy
+a new federation control plane in the on one of the clusters in the
+kubeconfig and joins all the clusters, including the host cluster
+where federation control plane is deployed, as members to the newly
+created federation control plane. It then runs the federation
+presubmit tests on this control plane and tears down the control plane
+in the end.
+
+Since federated clusters are recycled only once every 24 hours, all
+presubmit runs in that period share the federated clusters. And since
+there could be multiple presubmit tests running in parallel, each
+instance of the test gets its own namespace where it deploys the
+federation control plane. These federation control planes deployed in
+separate namespaces are independent of each other and do not interfere
+with other federation control planes in anyway.
+
+### Configuration
+
+The two jobs are configured differently.
+
+#### Deploy job
+
+The deploy job is configured as a CI/Soak job in Jenkins.
+Configuration steps are described in https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jenkins/README.md#how-to-work-with-jenkins-jobs
+
+The configuration of the deploy job is stored in:
+
+* Jenkins config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jenkins/job-configs/kubernetes-jenkins/bootstrap-ci-soak.yaml#L76
+* Test job/bootstrap config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jobs/config.json#L3996
+* Test grid config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/testgrid/config/config.yaml#L152
+* Job specific config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jobs/ci-kubernetes-pull-gce-federation-deploy.env
+
+#### Test job
+
+The test job is configured in prow, but it runs in Jenkins mode. The
+configuration steps are described in https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/README.md#create-a-new-job
+
+The configuration of the test job is stored in:
+
+* Prow config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/prow/config.yaml#L244
+* Test job/bootstrap config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jobs/config.json#L4691
+* Job specific config: https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/jobs/pull-kubernetes-federation-e2e-gce.env
+
+### Results
+
+Aggregated results are available on the Gubernator dashboard page for
+the federation presubmit tests.
+
+https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-federation-e2e-gce
+
+### Metrics
+
+We track the flakiness metrics of all the presubmit jobs and
+individual tests that run against PRs in
+[kubernetes/kubernetes](https://github.com/kubernetes/kubernetes).
+
+* The metrics that we track are documented in https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/metrics/README.md#metrics.
+* Job-level metrics are available in - http://storage.googleapis.com/k8s-metrics/job-flakes-latest.json.
+* As of this writing, federation presubmits have a success rate of
+  93.4%
+
+### Playbook
+
+#### Triggering a new deploy job run
+
+Please ping someone who has access to the Jenkins UI/dashboard and ask
+them to login and click the "Build Now" link on the Jenkins page
+corresponding to the CI job you want to manually start.
+
+#### Triggering a new deploy job run
+
+Use the @k8s-bot on the PR to retrigger the test. The exact bot
+incantation is: `@k8s-bot pull-kubernetes-federation-e2e-gce test this`
+
+#### Quota cleanup
+
+Please ping someone who has access to the GCP project and ask them to
+look at the quotas and clean up the leaked resources.

From fa184860fec58529a465bd3e80df70a718f70c43 Mon Sep 17 00:00:00 2001
From: "Madhusudan.C.S"
Date: Fri, 14 Jul 2017 00:42:40 -0700
Subject: [PATCH 2/2] Addressed review comments.

---
 .../devel/on-call-federation-build-cop.md     | 60 +++++++++++--------
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/contributors/devel/on-call-federation-build-cop.md b/contributors/devel/on-call-federation-build-cop.md
index 22e447670..bf8b427ab 100644
--- a/contributors/devel/on-call-federation-build-cop.md
+++ b/contributors/devel/on-call-federation-build-cop.md
@@ -4,12 +4,14 @@ Federation runs two classes of tests: CI and Presubmits.
 
 ## CI
 
-* These tests run on the HEADs of master and release branches.
+* These tests run on the HEADs of master and release branches (starting
+  from Kubernetes v1.6).
 * As a result, they run on code that's already merged.
 * As the name suggests, they run continuously. Currently, they are
-  configured to run at least once every 30 minutes.
-* Federation CI tests still run on Jenkins and this mode of testing is
-  now deprecated.
+  configured to run
+  [at least once every 30 minutes](https://github.com/kubernetes/test-infra/blob/22c38cfb64137086373e1b89d5e7d98766560747/prow/config.yaml#L3686).
+* Federation CI tests run as
+  [periodic jobs on prow](https://github.com/kubernetes/test-infra/blob/22c38cfb64137086373e1b89d5e7d98766560747/prow/config.yaml#L3686).
 * CI jobs always run sequentially. In other words, no single CI job
   can have two instances of the job running at the same time.
 
@@ -41,8 +43,10 @@ corresponding to the CI job you want to manually start.
 
 #### Quota cleanup
 
-Please ping someone who has access to the GCP project and ask them to
-look at the quotas and clean up the leaked resources.
+Please ping someone who has access to the GCP project. Ask them to
+look at the quotas and delete the leaked resources by clicking the
+delete button corresponding to those leaked resources on Google Cloud
+Console.
 
 
 ## Presubmit
@@ -59,15 +63,15 @@ look at the quotas and clean up the leaked resources.
     configuration later. Since recycling federated clusters is an
     expensive operation, we do not want to run this often. Hence, this
     job is configured to run once every 24 hours, around midnight
-    pacific time.
+    Pacific time.
   * Test job: This is the job that runs federation presubmit tests on
     every PR in the core repository, i.e.
     [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes).
     These jobs can run in parallel on the PRs in the repository.
 
-### Two-job setup
+### Two-jobs setup
 
-The deploy job runs once 24-hours roughly at around midnight pacific
+The deploy job runs once every 24 hours at around midnight Pacific
 time. It is configured to turn up and tear down 3 federated clusters.
 It starts out by downloading the latest Kubernetes release built from
 [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes)
@@ -80,12 +84,12 @@ bucket will be overwritten.
 
 The test job on the other hand starts by copying the latest kubeconfig
 from the pre-configured GCS bucket. It uses this kubeconfig to deploy
-a new federation control plane in the on one of the clusters in the
-kubeconfig and joins all the clusters, including the host cluster
-where federation control plane is deployed, as members to the newly
-created federation control plane. It then runs the federation
-presubmit tests on this control plane and tears down the control plane
-in the end.
+a new federation control plane on one of the clusters in the
+kubeconfig. It then joins all the clusters in the kubeconfig, including
+the host cluster where federation control plane is deployed, as members
+to the newly created federation control plane. The test job then runs
+the federation presubmit tests on this control plane and tears down the
+control plane in the end.
 
 Since federated clusters are recycled only once every 24 hours, all
 presubmit runs in that period share the federated clusters. And since
@@ -93,7 +97,7 @@ there could be multiple presubmit tests running in parallel, each
 instance of the test gets its own namespace where it deploys the
 federation control plane. These federation control planes deployed in
 separate namespaces are independent of each other and do not interfere
-with other federation control planes in anyway.
+with other federation control planes in any way.
 
 ### Configuration
 
@@ -113,8 +117,10 @@ The configuration of the deploy job is stored in:
 
 #### Test job
 
-The test job is configured in prow, but it runs in Jenkins mode. The
-configuration steps are described in https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/README.md#create-a-new-job
+The test job is
+[configured in prow](https://github.com/kubernetes/test-infra/blob/35ceb37e999bb0589218708262634951b79dfe05/prow/config.yaml#L236),
+but it runs in Jenkins mode. The configuration steps are described in
+https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/README.md#create-a-new-job
 
 The configuration of the test job is stored in:
 
@@ -136,9 +142,9 @@ individual tests that run against PRs in
 [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes).
 
 * The metrics that we track are documented in https://github.com/kubernetes/test-infra/blob/0c56d2c9d32307c0a0f8fece85ef6919389e77fd/metrics/README.md#metrics.
-* Job-level metrics are available in - http://storage.googleapis.com/k8s-metrics/job-flakes-latest.json.
-* As of this writing, federation presubmits have a success rate of
-  93.4%
+* Job-level metrics are available in - [http://storage.googleapis.com/k8s-metrics/job-flakes-latest.json]().
+* As of this writing, federation presubmits have a [success rate of
+  93.4%](http://storage.googleapis.com/k8s-metrics/job-flakes-latest.json).
 
 ### Playbook
 
@@ -148,12 +154,14 @@ Please ping someone who has access to the Jenkins UI/dashboard and ask
 them to login and click the "Build Now" link on the Jenkins page
 corresponding to the CI job you want to manually start.
 
-#### Triggering a new deploy job run
+#### Triggering a new test run
 
-Use the @k8s-bot on the PR to retrigger the test. The exact bot
-incantation is: `@k8s-bot pull-kubernetes-federation-e2e-gce test this`
+Use the `/test` command on the PR to retrigger the test. The exact
+incantation is: `/test pull-kubernetes-federation-e2e-gce`
 
 #### Quota cleanup
 
-Please ping someone who has access to the GCP project and ask them to
-look at the quotas and clean up the leaked resources.
+Please ping someone who has access to `k8s-jkns-pr-bldr-e2e-gce-fdrtn`
+GCP project. Ask them to look at the quotas and delete the leaked
+resources by clicking the delete button corresponding to those leaked
+resources on Google Cloud Console.
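
The presubmit quota cleanup entry in the playbook asks someone with access to the `k8s-jkns-pr-bldr-e2e-gce-fdrtn` project to find and delete leaked resources in the Cloud Console. As a rough, read-only sketch of how that person might first take an inventory from a terminal, the snippet below builds `gcloud compute ... list` commands for a few resource kinds. Only the project name comes from the guide. The choice of resource kinds and the helper names `inventory_commands` and `print_inventory` are assumptions made for illustration, and nothing here deletes anything.

```python
import shutil
import subprocess

# Project name taken from the guide; everything else below is an assumption.
PROJECT = "k8s-jkns-pr-bldr-e2e-gce-fdrtn"

# Hypothetical selection of resource kinds to inspect. The guide does not say
# which kinds of resources actually leak.
RESOURCE_KINDS = (
    "instances",
    "disks",
    "addresses",
    "forwarding-rules",
    "firewall-rules",
)


def inventory_commands(project=PROJECT, kinds=RESOURCE_KINDS):
    """Build read-only `gcloud compute <kind> list` commands for a project."""
    return [
        ["gcloud", "compute", kind, "list", "--project", project]
        for kind in kinds
    ]


def print_inventory(project=PROJECT):
    """Run each listing command and let gcloud print its output.

    Nothing is deleted; deletion stays a manual step as the guide describes.
    """
    if shutil.which("gcloud") is None:
        raise RuntimeError("gcloud CLI not found on PATH")
    for cmd in inventory_commands(project):
        print("$ " + " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `print_inventory()` from an authenticated shell would print one listing per resource kind. Deciding which entries are leaked, and deleting them, remains a judgment call for whoever has access to the project.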