From 03190b4bedac819dbf80e70e1574ed0f43ab2cda Mon Sep 17 00:00:00 2001
From: eduartua
Date: Wed, 30 Jan 2019 13:31:33 -0600
Subject: [PATCH] file writing-good-e2e-tests.md moved to the new folder
 /devel/sig-testing - URLs updated - tombstone file created

---
 .../sig-testing/writing-good-e2e-tests.md    | 231 +++++++++++++++++
 contributors/devel/writing-good-e2e-tests.md | 232 +----------------
 events/elections/2017/quintonhoole_bio.md    |   2 +-
 events/elections/2018/quintonhoole.md        |   2 +-
 4 files changed, 235 insertions(+), 232 deletions(-)
 create mode 100644 contributors/devel/sig-testing/writing-good-e2e-tests.md

diff --git a/contributors/devel/sig-testing/writing-good-e2e-tests.md b/contributors/devel/sig-testing/writing-good-e2e-tests.md
new file mode 100644
index 000000000..836479c2e
--- /dev/null
+++ b/contributors/devel/sig-testing/writing-good-e2e-tests.md
@@ -0,0 +1,231 @@

# Writing good e2e tests for Kubernetes #

## Patterns and Anti-Patterns ##

### Goals of e2e tests ###

Beyond the obvious goal of providing end-to-end system test coverage, there are a few less obvious goals that you should bear in mind when designing, writing and debugging your end-to-end tests. In particular, "flaky" tests, which pass most of the time but fail intermittently for difficult-to-diagnose reasons, are extremely costly in terms of blurring our regression signals and slowing down our automated merge velocity. Up-front time and effort designing your test to be reliable is very well spent. Bear in mind that we have hundreds of tests, each running in dozens of different environments, and if any test in any test environment fails, we have to assume that we potentially have some sort of regression. So if a significant number of tests fail even only 1% of the time, basic statistics dictates that we will almost never have a "green" regression indicator. Stated another way, writing a test that is only 99% reliable is just about useless in the harsh reality of a CI environment. In fact it's worse than useless, because not only does it not provide a reliable regression indicator, but it also costs a lot of subsequent debugging time and delayed merges.

#### Debuggability ####

If your test fails, its output should explain the reasons for the failure in as much detail as possible. "Timeout" is not a useful error message. "Timed out after 60 seconds waiting for pod xxx to enter running state, still in pending state" is much more useful to someone trying to figure out why your test failed and what to do about it. Specifically, [assertion](https://onsi.github.io/gomega/#making-assertions) code like the following generates rather useless errors:

```
Expect(err).NotTo(HaveOccurred())
```

Rather, [annotate](https://onsi.github.io/gomega/#annotating-assertions) your assertion with something like this:

```
Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
```

On the other hand, overly verbose logging, particularly of non-error conditions, can make it unnecessarily difficult to figure out whether a test failed and, if so, why. So don't log lots of irrelevant stuff either.

#### Ability to run in non-dedicated test clusters ####

To reduce end-to-end delay and improve resource utilization when running e2e tests, we try, where possible, to run large numbers of tests in parallel against the same test cluster. This means that:

1. You should avoid making any assumption (implicit or explicit) that your test is the only thing running against the cluster. For example, assuming that your test can run a pod on every node in a cluster is not safe, as some other tests, running at the same time as yours, might have saturated one or more nodes in the cluster. Similarly, running a pod in the system namespace and assuming that this will increase the count of pods in the system namespace by one is not safe, as some other test might be creating or deleting pods in the system namespace at the same time as your test. If you do legitimately need to write a test like that, make sure to label it ["\[Serial\]"](e2e-tests.md#kinds-of-tests) so that it's easy to identify, and not run in parallel with any other tests (see the sketch after this list).
1. You should avoid doing things to the cluster that make it difficult for other tests to reliably do what they're trying to do at the same time. For example, rebooting nodes, disconnecting network interfaces, or upgrading cluster software as part of your test is likely to violate the assumptions that other tests might have made about a reasonably stable cluster environment. If you need to write such tests, please label them as ["\[Disruptive\]"](e2e-tests.md#kinds-of-tests) so that it's easy to identify them, and not run them in parallel with other tests.
1. You should avoid making assumptions about the Kubernetes API that are not part of the API specification, as your tests will break as soon as these assumptions become invalid. For example, relying on specific Events, Event reasons or Event messages will make your tests very brittle.
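For concreteness, here is a minimal sketch of how such a label is attached (the spec names are hypothetical; `Describe` and `It` come from Ginkgo, introduced below). Labels like `[Serial]` and `[Disruptive]` are plain substrings of a spec's name, which runners then match with flags such as `--ginkgo.skip`:

```
package e2e

import (
	. "github.com/onsi/ginkgo"
)

// The [Disruptive] tag is just part of the Describe text; test runners
// select or exclude such specs by matching the assembled spec name
// (e.g. with --ginkgo.skip=\[Disruptive\]).
var _ = Describe("Nodes [Disruptive]", func() {
	It("should recover when a node is rebooted", func() {
		// ... reboot a node, then verify that its pods are rescheduled ...
	})
})
```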
#### Speed of execution ####

We have hundreds of e2e tests, some of which we run serially, one after the other. If each test takes just a few minutes to run, that very quickly adds up to many, many hours of total execution time. We try to keep such total execution time down to a few tens of minutes at most. Therefore, try (very hard) to keep the execution time of your individual tests below 2 minutes, ideally shorter than that. Concretely, adding inappropriately long 'sleep' statements or other gratuitous waits to tests is a killer. If under normal circumstances your pod enters the running state within 10 seconds, and 99.9% of the time within 30 seconds, it would be gratuitous to wait 5 minutes for this to happen. Rather just fail after 30 seconds, with a clear error message as to why your test failed (e.g. "Pod x failed to become ready after 30 seconds; it usually takes 10 seconds"). If you do have a truly legitimate reason for waiting longer than that, or for writing a test which takes longer than 2 minutes to run, comment very clearly in the code why this is necessary, and label the test as ["\[Slow\]"](e2e-tests.md#kinds-of-tests), so that it's easy to identify and avoid in test runs that are required to complete in a timely manner (for example, those that are run against every code submission before it is allowed to be merged).

Note that completing within, say, 2 minutes only when the test passes is not generally good enough. Your test should also fail in a reasonable time. We have seen tests that, for example, wait up to 10 minutes for each of several pods to become ready. Under good conditions these tests might pass within a few seconds, but if the pods never become ready (e.g. due to a system regression) they take a very long time to fail, and typically cause the entire test run to time out so that no results are produced. Again, this is a lot less useful than a test that fails reliably within a minute or two when the system is not working correctly.
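As an illustration, here is a minimal sketch of a bounded, fail-fast wait, assuming the client-go and apimachinery APIs current when this document was written (the 2-second poll interval and the timeout bound are hypothetical values chosen from the example above):

```
package e2e

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodRunning polls every 2 seconds, gives up after a bound derived
// from the pod's normal startup time (usually ~10s, so 30s is already
// generous), and reports exactly what it was waiting for and what it saw.
func waitForPodRunning(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	var lastPhase v1.PodPhase
	err := wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors until the timeout
		}
		lastPhase = pod.Status.Phase
		return lastPhase == v1.PodRunning, nil
	})
	if err != nil {
		return fmt.Errorf("pod %s/%s failed to become Running within %v (last observed phase: %q); it usually takes about 10s", ns, name, timeout, lastPhase)
	}
	return nil
}
```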
#### Resilience to relatively rare, temporary infrastructure glitches or delays ####

Remember that your test will be run many thousands of times, at different times of day and night, probably on different cloud providers, under different load conditions. And often the underlying state of these systems is stored in eventually consistent data stores. So, for example, if a resource creation request is theoretically asynchronous, even if you observe it to be practically synchronous most of the time, write your test to assume that it's asynchronous (e.g. make the "create" call, and poll or watch the resource until it's in the correct state before proceeding). Similarly, don't assume that API endpoints are 100% available. They're not. Under high load conditions, API calls might temporarily fail or time out. In such cases it's appropriate to back off and retry a few times before failing your test completely (in which case make the error message very clear about what happened, e.g. "Retried http://... 3 times - all failed with xxx"). Use the standard retry mechanisms provided in the libraries detailed below.
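For example, here is a minimal back-off-and-retry sketch using `wait.ExponentialBackoff` from apimachinery (the retry counts and intervals are illustrative; in real tests, prefer the shared retry helpers and constants in the libraries described below):

```
package e2e

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// createWithRetry retries a possibly-flaky API call with exponential
// back-off and, if it ultimately gives up, reports how many attempts
// were made and what the last failure was.
func createWithRetry(create func() error) error {
	attempts := 0
	var lastErr error
	backoff := wait.Backoff{Duration: time.Second, Factor: 2.0, Steps: 3}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		attempts++
		if lastErr = create(); lastErr != nil {
			return false, nil // transient failure: back off and try again
		}
		return true, nil
	})
	if err != nil {
		return fmt.Errorf("retried create %d times - all failed, last error: %v", attempts, lastErr)
	}
	return nil
}
```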
### Some concrete tools at your disposal ###

Obviously most of the above goals apply to many tests, not just yours. So we've developed a set of reusable test infrastructure, libraries and best practices to help you to do the right thing, or at least do the same thing as other tests, so that if that turns out to be the wrong thing, it can be fixed in one place, not hundreds, to be the right thing.

Here are a few pointers:

+ [E2e Framework](https://git.k8s.io/kubernetes/test/e2e/framework/framework.go): Familiarise yourself with this test framework and how to use it. Among other things, it automatically creates uniquely named namespaces within which your tests can run to avoid name clashes, and reliably automates cleaning up the mess after your test has completed (it just deletes everything in the namespace). This helps to ensure that tests do not leak resources. Note that deleting a namespace (and by implication everything in it) is currently an expensive operation. So the fewer resources you create, the less cleaning up the framework needs to do, and the faster your test (and other tests running concurrently with yours) will complete. Your tests should always use this framework. Trying other home-grown approaches to avoiding name clashes and resource leaks has proven to be a very bad idea.
+ [E2e utils library](https://git.k8s.io/kubernetes/test/e2e/framework/util.go): This handy library provides tons of reusable code for a host of commonly needed test functionality, including waiting for resources to enter specified states, safely and consistently retrying failed operations, usefully reporting errors, and much more. Make sure that you're familiar with what's available there, and use it. Likewise, if you come across a generally useful mechanism that's not yet implemented there, add it so that others can benefit from your brilliance. In particular, pay attention to the variety of timeout- and retry-related constants at the top of that file. Always try to reuse these constants rather than dream up your own values. Even if the values there are not precisely what you would like to use (timeout periods, retry counts etc.), the benefit of having them be consistent and centrally configurable across our entire test suite typically outweighs your personal preferences.
+ **Follow the examples of stable, well-written tests:** Some of our existing end-to-end tests are better written and more reliable than others. A few examples of well-written tests include [Replication Controllers](https://git.k8s.io/kubernetes/test/e2e/apps/rc.go), [Services](https://git.k8s.io/kubernetes/test/e2e/network/service.go), and [Reboot](https://git.k8s.io/kubernetes/test/e2e/lifecycle/reboot.go).
+ [Ginkgo Test Framework](https://github.com/onsi/ginkgo): This is the test library and runner upon which our e2e tests are built. Before you write or refactor a test, read the docs and make sure that you understand how it works. In particular, be aware that every test is uniquely identified and described (e.g. in test reports) by the concatenation of its `Describe` clause and nested `It` clauses. So, for example, `Describe("Pods", ...)` with a nested `It("should be scheduled with cpu and memory limits")` produces the sane test identifier and descriptor `Pods should be scheduled with cpu and memory limits`, which makes it clear what's being tested, and hence what's not working if it fails. Other good examples include:

```
CAdvisor should be healthy on every node
```

and

```
Daemon set should run and stop complex daemon
```

By contrast (these are real examples), the following are less good test descriptors:

```
KubeProxy should test kube-proxy
```

and

```
Nodes [Disruptive] Network when a node becomes unreachable
[replication controller] recreates pods scheduled on the
unreachable node AND allows scheduling of pods on a node after
it rejoins the cluster
```

An improvement might be

```
Unreachable nodes are evacuated and then repopulated upon rejoining [Disruptive]
```
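Putting several of these recommendations together, here is a minimal sketch of a framework-based test whose `Describe`/`It` concatenation yields the descriptor discussed above (the pod spec, image and timeout are illustrative, and `waitForPodRunning` is the hypothetical helper sketched earlier):

```
package e2e

import (
	"time"

	. "github.com/onsi/ginkgo"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = Describe("Pods", func() {
	// The framework gives each test a uniquely named namespace and
	// deletes it (and everything in it) when the test finishes.
	f := framework.NewDefaultFramework("pods")

	// Reported as: "Pods should be scheduled with cpu and memory limits".
	It("should be scheduled with cpu and memory limits", func() {
		pod := &v1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: "limits-pod"},
			Spec: v1.PodSpec{
				Containers: []v1.Container{{
					Name:  "pause",
					Image: "k8s.gcr.io/pause:3.1",
					Resources: v1.ResourceRequirements{
						Limits: v1.ResourceList{
							v1.ResourceCPU:    resource.MustParse("100m"),
							v1.ResourceMemory: resource.MustParse("64Mi"),
						},
					},
				}},
			},
		}
		_, err := f.ClientSet.CoreV1().Pods(f.Namespace.Name).Create(pod)
		framework.ExpectNoError(err, "Failed to create pod %s in namespace %s", pod.Name, f.Namespace.Name)

		// Fail fast, with a clear message, rather than hanging for minutes.
		err = waitForPodRunning(f.ClientSet, f.Namespace.Name, pod.Name, 30*time.Second)
		framework.ExpectNoError(err, "Pod %s never became Running", pod.Name)
	})
})
```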
Note that opening issues for specific better tooling is welcome, and code implementing that tooling is even more welcome :-).

diff --git a/contributors/devel/writing-good-e2e-tests.md b/contributors/devel/writing-good-e2e-tests.md
index 836479c2e..b39208eb7 100644
--- a/contributors/devel/writing-good-e2e-tests.md
+++ b/contributors/devel/writing-good-e2e-tests.md
@@ -1,231 +1,3 @@

# Writing good e2e tests for Kubernetes #

## Patterns and Anti-Patterns ##

### Goals of e2e tests ###

Beyond the obvious goal of providing end-to-end system test coverage, there are a few less obvious goals that you should bear in mind when designing, writing and debugging your end-to-end tests. In particular, "flaky" tests, which pass most of the time but fail intermittently for difficult-to-diagnose reasons, are extremely costly in terms of blurring our regression signals and slowing down our automated merge velocity. Up-front time and effort designing your test to be reliable is very well spent. Bear in mind that we have hundreds of tests, each running in dozens of different environments, and if any test in any test environment fails, we have to assume that we potentially have some sort of regression. So if a significant number of tests fail even only 1% of the time, basic statistics dictates that we will almost never have a "green" regression indicator. Stated another way, writing a test that is only 99% reliable is just about useless in the harsh reality of a CI environment. In fact it's worse than useless, because not only does it not provide a reliable regression indicator, but it also costs a lot of subsequent debugging time and delayed merges.

#### Debuggability ####

If your test fails, its output should explain the reasons for the failure in as much detail as possible. "Timeout" is not a useful error message. "Timed out after 60 seconds waiting for pod xxx to enter running state, still in pending state" is much more useful to someone trying to figure out why your test failed and what to do about it. Specifically, [assertion](https://onsi.github.io/gomega/#making-assertions) code like the following generates rather useless errors:

```
Expect(err).NotTo(HaveOccurred())
```

Rather, [annotate](https://onsi.github.io/gomega/#annotating-assertions) your assertion with something like this:

```
Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
```

On the other hand, overly verbose logging, particularly of non-error conditions, can make it unnecessarily difficult to figure out whether a test failed and, if so, why. So don't log lots of irrelevant stuff either.

#### Ability to run in non-dedicated test clusters ####

To reduce end-to-end delay and improve resource utilization when running e2e tests, we try, where possible, to run large numbers of tests in parallel against the same test cluster. This means that:

1. You should avoid making any assumption (implicit or explicit) that your test is the only thing running against the cluster. For example, assuming that your test can run a pod on every node in a cluster is not safe, as some other tests, running at the same time as yours, might have saturated one or more nodes in the cluster. Similarly, running a pod in the system namespace and assuming that this will increase the count of pods in the system namespace by one is not safe, as some other test might be creating or deleting pods in the system namespace at the same time as your test. If you do legitimately need to write a test like that, make sure to label it ["\[Serial\]"](e2e-tests.md#kinds-of-tests) so that it's easy to identify, and not run in parallel with any other tests.
1. You should avoid doing things to the cluster that make it difficult for other tests to reliably do what they're trying to do at the same time. For example, rebooting nodes, disconnecting network interfaces, or upgrading cluster software as part of your test is likely to violate the assumptions that other tests might have made about a reasonably stable cluster environment. If you need to write such tests, please label them as ["\[Disruptive\]"](e2e-tests.md#kinds-of-tests) so that it's easy to identify them, and not run them in parallel with other tests.
1. You should avoid making assumptions about the Kubernetes API that are not part of the API specification, as your tests will break as soon as these assumptions become invalid. For example, relying on specific Events, Event reasons or Event messages will make your tests very brittle.
#### Speed of execution ####

We have hundreds of e2e tests, some of which we run serially, one after the other. If each test takes just a few minutes to run, that very quickly adds up to many, many hours of total execution time. We try to keep such total execution time down to a few tens of minutes at most. Therefore, try (very hard) to keep the execution time of your individual tests below 2 minutes, ideally shorter than that. Concretely, adding inappropriately long 'sleep' statements or other gratuitous waits to tests is a killer. If under normal circumstances your pod enters the running state within 10 seconds, and 99.9% of the time within 30 seconds, it would be gratuitous to wait 5 minutes for this to happen. Rather just fail after 30 seconds, with a clear error message as to why your test failed (e.g. "Pod x failed to become ready after 30 seconds; it usually takes 10 seconds"). If you do have a truly legitimate reason for waiting longer than that, or for writing a test which takes longer than 2 minutes to run, comment very clearly in the code why this is necessary, and label the test as ["\[Slow\]"](e2e-tests.md#kinds-of-tests), so that it's easy to identify and avoid in test runs that are required to complete in a timely manner (for example, those that are run against every code submission before it is allowed to be merged).

Note that completing within, say, 2 minutes only when the test passes is not generally good enough. Your test should also fail in a reasonable time. We have seen tests that, for example, wait up to 10 minutes for each of several pods to become ready. Under good conditions these tests might pass within a few seconds, but if the pods never become ready (e.g. due to a system regression) they take a very long time to fail, and typically cause the entire test run to time out so that no results are produced. Again, this is a lot less useful than a test that fails reliably within a minute or two when the system is not working correctly.

#### Resilience to relatively rare, temporary infrastructure glitches or delays ####

Remember that your test will be run many thousands of times, at different times of day and night, probably on different cloud providers, under different load conditions. And often the underlying state of these systems is stored in eventually consistent data stores. So, for example, if a resource creation request is theoretically asynchronous, even if you observe it to be practically synchronous most of the time, write your test to assume that it's asynchronous (e.g. make the "create" call, and poll or watch the resource until it's in the correct state before proceeding). Similarly, don't assume that API endpoints are 100% available. They're not. Under high load conditions, API calls might temporarily fail or time out. In such cases it's appropriate to back off and retry a few times before failing your test completely (in which case make the error message very clear about what happened, e.g. "Retried http://... 3 times - all failed with xxx"). Use the standard retry mechanisms provided in the libraries detailed below.
### Some concrete tools at your disposal ###

Obviously most of the above goals apply to many tests, not just yours. So we've developed a set of reusable test infrastructure, libraries and best practices to help you to do the right thing, or at least do the same thing as other tests, so that if that turns out to be the wrong thing, it can be fixed in one place, not hundreds, to be the right thing.

Here are a few pointers:

+ [E2e Framework](https://git.k8s.io/kubernetes/test/e2e/framework/framework.go): Familiarise yourself with this test framework and how to use it. Among other things, it automatically creates uniquely named namespaces within which your tests can run to avoid name clashes, and reliably automates cleaning up the mess after your test has completed (it just deletes everything in the namespace). This helps to ensure that tests do not leak resources. Note that deleting a namespace (and by implication everything in it) is currently an expensive operation. So the fewer resources you create, the less cleaning up the framework needs to do, and the faster your test (and other tests running concurrently with yours) will complete. Your tests should always use this framework. Trying other home-grown approaches to avoiding name clashes and resource leaks has proven to be a very bad idea.
+ [E2e utils library](https://git.k8s.io/kubernetes/test/e2e/framework/util.go): This handy library provides tons of reusable code for a host of commonly needed test functionality, including waiting for resources to enter specified states, safely and consistently retrying failed operations, usefully reporting errors, and much more. Make sure that you're familiar with what's available there, and use it. Likewise, if you come across a generally useful mechanism that's not yet implemented there, add it so that others can benefit from your brilliance. In particular, pay attention to the variety of timeout- and retry-related constants at the top of that file. Always try to reuse these constants rather than dream up your own values. Even if the values there are not precisely what you would like to use (timeout periods, retry counts etc.), the benefit of having them be consistent and centrally configurable across our entire test suite typically outweighs your personal preferences.
+ **Follow the examples of stable, well-written tests:** Some of our existing end-to-end tests are better written and more reliable than others. A few examples of well-written tests include [Replication Controllers](https://git.k8s.io/kubernetes/test/e2e/apps/rc.go), [Services](https://git.k8s.io/kubernetes/test/e2e/network/service.go), and [Reboot](https://git.k8s.io/kubernetes/test/e2e/lifecycle/reboot.go).
+ [Ginkgo Test Framework](https://github.com/onsi/ginkgo): This is the test library and runner upon which our e2e tests are built. Before you write or refactor a test, read the docs and make sure that you understand how it works. In particular, be aware that every test is uniquely identified and described (e.g. in test reports) by the concatenation of its `Describe` clause and nested `It` clauses. So, for example, `Describe("Pods", ...)` with a nested `It("should be scheduled with cpu and memory limits")` produces the sane test identifier and descriptor `Pods should be scheduled with cpu and memory limits`, which makes it clear what's being tested, and hence what's not working if it fails.
Other good examples include:

```
CAdvisor should be healthy on every node
```

and

```
Daemon set should run and stop complex daemon
```

By contrast (these are real examples), the following are less good test descriptors:

```
KubeProxy should test kube-proxy
```

and

```
Nodes [Disruptive] Network when a node becomes unreachable
[replication controller] recreates pods scheduled on the
unreachable node AND allows scheduling of pods on a node after
it rejoins the cluster
```

An improvement might be

```
Unreachable nodes are evacuated and then repopulated upon rejoining [Disruptive]
```

Note that opening issues for specific better tooling is welcome, and code implementing that tooling is even more welcome :-).

+This file has moved to https://git.k8s.io/community/contributors/devel/sig-testing/writing-good-e2e-tests.md.
+This file is a placeholder to preserve links. Please remove by April 30, 2019 or the release of Kubernetes 1.13, whichever comes first.
\ No newline at end of file

diff --git a/events/elections/2017/quintonhoole_bio.md b/events/elections/2017/quintonhoole_bio.md
index d72d6aa00..cee52cdd8 100644
--- a/events/elections/2017/quintonhoole_bio.md
+++ b/events/elections/2017/quintonhoole_bio.md
@@ -28,7 +28,7 @@ I have an M.Sc in Computer Science.
 # Highlights of What I've done on Kubernetes thus far
 
 Initially I led the effort to get our CI Testing stuff initiated and
-[working effectively](https://github.com/kubernetes/community/blob/master/contributors/devel/writing-good-e2e-tests.md), which was critical to being able to launch v1.0 successfully,
+[working effectively](/contributors/devel/sig-testing/writing-good-e2e-tests.md), which was critical to being able to launch v1.0 successfully,
 way back in July 2015. Around that time I handed it over to the now-famous SIG-Testing.
 
 After that I initiated and led the [Cluster
diff --git a/events/elections/2018/quintonhoole.md b/events/elections/2018/quintonhoole.md
index 416e8f658..0120e91a8 100644
--- a/events/elections/2018/quintonhoole.md
+++ b/events/elections/2018/quintonhoole.md
@@ -30,7 +30,7 @@ I have an M.Sc in Computer Science.
 1. steering committee member since October 2017.
 2. active contributor to [SIG-Architecture](https://github.com/kubernetes/community/tree/master/sig-architecture).
 3. lead [SIG-Multicluster](https://github.com/kubernetes/community/tree/master/sig-multicluster)
-4. early lead getting CI Testing [working effectively](https://github.com/kubernetes/community/blob/master/contributors/devel/writing-good-e2e-tests.md), critical to v1.0 launch.
+4. early lead getting CI Testing [working effectively](/contributors/devel/sig-testing/writing-good-e2e-tests.md), critical to v1.0 launch.
 5. Helped [SIG-Scalability set key objectives](https://github.com/kubernetes/community/blob/master/sig-scalability/goals.md).