We don't call klog.InitFlags yet, because that will cause a flag
redefinition error until we get everyone to stop using glog. That
will happen when we update to k8s 1.13.
The launch configuration test exposed that our integration tests don't
retry for very long, and wait a long time in between retries.
Create a RunTasksOptions type to hold the parameters, in particular
max task time, and the amount of time we wait when all tasks have
failed.
This PR implmenents a new custom error that is returned when a task
lifecycle set to fi.LifecycleExistsAndWarnIfChanges. This will allow
a task to to fail validation, but the task is marked as completed and
the error is cleared.
This:
- reworks how retries are handled in fi/executor.go to a time-based scheme
- changes the single-task limit to 10m (from about 30s of no-progress)
- eliminates the inner IAM propagation retry for LaunchConfigurations,
because the task itself will just be redriven for a while. This also
eliminates any long-pole delay caused by this error (since task Run()
should be 'fast').
Now that we're running in parallel, sometimes AWS eventual consistency
causes us problems. We now retry up to 3 times, sleeping 10 seconds in
between each run even when we aren't making progress.
If there is an error performing a task, we will reattempt it as long as
forward progress is still being made (i.e. at least one other task
completed successfully)
This makes everything more reliable (though we should still fix these
problems), but it also lays the groundwork for parallel execution.