This:
- reworks how retries are handled in fi/executor.go to a time-based scheme
- changes the single-task limit to 10m (from about 30s of no-progress)
- eliminates the inner IAM propagation retry for LaunchConfigurations,
because the task itself will just be redriven for a while. This also
eliminates any long-pole delay caused by this error (since task Run()
should be 'fast').
Now that we're running in parallel, sometimes AWS eventual consistency
causes us problems. We now retry up to 3 times, sleeping 10 seconds in
between each run even when we aren't making progress.
If there is an error performing a task, we will reattempt it as long as
forward progress is still being made (i.e. at least one other task
completed successfully)
This makes everything more reliable (though we should still fix these
problems), but it also lays the groundwork for parallel execution.