Previously, the HealthCheck exec session would not terminate on timeout, allowing the healthcheck to run indefinitely.
Fixes: https://issues.redhat.com/browse/RHEL-86096
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Go initializes unset values to the zero value of their type. This means that the destination of the log is an empty string and the count and size are set to 0. However, a size and count of 0 mean unbounded, and that is not the default behavior.
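As a rough sketch of the intended behavior (struct and field names are invented here, not the actual Podman types), the zero values have to be replaced with the documented defaults rather than being treated as limits:
```go
package main

import "fmt"

// Hypothetical illustration only (names invented). Go initializes unset
// struct fields to their zero value, so an untouched config reads as
// destination "", count 0, size 0. Since 0 would mean "unbounded", fall
// back to the documented defaults explicitly.

type healthLogConfig struct {
	Destination string
	MaxCount    uint
	MaxSize     uint
}

func applyHealthLogDefaults(cfg *healthLogConfig) {
	if cfg.Destination == "" {
		cfg.Destination = "local" // default: keep results in the container status
	}
	if cfg.MaxCount == 0 {
		cfg.MaxCount = 5 // keep only the last five results
	}
	if cfg.MaxSize == 0 {
		cfg.MaxSize = 500 // cap each stored result at 500 characters
	}
}

func main() {
	cfg := healthLogConfig{}
	applyHealthLogDefaults(&cfg)
	fmt.Printf("%+v\n", cfg) // {Destination:local MaxCount:5 MaxSize:500}
}
```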
Fixes: https://github.com/containers/podman/issues/25473
Fixes: https://issues.redhat.com/browse/RHEL-83262
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
When starting a container, consider healthcheck errors fatal. That way
users know when systemd-run failed to set up the timer to run the
healthcheck, and we don't get into a state where the container is running
but the healthcheck is not.
This also fixes the broken error reporting from the systemd-run exec: if
the binary could not be run, the output was just empty, leaving users
with no idea what failed.
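A hypothetical sketch of the idea (not the actual Podman code): capture the systemd-run output and return it as part of a fatal error, so a failed timer setup is visible to the user:
```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Hypothetical sketch: run systemd-run to create the healthcheck timer and
// treat any failure as fatal, including the command output in the error so
// the user can see why the timer could not be set up.
func createHealthcheckTimer(unitName, interval, containerID string) error {
	cmd := exec.Command("systemd-run", "--unit", unitName,
		"--on-unit-inactive="+interval, "--timer-property=AccuracySec=1s",
		"podman", "healthcheck", "run", containerID)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("systemd-run failed: %w: %s", err, strings.TrimSpace(string(out)))
	}
	return nil
}

func main() {
	if err := createHealthcheckTimer("hc-example", "30s", "mycontainer"); err != nil {
		fmt.Println(err) // the systemd-run output is now part of the message
	}
}
```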
Fixes #25034
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
New flags in `podman update` can change the HealthCheck configuration of a started container without having to restart or recreate it.
This can help determine why a given container suddenly started failing HealthCheck without interfering with the services it provides. For example, reconfigure HealthCheck to keep logs longer than the usual last X results, store logs to other destinations, etc.
Fixes: https://issues.redhat.com/browse/RHEL-60561
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
The startup service is special because we have to transition from
startup to the normal unit. And in order to do so we kill ourselves (as
we are run as part of the service). This means we always exited 1, which
causes systemd to consider the unit failed and not remove the transient unit
unless "reset-failed" is called. As there is no process around to do
that, we cannot really do this, so make the process exit(0), which makes more
sense.
Of course we could try to reset-failed the unit later, but the code for
that seems more complicated than it is worth.
Add a new test from Ed that ensures we check for all healthcheck units,
not just the timer, to avoid leaks. I slightly modified it to provide a
better error on leaks.
Fixes: 0bbef4b830 ("libpod: rework shutdown handler flow")
Fixes: #24351
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
These flags can affect the output of the HealthCheck log. Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via `podman inspect`.
It is also limited to the last five executions and the first 500 characters per execution.
This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic `healthcheck service failed` record.
- The `--health-log-destination` flag sets the destination of the HealthCheck log.
- `none`: (default behavior) `HealthCheckResults` are stored in overlay containers. (For example: `$runroot/healthcheck.log`)
- `directory`: creates a log file named `<container-ID>-healthcheck.log` with JSON `HealthCheckResults` in the specified directory.
- `events_logger`: The log will be written using the logging mechanism set by `events_logger`. It also saves the log to a default directory, for performance on systems with a large number of logs.
- The `--health-max-log-count` flag sets the maximum number of attempts in the HealthCheck log file.
- A value of `0` indicates an infinite number of attempts in the log file.
- The default value is `5` attempts in the log file.
- The `--health-max-log-size` flag sets the maximum length of the log stored.
- A value of `0` indicates an infinite log length.
- The default value is `500` log characters. (A sketch of how both limits are applied follows below.)
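A minimal sketch of how the two limits might be applied, with invented names and 0 meaning "unlimited" for both; this is illustrative only, not the actual implementation:
```go
package main

import "fmt"

// Illustrative sketch only (names invented): truncate each result to the
// size limit and keep only the newest maxCount entries, where 0 disables
// the respective limit.

type healthCheckLogEntry struct {
	Start, End, Output string
	ExitCode           int
}

func appendHealthLog(log []healthCheckLogEntry, entry healthCheckLogEntry, maxCount, maxSize uint) []healthCheckLogEntry {
	if maxSize > 0 && len(entry.Output) > int(maxSize) {
		entry.Output = entry.Output[:maxSize] // truncate the stored output
	}
	log = append(log, entry)
	if maxCount > 0 && len(log) > int(maxCount) {
		log = log[len(log)-int(maxCount):] // keep only the newest entries
	}
	return log
}

func main() {
	var log []healthCheckLogEntry
	entry := healthCheckLogEntry{Output: "some very long healthcheck output..."}
	log = appendHealthLog(log, entry, 5, 500)
	fmt.Println(len(log), len(log[0].Output))
}
```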
Add --health-max-log-count flag
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Add --health-max-log-size flag
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Add --health-log-destination flag
Signed-off-by: Jan Rodák <hony.com@seznam.cz>
Using the scanner is just unnecessarily complicated and buggy, as it will
not read the final line with its newline. There is also the problem that
it happens in a separate goroutine, so it could lose output if we read
the array before the scanner was done.
The API accepts a Writer, so we can just directly use a bytes.Buffer,
which captures all output in memory without the need for another
goroutine.
This also means that we now always include the final newline in the
output. I checked with Docker and it does the same, so this is good.
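A small standalone illustration of the same idea using os/exec from the standard library (Podman's exec API similarly accepts an io.Writer):
```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// A bytes.Buffer works directly as the output Writer and captures
// everything in memory: no scanner, no extra goroutine, and the trailing
// newline is preserved as-is.
func main() {
	var buf bytes.Buffer
	cmd := exec.Command("echo", "hello")
	cmd.Stdout = &buf
	cmd.Stderr = &buf // shared buffer for both streams
	if err := cmd.Run(); err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Printf("%q\n", buf.String()) // "hello\n" -- trailing newline included
}
```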
Fixes #23332
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
This fixes a regression added in commit 4fd84190b8: because the name was
overwritten by the createTimer() call, the removeTransientFiles()
call removed the new timer and not the startup healthcheck timer. Then,
when the container was stopped, we leaked it, as the wrong unit name was
in the state.
A new test has been added to ensure the logic works and we never leak
the system timers.
Fixes #22884
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Checking if the file exists before opening it anyway is really pointless;
it needs an extra syscall and is in theory racy, as the file might have
been changed between the two calls. We can simply ignore the ENOENT
error on the ReadFile call.
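A minimal standalone example of the pattern (the path and helper name are made up):
```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// Skip the extra Stat() and just read the file, treating "does not exist"
// as an empty (but valid) result instead of an error.
func readHealthLog(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil // no log yet: not an error
	}
	return data, err
}

func main() {
	data, err := readHealthLog("/nonexistent/healthcheck.log")
	fmt.Println(len(data), err) // 0 <nil>
}
```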
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When the field is set to false we should never log healthcheck events.
Fixes https://issues.redhat.com/browse/RHEL-18987
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We already know the status of the healthcheck in the caller, so calling
healthCheckStatus() just makes the event code sync the container state
and reread the healthcheck file for no reason.
It is much better to directly pass the status down to the event call.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Moving from Go module v4 to v5 prepares us for public releases.
Move done using gomove [1] as with the v3 and v4 moves.
[1] https://github.com/KSubedi/gomove
Signed-off-by: Matt Heon <mheon@redhat.com>
When InitialDelaySeconds in the kube yaml is set for a healthcheck,
don't update the healthcheck status till those initial delay seconds are over.
We were waiting to update for a failing healthcheck, but when the healthcheck
was successful during the initial delay time, the status was being updated as healthy
immediately.
This is misleading to the users wondering why their healthcheck takes
much longer to fail for a failing case while it is quick to succeed for
a healthy case. It also doesn't match what the k8s InitialDelaySeconds
does. This change is only for kube play, podman healthcheck run is
unaffected.
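A rough sketch of the gating idea with invented names, not the actual kube play code:
```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch (names invented): do not record any healthcheck
// status, healthy or unhealthy, until InitialDelaySeconds have elapsed
// since the container started, matching the Kubernetes semantics.
func shouldUpdateHealthStatus(startedAt time.Time, initialDelaySeconds int) bool {
	delay := time.Duration(initialDelaySeconds) * time.Second
	return time.Since(startedAt) >= delay
}

func main() {
	started := time.Now().Add(-10 * time.Second)
	fmt.Println(shouldUpdateHealthStatus(started, 30)) // false: still in the initial delay
	fmt.Println(shouldUpdateHealthStatus(started, 5))  // true: delay has passed
}
```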
Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>
HC events were firing as part of the `exec` call, before it had
even been decided whether the HC succeeded or failed. As such,
the status was not going to be correct any time there was a
change (e.g. the first event after a container went from healthy to
unhealthy would still read healthy). Move the event into the
actual Healthcheck function and throw it in a defer to make sure
it happens at the very end, after logs are written.
Ignores several conditions that did not log previously (container
in question does not have a healthcheck, or an internal failure
that should not really happen).
Still not a perfect solution. This relies on the HC log being
written, when instead we could just get the status straight from
the function writing the event - so if we fail to write the log,
we can still report a bad status. But if the log wasn't written,
we're in bad shape regardless - `podman ps` would disagree with
the event written, for example.
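A simplified, hypothetical sketch of the flow (type and method names invented):
```go
package main

import "fmt"

// Hypothetical sketch: decide the status, write the log, and only then emit
// the event; the defer guarantees the event fires at the very end, even on
// early returns.

type container struct{ id string }

func (c *container) newHealthCheckEvent(status string) {
	fmt.Printf("event: container %s health_status=%s\n", c.id, status)
}

func (c *container) runHealthCheck() (status string, err error) {
	defer func() {
		if status != "" { // no event for containers without a healthcheck or internal errors
			c.newHealthCheckEvent(status)
		}
	}()

	// ... execute the check command and append the result to the log ...
	status = "healthy"
	return status, nil
}

func main() {
	c := &container{id: "ae498ac3aa6c"}
	if _, err := c.runHealthCheck(); err != nil {
		fmt.Println(err)
	}
}
```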
Fixes #19237
Signed-off-by: Matt Heon <mheon@redhat.com>
As described in #17777, the `restart` on-failure action did not behave
correctly when the health check is being run by a transient systemd
unit. It ran just fine when being executed outside such a unit, for
instance, manually or, as done in the system tests, in a scripted
fashion.
There were two issues causing the `restart` on-failure action to
misbehave:
1) The transient systemd units used the default `KillMode=cgroup` which
will nuke all processes in the specific cgroup including the recently
restarted container/conmon once the main `podman healthcheck run`
process exits.
2) Podman attempted to remove the transient systemd unit and timer
during restart. That is perfectly fine when manually restarting the
container but not when the restart itself is being executed inside
such a transient unit. Ultimately, Podman tried to shoot itself in
the foot.
Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, `healthcheck run` will just
stop the container, and the cleanup process will restart the container
once it has turned unhealthy.
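A very rough, hypothetical sketch of the split between the two processes (all names invented, details simplified):
```go
package main

import "fmt"

// Hypothetical sketch of the new flow. The healthcheck process only stops
// the container; the restart happens later in the cleanup process, which
// runs outside the transient systemd unit being torn down.

type container struct {
	name            string
	onFailureAction string // "none", "kill", "restart", "stop"
}

func (c *container) stop()  { fmt.Println("stopping", c.name) }
func (c *container) start() { fmt.Println("restarting", c.name) }

// Called from `podman healthcheck run` once the check turned unhealthy.
func (c *container) processHealthCheckFailure() {
	if c.onFailureAction == "restart" {
		c.stop() // do NOT restart here: we may be inside the unit being killed
	}
}

// Called from the cleanup process after the container has stopped.
func (c *container) cleanup() {
	if c.onFailureAction == "restart" {
		c.start()
	}
}

func main() {
	c := &container{name: "web", onFailureAction: "restart"}
	c.processHealthCheckFailure()
	c.cleanup()
}
```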
Fixes: #17777
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Also do not return (and immediately suppress) an error if no health
check is defined for a given container.
Makes listing 100 containers around 10 percent faster.
[NO NEW TESTS NEEDED]
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
The podman healthchecks are implemented using systemd timers. This works
great, but it will never work on non-systemd distros. Currently the logic
always assumes systemd is available and will fail with an error, so users
are forced to always run with `--no-healthcheck` to disable healthchecks
that are defined in an image, for example. This is annoying and IMO
unnecessary; we should just default to no healthcheck on these systems.
First, use the systemd build tag to disable it at build time if this tag
is not used.
Second, make sure systemd is used as init before trying to use
healthchecks; it may not be, for example, when we are run in a container.
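A hypothetical sketch of how the two layers could look: a file gated by the systemd build tag, plus a runtime check that systemd is actually init (checking /run/systemd/system is the conventional way to detect that):
```go
//go:build systemd

package healthcheck

// Hypothetical sketch: this file is only compiled with "-tags systemd";
// a build without the tag would compile a stub instead. At runtime, also
// verify that systemd is actually the init system, e.g. when running
// inside a container.

import "os"

func systemdIsInit() bool {
	// systemd creates this directory when it is running as PID 1.
	fi, err := os.Stat("/run/systemd/system")
	return err == nil && fi.IsDir()
}

func healthChecksSupported() bool {
	return systemdIsInit()
}
```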
[NO NEW TESTS NEEDED] We do not have any non systemd VMs in CI AFAIK.
Fixes #16644
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Startup healthchecks are similar to K8S startup probes, in that
they are a separate check from the regular healthcheck that runs
before it. If the startup healthcheck fails repeatedly, the
associated container is restarted.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Make sure that the on-failure actions only kick in once the health check
has passed its retries. Also fix race conditions on reading/writing the
log.
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Package `io/ioutil` was deprecated in golang 1.16, preventing podman from
building under Fedora 37. Fortunately, functionally identical
replacements are provided by the packages `io` and `os`. Replace all
usage of `io/ioutil` symbols with appropriate substitutions
according to the golang docs.
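For reference, the usual substitutions from the Go documentation, shown in a small runnable example:
```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// The usual one-to-one replacements from the deprecation notes:
//   ioutil.ReadFile  -> os.ReadFile
//   ioutil.WriteFile -> os.WriteFile
//   ioutil.ReadAll   -> io.ReadAll
//   ioutil.TempDir   -> os.MkdirTemp
//   ioutil.TempFile  -> os.CreateTemp
func main() {
	dir, err := os.MkdirTemp("", "example")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "hello.txt")
	if err := os.WriteFile(path, []byte("hello\n"), 0o600); err != nil {
		panic(err)
	}

	f, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	data, err := io.ReadAll(f)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(data))
}
```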
Signed-off-by: Chris Evich <cevich@redhat.com>
For systems that have extreme robustness requirements (edge devices,
particularly those in difficult to access environments), it is important
that applications continue running in all circumstances. When the
application fails, Podman must restart it automatically to provide this
robustness. Otherwise, these devices may require customer IT to
physically gain access to restart, which can be prohibitively difficult.
Add a new `--on-failure` flag that supports four actions:
- **none**: Take no action.
- **kill**: Kill the container.
- **restart**: Restart the container. Do not combine the `restart`
action with the `--restart` flag. When running inside of
a systemd unit, consider using the `kill` or `stop`
action instead to make use of systemd's restart policy.
- **stop**: Stop the container.
To remain backwards compatible, **none** is the default action.
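A minimal, hypothetical sketch of how the action could be dispatched once the retries are exhausted (names invented):
```go
package main

import "fmt"

// Hypothetical sketch: dispatch over the four supported on-failure actions
// once the healthcheck has exhausted its retries.

type container struct{ name string }

func (c *container) kill() error    { fmt.Println("killing", c.name); return nil }
func (c *container) stop() error    { fmt.Println("stopping", c.name); return nil }
func (c *container) restart() error { fmt.Println("restarting", c.name); return nil }

func (c *container) processOnFailureAction(action string) error {
	switch action {
	case "none":
		return nil // default: take no action
	case "kill":
		return c.kill()
	case "restart":
		return c.restart()
	case "stop":
		return c.stop()
	default:
		return fmt.Errorf("unsupported on-failure action %q", action)
	}
}

func main() {
	c := &container{name: "edge-app"}
	_ = c.processOnFailureAction("restart")
}
```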
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
We now use the golang error wrapping format specifier `%w` instead of
the deprecated github.com/pkg/errors package.
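A small standalone example of the standard-library style:
```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// Wrapping with %w keeps the error chain intact, so errors.Is/errors.As
// still work without github.com/pkg/errors.
func main() {
	_, err := os.Open("/no/such/file")
	wrapped := fmt.Errorf("running healthcheck: %w", err)
	fmt.Println(errors.Is(wrapped, fs.ErrNotExist)) // true
}
```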
[NO NEW TESTS NEEDED]
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Previously, if a container had healthchecks disabled in the
docker-compose.yml file and the user did a `podman inspect <container>`,
they would have an incorrect output:
```
"Healthcheck":{
"Test":[
"CMD-SHELL",
"NONE"
],
"Interval":30000000000,
"Timeout":30000000000,
"Retries":3
}
```
After a quick change, the correct output is now the result:
```
"Healthcheck":{
"Test":[
"NONE"
]
}
```
Additionally, I extracted the hard-coded strings that were used for
comparisons into constants in `libpod/define` to prevent a similar issue
from recurring.
Closes: #14493
Signed-off-by: Jake Correnti <jcorrenti13@gmail.com>
Previously, health status events were not being generated at all. Both
the API and `podman events` now generate health_status events.
```
{"status":"health_status","id":"ae498ac3aa6c63db8b69a37583a6eae1a9cefbdbdbeeadcf8e1d66d745f0df63","from":"localhost/healthcheck-demo:latest","Type":"container","Action":"health_status","Actor":{"ID":"ae498ac3aa6c63db8b69a37583a6eae1a9cefbdbdbeeadcf8e1d66d745f0df63","Attributes":{"containerExitCode":"0","image":"localhost/healthcheck-demo:latest","io.buildah.version":"1.26.1","maintainer":"NGINX Docker Maintainers \u003cdocker-maint@nginx.com\u003e","name":"healthcheck-demo"}},"scope":"local","time":1656082205,"timeNano":1656082205882271276,"HealthStatus":"healthy"}
```
```
2022-06-24 11:06:04.886238493 -0400 EDT container health_status ae498ac3aa6c63db8b69a37583a6eae1a9cefbdbdbeeadcf8e1d66d745f0df63 (image=localhost/healthcheck-demo:latest, name=healthcheck-demo, health_status=healthy, io.buildah.version=1.26.1, maintainer=NGINX Docker Maintainers <docker-maint@nginx.com>)
```
Signed-off-by: Jake Correnti <jcorrenti13@gmail.com>
* Replace "setup", "lookup", "cleanup", "backup" with
"set up", "look up", "clean up", "back up"
when used as verbs. Also replace variations of those.
* Improve language in a few places.
Signed-off-by: Erik Sjölund <erik.sjolund@gmail.com>
It seems we are ignoring output from the healthcheck session.
Open a valid pipe to the healthcheck session in order to read its output.
Use a common pipe for both `stdout/stderr`, since that was the previous
behaviour as well.
Signed-off-by: Aditya R <arajan@redhat.com>
The health check result is stored in the container state. Since the
state can change or might not even be set, we have to retrieve the current
state before we try to read the health check result.
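A simplified, hypothetical sketch of the pattern (not the actual libpod types):
```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical sketch: refresh the in-memory state before reading the
// healthcheck result, since the cached copy may be stale or not populated.

type containerState struct{ HealthStatus string }

type container struct {
	lock  sync.Mutex
	state *containerState
}

// stand-in for reloading the state from the database
func (c *container) syncState() error {
	if c.state == nil {
		c.state = &containerState{HealthStatus: "healthy"}
	}
	return nil
}

func (c *container) healthCheckStatus() (string, error) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if err := c.syncState(); err != nil {
		return "", err
	}
	return c.state.HealthStatus, nil
}

func main() {
	c := &container{}
	status, err := c.healthCheckStatus()
	fmt.Println(status, err)
}
```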
Fixes #11687
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
We missed bumping the go module, so let's do it now :)
* Automated go code with github.com/sirkon/go-imports-rename
* Manually via `vgrep podman/v2` the rest
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
Most of the built-in golang functions like os.Stat and
os.Open report errors that include the file system object
path. We should not wrap these errors and put the file path
in a second time, causing stuttering of errors when they
get presented to the user.
This patch tries to clean up a bunch of these errors.
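A small standalone example of the stuttering this avoids:
```go
package main

import (
	"fmt"
	"os"
)

// os.Open already includes the path in its error, so wrapping it with the
// path again produces stuttering output for the user.
func main() {
	_, err := os.Open("/etc/foo.conf")

	// Stuttering: the path appears twice.
	fmt.Println(fmt.Errorf("unable to read /etc/foo.conf: %w", err))
	// unable to read /etc/foo.conf: open /etc/foo.conf: no such file or directory

	// Better: let the original error speak for itself.
	fmt.Println(err)
	// open /etc/foo.conf: no such file or directory
}
```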
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This was added with an earlier exec rework, and honestly is very
confusing. Podman is printing an error message, but the error had
nothing to do with Podman; it was the executable we ran inside
the container that errored, and per `podman run` convention we
should set the Podman exit code to the process's exit code and
print no error.
Signed-off-by: Matthew Heon <mheon@redhat.com>
With the advent of Podman 2.0.0 we crossed the magical barrier of go
modules. While we were able to continue importing all packages inside
of the project, the project could not be vendored anymore from the
outside.
Move the go module to new major version and change all imports to
`github.com/containers/libpod/v2`. The renaming of the imports
was done via `gomove` [1].
[1] https://github.com/KSubedi/gomove
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
Instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collisions of health check log files when the log path is set by the user.
Fixes: #5915
Signed-off-by: Brent Baude <bbaude@redhat.com>
Add the ability to attach to a running container. The tunnel side of this is not enabled yet, as we still have work to do on the endpoints and plumbing.
Add the ability to exec a command in a running container. The tunnel side is also being deferred for the same reason.
Signed-off-by: Brent Baude <bbaude@redhat.com>
As part of the rework of exec sessions, we need to address them
independently of containers. In the new API, we need to be able
to fetch them by their ID, regardless of what container they are
associated with. Unfortunately, our existing exec sessions are
tied to individual containers; there's no way to tell what
container a session belongs to and retrieve it without getting
every exec session for every container.
This adds a pointer to the container an exec session is
associated with to the database. The sessions themselves are
still stored in the container.
Exec-related APIs have been restructured to work with the new
database representation. The originally monolithic API has been
split into a number of smaller calls to allow more fine-grained
control of lifecycle. Support for legacy exec sessions has been
retained, but in a deprecated fashion; we should remove this in
a few releases.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
`gocritic` is a powerful linter that helps in preventing certain kinds
of errors as well as enforcing a coding style.
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
There were many situations that made exec act funky with input. Pipes didn't work as expected, nor did sending input before the shell opened.
Thinking about it, it seemed as though the issues were because of how os.Stdin buffers (it doesn't). Dropping this input had some weird consequences.
Instead, read from os.Stdin as a bufio.Reader, allowing the input to buffer before passing it to the container.
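A minimal sketch of the approach with a made-up helper:
```go
package main

import (
	"bufio"
	"io"
	"os"
)

// Wrap os.Stdin (which does no buffering of its own) in a bufio.Reader so
// input typed or piped in before the session is ready is buffered rather
// than dropped, then stream it to the container's stdin.
func streamStdin(containerStdin io.Writer) error {
	reader := bufio.NewReader(os.Stdin)
	_, err := io.Copy(containerStdin, reader)
	return err
}

func main() {
	// For illustration, echo buffered stdin back to stdout.
	_ = streamStdin(os.Stdout)
}
```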
Signed-off-by: Peter Hunt <pehunt@redhat.com>