If /sys/kernel/mm/transparent_hugepage/enabled is set to always, the shim
process uses transparent huge pages, which inflates its memory usage.
For example:
ps -efo pid,rss,comm | grep shim
PID RSS COMMAND
2614 7464 containerd-shim
I don't think the shim needs huge pages, and turning them off for the
shim saves a noticeable amount of memory.
After setting THP_DISABLE=true:
ps -efo pid,comm,rss
PID COMMAND RSS
1629841 containerd-shim 5648
containerd
|
|--shim1 --start
|
|--shim2 (this shim will on host)
|
|--runc create (when containerd send create request by ttrpc)
|
|--runc init (this is the pid 1 in container)
We should set thp_disabled=1 in shim1 --start. If we set it in shim2,
huge pages have already been allocated by the time func main() runs, and
setting thp_disabled does not undo huge pages that are already in place.
So we set thp_disabled=1 in shim1: shim2 inherits the setting from its
parent process shim1 and therefore starts with huge pages already
disabled.
For runc processes, we need to set thp_disabled='before' in shim2 after
fork() and before execve(), so we use cmd.pre_exec to do this.
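
A minimal, Linux-only sketch of the mechanism described above, not the code from this change. It assumes the flag is toggled with prctl(PR_SET_THP_DISABLE), which is declared by hand here so the example needs no external crate. The helper names set_thp_disabled, thp_disabled and command_with_thp are hypothetical.

```rust
use std::io;
use std::os::raw::{c_int, c_ulong};
use std::os::unix::process::CommandExt;
use std::process::Command;

// Option values from <linux/prctl.h> (Linux only).
const PR_SET_THP_DISABLE: c_int = 41;
const PR_GET_THP_DISABLE: c_int = 42;

extern "C" {
    // Declared by hand to avoid depending on the `libc` crate.
    fn prctl(option: c_int, ...) -> c_int;
}

/// Sets or clears the "THP disable" flag of the caller. The flag is
/// inherited across fork() and preserved across execve().
pub fn set_thp_disabled(disabled: bool) -> io::Result<()> {
    let zero: c_ulong = 0;
    // The kernel requires the unused arguments to be zero.
    let rc = unsafe { prctl(PR_SET_THP_DISABLE, disabled as c_ulong, zero, zero, zero) };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

/// Reads the current "THP disable" flag of the caller.
pub fn thp_disabled() -> io::Result<bool> {
    let zero: c_ulong = 0;
    let rc = unsafe { prctl(PR_GET_THP_DISABLE, zero, zero, zero, zero) };
    if rc < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(rc == 1)
    }
}

/// Builds a command whose child applies `disabled` after fork() and
/// before execve(), leaving the parent's own flag untouched.
pub fn command_with_thp(program: &str, disabled: bool) -> Command {
    let mut cmd = Command::new(program);
    // pre_exec is unsafe: the closure runs in the forked child and must
    // only do async-signal-safe work. A single prctl() call qualifies.
    unsafe {
        cmd.pre_exec(move || set_thp_disabled(disabled));
    }
    cmd
}
```

With helpers like these, shim1 would call set_thp_disabled(true) before spawning shim2, and shim2 would build its runc commands through command_with_thp so the value is applied only in the child.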
It's a very small change, so I figured it's simpler to open a PR than to open an issue first.
The sync `state` method returns `Container`, but the async one returns `Vec<usize>`. I couldn't find an explanation for why they differ, so I assume it's a mistake. From a user's perspective, too, I want a `Container` rather than a `Vec<usize>`.
Signed-off-by: Andrew Baxter <i@isandrew.com>
In cgroup v2 we should use the cgroup.procs file when adding a process (https://www.man7.org/linux/man-pages/man7/cgroups.7.html). The add_tasks function was writing to the cgroup.threads file, which is only available in threaded mode. In either case our intent is to add the process, not its individual threads, so we should use add_task_by_tgid. See https://github.com/kata-containers/cgroups-rs/pull/104 for when this was added.
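
For illustration only, this is roughly what adding a process by TGID amounts to at the filesystem level. It is a std-only sketch, not the cgroups-rs implementation, and add_process is a hypothetical name.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Moves a whole process (all of its threads) into the cgroup at
/// `cgroup_dir` by writing its PID (thread group ID) to `cgroup.procs`.
pub fn add_process(cgroup_dir: &Path, pid: u32) -> io::Result<()> {
    // The file is provided by cgroupfs, so it is opened without `create`.
    let mut file = OpenOptions::new()
        .write(true)
        .open(cgroup_dir.join("cgroup.procs"))?;
    // Only one process can be migrated per write(2) call.
    file.write_all(pid.to_string().as_bytes())
}
```

Writing to cgroup.threads instead would move a single thread and only works inside a threaded subtree, which is the problem described above.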
Signed-off-by: James Sturtevant <jstur@microsoft.com>
This commit adds cgroup v2 support for collecting metrics in the shim.
Additionally, it uses the CPU controller instead of the CPUAcct
controller for reporting CPU metrics back to containerd.
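
As a rough illustration of where CPU metrics come from on cgroup v2, the sketch below parses the flat key/value contents of the CPU controller's cpu.stat file. The CpuStat struct and parse_cpu_stat function are hypothetical and not part of the shim. The three keys read here are the ones the kernel documents as always present.

```rust
/// CPU usage counters from a cgroup v2 `cpu.stat` file, in microseconds.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct CpuStat {
    pub usage_usec: u64,
    pub user_usec: u64,
    pub system_usec: u64,
}

/// Parses the "key value" lines of `cpu.stat`, ignoring keys it does not
/// know and lines whose value is not an unsigned integer.
pub fn parse_cpu_stat(content: &str) -> CpuStat {
    let mut stat = CpuStat::default();
    for line in content.lines() {
        let mut parts = line.split_whitespace();
        if let (Some(key), Some(Ok(value))) =
            (parts.next(), parts.next().map(str::parse::<u64>))
        {
            match key {
                "usage_usec" => stat.usage_usec = value,
                "user_usec" => stat.user_usec = value,
                "system_usec" => stat.system_usec = value,
                _ => {} // e.g. nr_periods, nr_throttled
            }
        }
    }
    stat
}
```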
Signed-off-by: jiaxiao zhou <jiazho@microsoft.com>
The "read" side of the container stdout/stderr fifo is opened
by containerd, while the "write" side is opened by the
container process, which differs slightly from the golang shim.
If containerd shuts down and closes the read fd, a write to
stdout/stderr fails with EPIPE and the container process is
killed by SIGPIPE. In this commit, the shim opens the "read"
side again so that at least one "read" side stays open
at all times.
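
The idea can be sketched as follows, assuming Linux and using only the standard library plus a hand-declared mkfifo. The create_fifo and open_keepalive_reader helpers are hypothetical, and this is not the shim's actual code.

```rust
use std::ffi::CString;
use std::fs::{File, OpenOptions};
use std::io;
use std::os::raw::{c_char, c_int};
use std::os::unix::ffi::OsStrExt;
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

extern "C" {
    // Declared by hand to avoid the `libc` crate; mode_t is u32 on Linux.
    fn mkfifo(path: *const c_char, mode: u32) -> c_int;
}

// O_NONBLOCK as defined on Linux x86_64 and aarch64.
const O_NONBLOCK: i32 = 0o4000;

/// Creates a fifo at `path`, readable and writable by the owner only.
pub fn create_fifo(path: &Path) -> io::Result<()> {
    let c_path = CString::new(path.as_os_str().as_bytes())
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    let rc = unsafe { mkfifo(c_path.as_ptr(), 0o600) };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

/// Opens an extra "read" side of the fifo. O_NONBLOCK keeps open() from
/// waiting for a writer. As long as the returned File is alive, writers
/// keep a reader even if every other reader closes its end.
pub fn open_keepalive_reader(path: &Path) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(O_NONBLOCK)
        .open(path)
}
```

The handle returned by open_keepalive_reader is never read from. It only has to outlive the container process's writes.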
Signed-off-by: Tianyang Zhang <burning9699@gmail.com>
Because the second invocation of the shim doesn't have the containerd pipe passed to it, a shim that wants to communicate over the pipe has to parse the arguments on its own. This change makes the library, which has already parsed the arguments, pass all of them on, allowing shims to use the containerd address.
Signed-off-by: James Sturtevant <jstur@microsoft.com>
This PR follows up on @wedsonaf's #126 and passes the filters through the list input.
Note that this is not backward compatible, but it is a step closer to protocol
conformance.
Bump containerd-shim-protos from 0.2.0 to 0.3.0 to
include changes made in #95.
Because the update in #95 is not backward compatible, we need a major
version update.
Signed-off-by: Tim Zhang <tim@hyper.sh>
We use File::from_raw_fd() to create the tty io file. When the io copy
thread ends, it drops the file object, which closes the fd. But
because we made three file objects from the same fd, the fd gets closed
three times. If another opened file has taken over this fd number in the
meantime, the second or third drop may close that other file's fd.
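
One way to avoid the repeated close, shown here only as a sketch and not necessarily what this commit does, is to take ownership of the fd once and give each consumer its own duplicate via try_clone. The tty_handles function is a hypothetical name.

```rust
use std::fs::File;
use std::io;

/// Returns three independent handles to the same open file. `tty` is the
/// single owner of the original fd (e.g. created once with
/// `File::from_raw_fd`); the other two handles are dup()-style clones with
/// their own fd numbers, so each drop closes a different fd.
pub fn tty_handles(tty: File) -> io::Result<(File, File, File)> {
    let second = tty.try_clone()?;
    let third = tty.try_clone()?;
    Ok((tty, second, third))
}
```

Dropping any of the three handles closes only that handle's fd, so a later drop can never hit an fd number that has been reused by an unrelated file.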
Signed-off-by: Zhang Tianyang <burning9699@gmail.com>
This allows us to use the `==` and `!=` operators to compare instances
of `Kind`, which is useful when we require that a snapshot be of some
specific kind (e.g., committed) before performing an operation on it.
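
A small sketch of what deriving the comparison traits enables. The variant list and the require_committed helper are assumptions made for illustration and may not match the crate's actual definition.

```rust
/// Illustrative stand-in for the snapshot `Kind` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    View,
    Active,
    Committed,
}

/// Rejects any snapshot that is not committed, using `!=` directly
/// instead of a `match` or `matches!` on the enum.
pub fn require_committed(kind: Kind) -> Result<(), String> {
    if kind != Kind::Committed {
        return Err(format!("expected a committed snapshot, got {:?}", kind));
    }
    Ok(())
}
```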
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This is needed by remote snapshotters: once they report "already
exists", containerd tries to find the snapshot via `list`. If
it's not implemented, the "already exists" trick to prevent
layer download doesn't work.
This is still missing a filtering function, but allows remote
snapshotters to work.
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This allows us to use `serde` to serialize and deserialize the
information about a particular snapshot so that we can write it
to and read it from storage.
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This is just a simplification of the code to use `map_err` and
the question mark operator to improve readability.
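
The pattern looks roughly like this. The read_pid function is a made-up example, not code from this patch.

```rust
use std::fs;
use std::path::Path;

/// Reads a PID from a file. Each fallible step converts its error with
/// `map_err`, and `?` returns early, replacing nested `match` blocks.
pub fn read_pid(path: &Path) -> Result<u32, String> {
    let text = fs::read_to_string(path)
        .map_err(|e| format!("read {}: {}", path.display(), e))?;
    let pid = text
        .trim()
        .parse::<u32>()
        .map_err(|e| format!("parse pid: {}", e))?;
    Ok(pid)
}
```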
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This allows implementers to express which grpc status should be
returned to clients on errors.
Prior to this patch, all errors got converted to "internal"
errors (`tonic::Status::internal`), which doesn't work when
a specific status is needed. For example, remote snapshotters
need to return "already exists" to indicate that a layer
doesn't need to be downloaded.
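
To keep the example free of external crates, the sketch below uses a local `Code` enum as a stand-in for `tonic::Code`. `SnapshotError` and both functions are hypothetical and only illustrate the before/after behavior described above.

```rust
/// Local stand-in for the few `tonic::Code` values used in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    AlreadyExists,
    NotFound,
    Internal,
}

/// Hypothetical error type a snapshotter implementation might return.
#[derive(Debug)]
pub enum SnapshotError {
    AlreadyExists(String),
    NotFound(String),
    Other(String),
}

/// Before: every error collapsed into "internal".
pub fn status_code_before(_err: &SnapshotError) -> Code {
    Code::Internal
}

/// After: the implementer's error decides which status the client sees.
pub fn status_code_after(err: &SnapshotError) -> Code {
    match err {
        SnapshotError::AlreadyExists(_) => Code::AlreadyExists,
        SnapshotError::NotFound(_) => Code::NotFound,
        SnapshotError::Other(_) => Code::Internal,
    }
}
```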
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
The stdin fifo fd should be closed when containerd
calls close io; otherwise runc-shim would hold on to
this fd and keep the copy console thread alive forever.
Signed-off-by: Zhang Tianyang <burning9699@gmail.com>
A pid file is created under the bundle path after
the exec process is created, so it should also be deleted
before the process is deleted.
Signed-off-by: Zhang Tianyang <burning9699@gmail.com>
If an error is encountered when running the OCI command, the shim process
should walk the OCI command log file and retrieve the last error.
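
A rough sketch of such a walk, under the assumption that the log file holds one JSON object per line and that error entries contain the exact text "level":"error". The format and the last_error_line helper are assumptions, not taken from this commit.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Scans the log file from top to bottom and returns the last line that
/// is marked as an error, or `None` if there is no such line.
pub fn last_error_line(log_path: &Path) -> io::Result<Option<String>> {
    let reader = BufReader::new(File::open(log_path)?);
    let mut last = None;
    for line in reader.lines() {
        let line = line?;
        // Assumed marker for an error entry in a JSON-lines log.
        if line.contains(r#""level":"error""#) {
            last = Some(line);
        }
    }
    Ok(last)
}
```

A real implementation would more likely parse each line as JSON and return only the message field. The substring check just keeps the sketch dependency-free.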
Signed-off-by: Zhang Tianyang <burning9699@gmail.com>