Runc
March 12, 2021
runC code review
The Open Container Initiative (OCI, https://opencontainers.org/about/overview/) maintains specifications for operating-system process and application containers. Currently, OCI has two specifications:
- Runtime Specification (https://github.com/opencontainers/runtime-spec) and
- Image Specification (https://github.com/opencontainers/image-spec).
The runtime-spec describes how a runtime runs a container from an unpacked OCI image (a filesystem bundle) on a given OS platform. A list of OCI-compliant runtimes is available at https://github.com/opencontainers/runtime-spec/blob/master/implementations.md.
runC (https://github.com/opencontainers/runc) is a CLI tool for spawning and running containers according to the OCI specification.
1. Introduction
I am not going to go through all the code, but I am going to focus on the main aspects, especially on Linux. Some of the basics will also be taken for granted, so if you need a refresher, follow the links above.
2. Big picture
As per the runtime spec, runC must support the following operations:
- state <container-id>: queries the state of an existing container.
- create <container-id> <path-to-bundle>: creates a new container based on the config.json file found in the bundle.
- start <container-id>: starts the user-specified process in a container that was set up by create. The container must be in the created state.
- kill <container-id> <signal>: sends the specified signal to the container process.
- delete <container-id>: deletes the container, which must be in the stopped state.
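The state requirements above can be summarised as a small lookup. The following is a minimal sketch, not runC's code: the Allowed helper and the State type are hypothetical names introduced for illustration, and the state names are the ones used by the runtime spec.

```go
package main

import "fmt"

// State is a container status as named by the OCI runtime spec.
type State string

const (
	Creating State = "creating"
	Created  State = "created"
	Running  State = "running"
	Stopped  State = "stopped"
)

// Allowed reports whether op may be applied to an existing container that is
// currently in state cur. Hypothetical helper: runC performs equivalent
// checks in several places rather than in one function.
func Allowed(cur State, op string) bool {
	switch op {
	case "state":
		return true // querying is always possible for an existing container
	case "start":
		return cur == Created
	case "kill":
		return cur == Created || cur == Running
	case "delete":
		return cur == Stopped
	}
	return false
}

func main() {
	for _, op := range []string{"start", "kill", "delete"} {
		fmt.Printf("%-6s on a running container allowed: %v\n", op, Allowed(Running, op))
	}
}
```

The create operation is left out on purpose, since it applies to a container id that does not exist yet rather than to an existing container.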
The runC codebase is mostly Go, with a small amount of C.
Code base layout
Creating a container
When runC is invoked as
runc create <container-id>
the following happens:
- Every time runc is called, the code inside github.com/opencontainers/runc/libcontainer/nsenter executes before the Go runtime boots. Because the "_LIBCONTAINER_INITPIPE" environment variable is not set[1], that code returns immediately and the Go runtime takes over.
- runc create: checks the bundle and figures out what needs to be done, as we will see later in detail. It then creates a runc init subprocess (which will eventually become the PID 1 process inside the container), with "_LIBCONTAINER_INITPIPE" set to an fd used for communication between the two processes.
- runc init: nsenter runs and, because "_LIBCONTAINER_INITPIPE" is set to a valid fd, nsexec() runs and sets up all of the namespaces that were set in config.json. This involves creating a bunch of processes; at the end it returns and the Go runtime boots up.
- runc init: after nsexec() returns, factory.StartInitialization() inside init.go runs. It performs all of the Go initialization that is necessary to set up a container.
- runc init: at the very end of all of that, runc init stops executing and waits to be told to execve(2) the container process (as set in config.json).
- runc create: exits because it has nothing else to do.
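The decision made at the top of every invocation (is this a plain CLI call or a runc init bootstrap?) hinges on one environment variable. In runC that check lives in C; the sketch below re-expresses the idea in Go for readability. The initPipeFD helper is a hypothetical name, and the handling of malformed values is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// initPipeEnv is the variable name quoted in the text above.
const initPipeEnv = "_LIBCONTAINER_INITPIPE"

// initPipeFD interprets the variable's value. ok is false when the value is
// empty or is not a non-negative integer, meaning this is not an init run.
// Assumption: the real C code may treat a malformed value as a fatal error
// instead; this sketch folds that case into ok == false.
func initPipeFD(value string) (fd int, ok bool) {
	if value == "" {
		return -1, false
	}
	n, err := strconv.Atoi(value)
	if err != nil || n < 0 {
		return -1, false
	}
	return n, true
}

func main() {
	fd, ok := initPipeFD(os.Getenv(initPipeEnv))
	if !ok {
		fmt.Println("no init pipe: behave like a normal CLI invocation")
		return
	}
	fmt.Printf("init pipe on fd %d: bootstrap the container first\n", fd)
}
```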
Running a container
- runc start: signals the runc init process to start running the user's process inside the container. It does so by opening and reading from the FIFO that runc init is blocked writing to.
runc start <container-id>
3. create.go
Execution of runc create starts in the command's action function inside the create.go file.
One of the options passed to runc create is the name of the file to write the process id to (--pid-file).
The function revisePidFile converts the pid file name to an absolute path and stores it back in the context.
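Making the pid file path absolute matters because the working directory changes later, when the bundle is entered, and a relative path would then point somewhere else. A minimal sketch of the conversion, assuming nothing more than the standard library (absPidFile is a hypothetical name, not the runC function):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// absPidFile returns pidFile as an absolute path, resolved against the
// current working directory. An empty value means the option was not given
// and is passed through unchanged.
func absPidFile(pidFile string) (string, error) {
	if pidFile == "" {
		return "", nil
	}
	return filepath.Abs(pidFile)
}

func main() {
	p, err := absPidFile("container.pid")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p)
}
```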
setupSpec reads the path to the bundle directory, which contains the config.json file. The function then loads config.json and decodes it into a *specs.Spec object.
It also validates the process section of the spec inside validateProcessSpec.
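To make the load-and-validate step concrete, here is a minimal sketch that decodes a tiny subset of config.json and applies a few sanity checks to the process section. The Spec and Process types below are cut-down stand-ins for specs.Spec, parseSpec is a hypothetical name, and the exact set of checks is an assumption rather than a copy of validateProcessSpec.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"path/filepath"
)

// Process and Spec mirror a tiny subset of the OCI config.json schema.
// The real type is specs.Spec from the runtime-spec Go bindings.
type Process struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

type Spec struct {
	Version string   `json:"ociVersion"`
	Process *Process `json:"process"`
}

// parseSpec decodes config.json content and checks the process section.
// The checks chosen here are assumptions made for this sketch.
func parseSpec(data []byte) (*Spec, error) {
	var spec Spec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, fmt.Errorf("decoding config.json: %w", err)
	}
	if spec.Process == nil {
		return nil, errors.New("process property must not be empty")
	}
	if !filepath.IsAbs(spec.Process.Cwd) {
		return nil, errors.New("cwd must be an absolute path")
	}
	if len(spec.Process.Args) == 0 {
		return nil, errors.New("args must not be empty")
	}
	return &spec, nil
}

func main() {
	cfg := []byte(`{"ociVersion":"1.0.2","process":{"args":["sh"],"cwd":"/"}}`)
	spec, err := parseSpec(cfg)
	if err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println("will run:", spec.Process.Args)
}
```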
startContainer is the main function; it creates the container based on the specs.Spec struct.
startContainer
This function (in utils_linux.go) starts by getting the container id from the context, then calls createContainer.
It then creates the runner struct and runs it, passing in spec.Process.
createContainer
In createContainer (also in utils_linux.go), the function CreateLibcontainerConfig creates a new libcontainer configuration (*configs.Config) from a given specification and a cgroup name.
Then the function loadFactory returns a Linux-based container factory, based in the root directory, that is used for execing containers.
The LinuxFactory has, among others, the following fields:
- the path to the executable that will be run as the init process
- the arguments that select the function to invoke when the init process starts.
factory.Create takes the container id and the config and creates a LinuxContainer in a stopped state.
It starts with some validations, then creates a directory with the correct permissions that will act as the container's root (state) directory.
The returned LinuxContainer holds:
- the container id
- the path to the root
- the config struct
- the path to the executable, /proc/self/exe
- the init args to pass to the executable, init
- the cgroup manager
run(specs.Process)
The run function executes the init process inside the contained environment. It:
- creates the libcontainer Process to be executed with the arguments from the spec
- starts execution of the process
Execution continues in container.Start(process), which does two things, referred to here as (1) and (2).
In (1), createExecFifo creates a FIFO. runc create creates the container and the init process, preparing everything needed for the user's process to start. Before init does the final execve into the user's code, it blocks on the execFifo (by attempting to write to it). This blocks until another process opens the FIFO for reading, which is what runc start does (https://groups.google.com/a/opencontainers.org/g/dev/c/ZKIFytzvilE).
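The blocking behaviour of the FIFO can be reproduced in a few lines. In the sketch below a goroutine stands in for the init process (the writer) and the calling function stands in for runc start (the reader). The file name, mode and the byte written are illustrative choices; fifoHandshake is a hypothetical name.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// fifoHandshake creates a FIFO in a temporary directory, starts a writer
// that blocks until a reader shows up, then plays the reader and returns
// what it read.
func fifoHandshake() (string, error) {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "exec.fifo") // illustrative name
	if err := syscall.Mkfifo(path, 0o622); err != nil {
		return "", err
	}

	writerErr := make(chan error, 1)
	go func() {
		// "init" side: opening write-only blocks until a reader opens the FIFO.
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			writerErr <- err
			return
		}
		_, err = w.Write([]byte("0"))
		if cerr := w.Close(); err == nil {
			err = cerr
		}
		writerErr <- err
	}()

	// "start" side: opening read-only releases the blocked writer.
	r, err := os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return "", err
	}
	defer r.Close()
	data, err := io.ReadAll(r) // returns at EOF, once the writer has closed
	if err != nil {
		return "", err
	}
	if err := <-writerErr; err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	got, err := fifoHandshake()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("reader received %q; the writer would now exec the user process\n", got)
}
```

The point of the design is that no polling or extra signalling channel is needed: the kernel parks the writer in open(2) until a reader arrives.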
(2) is where the namespaces are set up and the child process is run inside the contained environment. This is done through a series of forks, which we are going to go through in detail.
The function newParentProcess creates the process that will eventually be the parent process.
The process is then started; when the start call returns, the child process has been set up.
The flow at a high level is as below:
- the parent process, created by newParentProcess, starts the command, which creates a new process executing the binary /proc/self/exe and invoking the init function.
- since init.go imports github.com/opencontainers/runc/libcontainer/nsenter, execution is transferred to the C code in nsexec.c, inside void nsexec(void).
- the create.go process then transfers the bootstrap config data to the init process
- the init process reads the bootstrap configuration from the create.go process
- the init process updates oom_score_adj (requires a privileged user)
- if there is a need to configure namespaces, the init process is made non-dumpable: prctl(PR_SET_DUMPABLE, 0, 0, 0, 0)
- the init process then creates sync_child_pipe and sync_grandchild_pipe for communicating with the child and the grandchild respectively
- the init process then calls clone_parent(), which uses clone(2) with CLONE_PARENT | SIGCHLD to create the child.
- the child process:
  - checks from the bootstrap config data whether there is a need to join namespaces, then joins all the provided namespaces (config.namespaces)
  - next, unshares namespaces, but this is not done in one go:
    - first, it unshares the user namespace: if config.cloneflags has the CLONE_NEWUSER flag set, the user namespace is unshared:

        if (config.cloneflags & CLONE_NEWUSER) {
            if (unshare(CLONE_NEWUSER) < 0)
                bail("failed to unshare user namespace");
            config.cloneflags &= ~CLONE_NEWUSER;
            // --elided
        }

    - since the child does not have the privileges to do the mappings, it signals the parent (the init process) to do the mapping.
- the parent (init) process:
  - updates /proc/%d/uid_map and /proc/%d/gid_map for the child
- the child process:
  - makes the process non-dumpable
  - becomes root in the namespace with setresuid(0, 0, 0), since the rest of the unshare calls require root privileges in the child process.
  - unshares all of the remaining namespaces except the cgroup namespace: unshare(config.cloneflags & ~CLONE_NEWCGROUP)
  - creates the grandchild. The reason for forking again is so that we can enter the new PID namespace: calls to unshare(2) with the CLONE_NEWPID flag cause children subsequently created by the caller to be placed in a different PID namespace from the caller. These calls do not, however, change the PID namespace of the calling process, because doing so would change the caller's idea of its own PID (as reported by getpid()), which would break many applications and libraries. (https://man7.org/linux/man-pages/man7/pid_namespaces.7.html)
  - sends the grandchild pid to the parent process.
  - exits.
- the parent (init) process:
  - receives the grandchild pid from the child
  - sends the child and grandchild pids to the create.go process
- the grandchild process:
  - calls setsid, setuid and setgid
- the create.go process:
  - signals the grandchild to unshare the cgroup namespace.
- the grandchild process:
  - calls unshare(CLONE_NEWCGROUP)
  - returns from nsexec, and the Go runtime takes over. Control goes back to init.go.
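One step in the flow above, the update of uid_map and gid_map, boils down to writing a few text lines into a procfs file. Each line has the form "inside-id outside-id length". The sketch below only renders that text and writes nothing to /proc; the IDMap type and formatIDMap are hypothetical names, and the example mapping values are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// IDMap is one mapping range: Size IDs starting at ContainerID inside the
// user namespace correspond to IDs starting at HostID outside it.
type IDMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// formatIDMap renders mappings in the line format the kernel expects in
// /proc/<pid>/uid_map and /proc/<pid>/gid_map: "inside outside length".
func formatIDMap(maps []IDMap) string {
	var b strings.Builder
	for _, m := range maps {
		fmt.Fprintf(&b, "%d %d %d\n", m.ContainerID, m.HostID, m.Size)
	}
	return b.String()
}

func main() {
	// Map root inside the namespace to unprivileged host user 1000 (example values).
	fmt.Print(formatIDMap([]IDMap{{ContainerID: 0, HostID: 1000, Size: 1}}))
}
```

This is why the parent has to do the write: it runs outside the new user namespace and has the privileges the freshly unshared child lacks.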
init.go
At this point, nsexec has finished joining and unsharing the required namespaces.
From now on, the Go runtime runs all the initialization that is necessary to set up a container.
During the initialization, the create.go and init.go processes synchronize with each other.
- newContainerInit reads the config from the create.go process through the pipe messageSockPair. From the config, it creates a linuxStandardInit struct:

    return &linuxStandardInit{
        pipe:          pipe,
        consoleSocket: consoleSocket,
        parentPid:     unix.Getppid(),
        config:        config,
        fifoFd:        fifoFd,
    }, nil
- The Init() function does the hard work of setting up the container.
  Inside the init.go process:
  - set up the network: configure the container's networking stack
  - set up the network routes: create entries in the route table as the container is started
  - prepareRootfs: set up the devices, mount points, and filesystems for use inside a new mount namespace
  - set up the console
  - set up the hostname
  - apply the AppArmor profile
  - write the sysctls
  - configure the readonly paths
  - configure the masked paths
  - set no_new_privs (NoNewPrivileges)
  - sync with the parent.
  Inside the create.go process:
  - setupRlimit
  - call the prestart and CreateRuntime hooks
  - then sync with the child
  We continue in the init.go process:
  - initialize seccomp at this point if NoNewPrivileges is not set, because loading the filter is then a privileged operation that has to happen before capabilities are dropped
  - configure the capabilities, apply bounds, set up the user
  - close the pipe to signal the create.go process that we are done.
  - then attempt to write to the FIFO before the execve into the user's process. init.go blocks on the execFifo until another process opens the FIFO for reading (which is what runc start does).
  - otherwise (NoNewPrivileges is set), set the seccomp profile as close to execve as possible, reducing the number of syscalls that need to be allowed/exempted.
  - call all the startContainer hooks
  - finally, unix.Exec into the user process.
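The "close the pipe to signal that we are done" step in the list above relies on a basic property of pipes and sockets: a reader sees end-of-file once the last writer has closed its end. A minimal sketch of that idea follows. It uses os.Pipe and a goroutine in place of runC's socket pair and separate processes, which is a simplification, and waitForClose is a hypothetical name.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// waitForClose blocks until the other end of the pipe is closed and reports
// how many bytes arrived before that. A goroutine stands in for the init
// process; the caller stands in for the create process.
func waitForClose() (int, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return 0, err
	}
	defer r.Close()

	go func() {
		// "init" side: no message is written; closing the pipe is the signal.
		w.Close()
	}()

	// "create" side: ReadAll returns once it sees EOF, i.e. once w is closed.
	data, err := io.ReadAll(r)
	return len(data), err
}

func main() {
	n, err := waitForClose()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("peer closed the pipe after sending %d bytes\n", n)
}
```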
4. run.go
To be continued ..