Runc

March 12, 2021   

runC code review

The Open Container Initiative (OCI, https://opencontainers.org/about/overview/) maintains specifications for standards on operating system process and application containers. Currently, OCI has two specifications:

  1. Runtime Specification (https://github.com/opencontainers/runtime-spec) and
  2. Image Specification (https://github.com/opencontainers/image-spec).

The runtime-spec is concerned with the runtime that runs a container unpacked from an OCI-compliant image on a given OS platform. You can find OCI-compliant runtimes here: https://github.com/opencontainers/runtime-spec/blob/master/implementations.md.

runC (https://github.com/opencontainers/runc) is a CLI tool for spawning and running containers according to the OCI specification.

1. Introduction

I am not going to go through all the code, but I am going to focus on the main aspects, especially on Linux. Some of the basics will also be taken for granted, so if you need a refresher follow the links above.

2. Big picture

As per the run-time spec, runC must support the following operations:

  1. state - This operation queries the state of an existing container.

     state <container-id>
    
  2. create - This operation creates a new container based on the config.json input file.

     create <container-id> <path-bundle>
    
  3. start - This operation starts the process specified in the container created in step 2 above. The container must already exist and be in the created state.

      start <container-id>
    
  4. kill - This operation sends the specified signal to the running process in the container.

       kill <container-id> <signal>
    
  5. delete - This operation deletes the container, which must be in a stopped state.

    delete <container-id> 
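
The state requirements attached to these operations can be sketched as a small lookup. This is only an illustration of the lifecycle rules in the runtime spec (start needs a created container, kill needs a created or running one, delete needs a stopped one); the names checkOp and errBadState are hypothetical and do not come from runC.

```go
package main

import (
	"errors"
	"fmt"
)

// Status mirrors the container states named in the OCI runtime spec.
type Status string

const (
	Creating Status = "creating"
	Created  Status = "created"
	Running  Status = "running"
	Stopped  Status = "stopped"
)

// errBadState is a hypothetical error value used only by this sketch.
var errBadState = errors.New("operation not allowed in current state")

// checkOp reports whether op may be applied to a container in state s.
func checkOp(op string, s Status) error {
	switch op {
	case "state":
		return nil // querying is always allowed
	case "start":
		if s == Created {
			return nil
		}
	case "kill":
		if s == Created || s == Running {
			return nil
		}
	case "delete":
		if s == Stopped {
			return nil
		}
	default:
		return fmt.Errorf("unknown operation %q", op)
	}
	return fmt.Errorf("%s on %s container: %w", op, s, errBadState)
}

func main() {
	for _, op := range []string{"start", "kill", "delete"} {
		fmt.Println(op, "on a created container:", checkOp(op, Created))
	}
}
```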

The runC codebase is written in Go, with a small amount of C.

Code base layout

Creating a container

When runC is invoked as

    runc create <container-id>

the following happens:

  1. Every time runc is called, the C code inside github.com/opencontainers/runc/libcontainer/nsenter executes first (before the Go runtime boots). However, because the "_LIBCONTAINER_INITPIPE" environment variable is not set [1], that code returns immediately and the Go runtime takes over.
static int initpipe(void)
{
	int pipenum;
	char *initpipe, *endptr;

	initpipe = getenv("_LIBCONTAINER_INITPIPE");   //1
	if (initpipe == NULL || *initpipe == '\0')
		return -1;

	pipenum = strtol(initpipe, &endptr, 10);
	if (*endptr != '\0')
		bail("unable to parse _LIBCONTAINER_INITPIPE");

	return pipenum;
}
  2. runc create – checks the bundle and figures out what needs to be done, as we will see in detail later. It then creates a runc init subprocess (which will eventually become the PID 1 process inside the container) with "_LIBCONTAINER_INITPIPE" set to an fd used for communication between the two processes.

  3. runc init – nsenter runs and, because "_LIBCONTAINER_INITPIPE" is set to a valid fd, nsexec() runs and sets up all of the namespaces requested in config.json. This involves creating several processes; at the end nsexec() returns and the Go runtime boots.

  4. runc init – after nsexec() returns, factory.StartInitialization() inside init.go runs. It runs all of the Go initialization that is necessary to set up a container.

  5. runc init – at the very end of all of that, runc init stops executing and waits to be told to execve(2) the container process (as set in config.json).

  6. runc create – exits because it has nothing else to do.
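
The environment-variable check that decides between "plain runc" and "runc init" can be mirrored in Go. The following is a rough sketch of what the C initpipe() helper shown above does, not code from runC; parseInitPipe is a hypothetical name, and strconv.Atoi is slightly stricter than strtol (it rejects leading whitespace).

```go
package main

import (
	"fmt"
	"strconv"
)

// parseInitPipe takes the value of _LIBCONTAINER_INITPIPE.
// ok is false when the variable is unset or empty (the "not runc init" case);
// err is non-nil when the value is present but not a decimal number.
func parseInitPipe(value string) (fd int, ok bool, err error) {
	if value == "" {
		return -1, false, nil
	}
	fd, err = strconv.Atoi(value)
	if err != nil {
		return -1, false, fmt.Errorf("unable to parse _LIBCONTAINER_INITPIPE: %w", err)
	}
	return fd, true, nil
}

func main() {
	for _, v := range []string{"", "3", "3x"} {
		fd, ok, err := parseInitPipe(v)
		fmt.Printf("%q -> fd=%d ok=%t err=%v\n", v, fd, ok, err)
	}
}
```

In a real program the argument would come from os.Getenv("_LIBCONTAINER_INITPIPE"); taking it as a parameter just keeps the sketch easy to exercise.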

Running a container

  1. runc start – signals the runc init process to start running the user’s process inside the container by opening the FIFO on which runc init is blocked and reading from it.
    runc start <container-id>

3. create.go

For runc create, execution starts from this function inside the create.go file:

create.go

Action: func(context *cli.Context) error {
		if err := checkArgs(context, 1, exactArgs); err != nil {
			return err
		}
		if err := revisePidFile(context); err != nil {
			return err
		}
		spec, err := setupSpec(context)
		if err != nil {
			return err
		}
		status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
		if err != nil {
			return err
		}
		// exit with the container's exit status so any external supervisor is
		// notified of the exit with the correct exit status.
		os.Exit(status)
		return nil
	}

One of the options passed to runc create is the name of the file to write the process ID to.

The function revisePidFile converts the PID file name to an absolute path and stores it back in the context.

setupSpec reads the path to the bundle directory, which contains the config.json file. It then loads config.json and decodes it into a *specs.Spec object. The decoded Spec is also validated, inside validateProcessSpec.
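
To make the decode-then-validate step concrete, here is a minimal sketch using only the standard library. miniSpec is a hypothetical, heavily trimmed stand-in for specs.Spec, and the three checks are written in the spirit of validateProcessSpec; the exact checks and messages in runC may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// miniProcess and miniSpec are illustrative types, not the real specs.Spec.
type miniProcess struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

type miniSpec struct {
	Version string       `json:"ociVersion"`
	Process *miniProcess `json:"process"`
}

// decodeSpec parses the contents of a config.json and validates the process section.
func decodeSpec(data string) (*miniSpec, error) {
	var s miniSpec
	if err := json.Unmarshal([]byte(data), &s); err != nil {
		return nil, err
	}
	if s.Process == nil {
		return nil, errors.New("process property must not be empty")
	}
	if len(s.Process.Args) == 0 {
		return nil, errors.New("args must not be empty")
	}
	if !strings.HasPrefix(s.Process.Cwd, "/") {
		return nil, errors.New("cwd must be an absolute path")
	}
	return &s, nil
}

func main() {
	spec, err := decodeSpec(`{"ociVersion":"1.0.2","process":{"args":["sh"],"cwd":"/"}}`)
	if err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println("will run:", spec.Process.Args)
}
```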

startContainer is the main function; it creates the container based on the specs.Spec struct.

startContainer

utils_linux.go

func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
	id := context.Args().First()
	if id == "" {
		return -1, errEmptyID
	}

	//---elided

	container, err := createContainer(context, id, spec)
	if err != nil {
		return -1, err
	}


	//--more elided

	r := &runner{
		enableSubreaper: !context.Bool("no-subreaper"),
		shouldDestroy:   true,
		container:       container,
		listenFDs:       listenFDs,
		notifySocket:    notifySocket,
		consoleSocket:   context.String("console-socket"),
		detach:          context.Bool("detach"),
		pidFile:         context.String("pid-file"),
		preserveFDs:     context.Int("preserve-fds"),
		action:          action,
		criuOpts:        criuOpts,
		init:            true,
		logLevel:        logLevel,
	}
	return r.run(spec.Process)
}

This function starts by getting the container id from the context, then calls createContainer. It then creates the runner struct and runs it, passing in spec.Process.

createContainer

utils_linux.go

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
	rootlessCg, err := shouldUseRootlessCgroupManager(context)
	if err != nil {
		return nil, err
	}
	config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
		CgroupName:       id,
		UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
		NoPivotRoot:      context.Bool("no-pivot"),
		NoNewKeyring:     context.Bool("no-new-keyring"),
		Spec:             spec,
		RootlessEUID:     os.Geteuid() != 0,
		RootlessCgroups:  rootlessCg,
	})
	if err != nil {
		return nil, err
	}

	factory, err := loadFactory(context)
	if err != nil {
		return nil, err
	}
	return factory.Create(id, config)
}

The function CreateLibcontainerConfig creates a new libcontainer configuration (*configs.Config) from a given specification and a cgroup name.

The function loadFactory then returns a Linux-based container factory, based in the root directory, for exec'ing containers.

The LinuxFactory has the following fields:

l := &LinuxFactory{
		Root:      root,  
		InitPath:  "/proc/self/exe",                          //1
		InitArgs:  []string{os.Args[0], "init"},              //2
		Validator: validate.New(),
		CriuPath:  "criu",
	}
  1. The path to the executable that the init process runs; /proc/self/exe resolves to the runc binary itself.
  2. The arguments passed to that executable; init selects runc's init subcommand.
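
As a rough illustration of how these two fields combine, the sketch below builds (but does not start) a command that re-executes a binary with init as its argument. newInitCommand is a hypothetical helper written for this post; it only shows the standard os/exec calls involved.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// newInitCommand builds a command that runs initPath with initArgs as argv.
// initArgs[0] becomes argv[0] of the new process; the rest are its arguments.
func newInitCommand(initPath string, initArgs []string) *exec.Cmd {
	if len(initArgs) == 0 {
		return exec.Command(initPath)
	}
	cmd := exec.Command(initPath, initArgs[1:]...)
	cmd.Args[0] = initArgs[0] // keep the name the user invoked us with
	return cmd
}

func main() {
	cmd := newInitCommand("/proc/self/exe", []string{os.Args[0], "init"})
	fmt.Println("path:", cmd.Path)
	fmt.Println("args after argv[0]:", cmd.Args[1:])
}
```

Executing /proc/self/exe rather than a path looked up on disk means the child runs exactly the binary that is already running, even if the file on disk has since been replaced.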

factory.Create takes the container id and config and creates a linuxContainer in a stopped state. It starts with some validations, then creates a directory with the correct permissions under the factory Root; this is the containerRoot in the struct below, which holds the container's state.

c := &linuxContainer{
		id:            id,                              //1
		root:          containerRoot,                   //2
		config:        config,                          //3
		initPath:      l.InitPath,                      //4
		initArgs:      l.InitArgs,                      //5
		criuPath:      l.CriuPath,     
		newuidmapPath: l.NewuidmapPath,
		newgidmapPath: l.NewgidmapPath,
		cgroupManager: l.NewCgroupsManager(config.Cgroups, nil), //6
	}
  1. the container id
  2. the path to the container's directory under the factory Root
  3. the config struct
  4. the path to the executable, /proc/self/exe
  5. the init args to pass to the executable, init
  6. the cgroup manager
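
The directory-creation step of factory.Create can be sketched as follows. This is an assumption-laden illustration rather than runC's code: createContainerRoot and errIDInUse are hypothetical names, the 0o711 mode is an assumption, and the sketch skips the id validation that a real implementation needs before joining paths.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errIDInUse is a hypothetical error for an id that already has a directory.
var errIDInUse = errors.New("container with given ID already exists")

// createContainerRoot creates <root>/<id> and refuses to reuse an existing one.
func createContainerRoot(root, id string) (string, error) {
	dir := filepath.Join(root, id)
	if _, err := os.Stat(dir); err == nil {
		return "", errIDInUse
	} else if !os.IsNotExist(err) {
		return "", err
	}
	if err := os.MkdirAll(dir, 0o711); err != nil {
		return "", err
	}
	return dir, nil
}

// demo creates the same id twice in a temporary root and returns both results.
func demo() (first, second error) {
	root, err := os.MkdirTemp("", "runc-root-")
	if err != nil {
		return err, nil
	}
	defer os.RemoveAll(root)
	_, first = createContainerRoot(root, "demo")
	_, second = createContainerRoot(root, "demo")
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("first create:", first)
	fmt.Println("second create:", second)
}
```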

run(specs.Process)

The run function executes the init process inside the contained environment.

func (r *runner) run(config *specs.Process) (int, error) {
	//-- elided
	process, err := newProcess(*config, r.init, r.logLevel)  //1
	if err != nil {
		return -1, err
	}
	//-- elided
	var (
		detach = r.detach || (r.action == CT_ACT_CREATE)
	)
	//-- elided
	switch r.action {
	case CT_ACT_CREATE:
		err = r.container.Start(process)                      //2
	case CT_ACT_RESTORE:
		err = r.container.Restore(process, r.criuOpts)
	case CT_ACT_RUN:
		err = r.container.Run(process)
	default:
		panic("Unknown action")
	}
	//-- elided
	return status, err
}
  1. creates the libcontainer Process to be executed with the arguments from the spec
  2. starts execution of the process

Here is container.Start(process):

func (c *linuxContainer) Start(process *Process) error {
	//-- elided
	if process.Init {
		if err := c.createExecFifo(); err != nil {     //1
			return err
		}
	}
	if err := c.start(process); err != nil {          //2
		if process.Init {
			c.deleteExecFifo()
		}
		return err
	}
	return nil
}

createExecFifo (1) creates a FIFO. runc create creates the container and its init process, preparing everything needed for the user’s process to start. Before the init process does the final execve into the user’s code, it blocks on the execFifo (by attempting to write to it). It stays blocked until another process opens the FIFO for reading, which is what runc start does. (https://groups.google.com/a/opencontainers.org/g/dev/c/ZKIFytzvilE)
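
The blocking behaviour of the FIFO is easy to reproduce in isolation. The sketch below is not runC's code: it uses a goroutine in place of the runc init process and the calling goroutine in place of runc start, and the helper names are made up. It assumes a Unix-like system where syscall.Mkfifo is available.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// waitOnFifo plays the role of runc init: opening a FIFO for writing blocks
// until something opens it for reading.
func waitOnFifo(path string, done chan<- error) {
	f, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		done <- err
		return
	}
	defer f.Close()
	_, err = f.Write([]byte("0"))
	done <- err
}

// release plays the role of runc start: opening for reading unblocks the writer.
func release(path string) (string, error) {
	f, err := os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return "", err
	}
	defer f.Close()
	buf := make([]byte, 1)
	n, err := f.Read(buf)
	return string(buf[:n]), err
}

// handshake creates a FIFO in a temporary directory and runs both sides.
func handshake() (string, error) {
	dir, err := os.MkdirTemp("", "fifo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "exec.fifo")
	if err := syscall.Mkfifo(path, 0o600); err != nil {
		return "", err
	}
	done := make(chan error, 1)
	go waitOnFifo(path, done)
	data, err := release(path)
	if err != nil {
		return "", err
	}
	return data, <-done
}

func main() {
	data, err := handshake()
	fmt.Printf("reader got %q, err=%v\n", data, err)
}
```

Using a FIFO rather than a signal means the two commands need no shared parent: any later process that knows the path can release the waiting init.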

Step (2) is where the namespaces are set up and the child process is run inside the contained environment. This is done through a series of forks, which we are going to go through in detail.

func (c *linuxContainer) start(process *Process) error {
	parent, err := c.newParentProcess(process)         //1
	if err != nil {
		return newSystemErrorWithCause(err, "creating new parent process")
	}
	parent.forwardChildLogs()
	if err := parent.start(); err != nil {             //2
		return newSystemErrorWithCause(err, "starting container process")
	}
	//-- elided
	return nil
}

The function newParentProcess (1) creates the parent-side handle for the process that will become runc init. That process is started at (2); when the call returns, the child process has been set up.
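
One detail worth spelling out is how the child learns which fd to use. With os/exec, entry i of Cmd.ExtraFiles becomes file descriptor 3+i in the child, so the parent can compute the number and pass it through the environment. The sketch below shows that idea only; withInitPipe and demoEnv are hypothetical helpers, and an os.Pipe stands in for whatever channel the real code uses.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// stdioFdCount is the number of fds (stdin, stdout, stderr) that precede
// ExtraFiles in the child process.
const stdioFdCount = 3

// withInitPipe attaches childEnd as an inherited file and records the fd
// number the child will see in _LIBCONTAINER_INITPIPE.
// Note: appending to a nil cmd.Env gives the child only this variable.
func withInitPipe(cmd *exec.Cmd, childEnd *os.File) {
	cmd.ExtraFiles = append(cmd.ExtraFiles, childEnd)
	fd := stdioFdCount + len(cmd.ExtraFiles) - 1
	cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITPIPE="+strconv.Itoa(fd))
}

// demoEnv builds a command (without starting it) and returns the variable set.
func demoEnv() (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()
	cmd := exec.Command("/proc/self/exe", "init")
	withInitPipe(cmd, r)
	return cmd.Env[len(cmd.Env)-1], nil
}

func main() {
	env, err := demoEnv()
	fmt.Println(env, err)
}
```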

At a high level, the flow is as follows:

  • the parent process created in (1) starts the command, which creates a new process executing the binary /proc/self/exe with the init argument.
  • since init.go imports github.com/opencontainers/runc/libcontainer/nsenter, execution is transferred to the C function void nsexec(void) in nsexec.c before the Go runtime starts.
  • the main process then transfers the bootstrap config data to the init process
  • the init process reads the bootstrap configuration sent by the parent process
  • the init process updates oom_score_adj (requires a privileged user)
  • if namespaces need to be configured, the init process is made non-dumpable: prctl(PR_SET_DUMPABLE, 0, 0, 0, 0)
  • then the init process creates sync_child_pipe & sync_grandchild_pipe for communicating with the child and grandchild respectively


  • the init process then calls clone_parent(), which uses clone(2) with CLONE_PARENT | SIGCHLD to create the child as a sibling of the init process.

  • the child process:

    • checks the bootstrap config data for namespaces to join and, if there are any, joins all the provided namespaces (config.namespaces)

    • next, unshares namespaces, though not all in one go;

    • first, it unshares the user namespace: if config.cloneflags has the CLONE_NEWUSER flag set, the user namespace is unshared:

      		if (config.cloneflags & CLONE_NEWUSER) {
      			if (unshare(CLONE_NEWUSER) < 0)
      				bail("failed to unshare user namespace");
      			config.cloneflags &= ~CLONE_NEWUSER;
      			// -- elided
      		}
      		

    • since the child does not have the privileges to write its own mappings, it signals the parent to do the mapping.

  • the parent process:

    • updates /proc/<child-pid>/uid_map and /proc/<child-pid>/gid_map for the child


  • the child process:

    • makes the process non-dumpable

    • becomes root in the namespace with setresuid(0, 0, 0), since the remaining unshare calls require root privileges in the child process.

    • unshares all of the remaining namespaces, except the cgroup namespace: unshare(config.cloneflags & ~CLONE_NEWCGROUP)

    • the child process then creates the grandchild. The reason for forking again is to enter the new PID namespace. Calls to unshare(2) with the CLONE_NEWPID flag cause children subsequently created by the caller to be placed in a different PID namespace from the caller. These calls do not, however, change the PID namespace of the calling process, because doing so would change the caller’s idea of its own PID (as reported by getpid()), which would break many applications and libraries. (https://man7.org/linux/man-pages/man7/pid_namespaces.7.html)

    • sends the grand_child & child pids to the parent process.

    • the child process then exits.


  • the parent process:

    • receives the grand_child & child pids from the child
    • sends the pids to the create.go process
  • the grand_child process:

    • calls setsid, setuid and setgid
  • the create.go process:

    • sends a sync message telling the grandchild to unshare the cgroup namespace.
  • the grand_child process:

    • unshare(CLONE_NEWCGROUP)
    • returns from nsexec, and the Go runtime takes over. Control goes back to init.go.
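
The back-and-forth between the parent and the child in the flow above is a request/acknowledge protocol over a pipe. The sketch below simulates that pattern with two goroutines and an in-memory net.Pipe. It is a simplified model, not a port of nsexec.c: the message constants and function names are hypothetical, and the "work" the parent does is only recorded as text.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// Hypothetical one-byte sync messages; nsexec.c defines its own constants.
const (
	usermapPls byte = iota + 1 // child -> parent: please write my uid/gid maps
	usermapAck                 // parent -> child: maps are written
	recvPidPls                 // child -> parent: pids are ready to be collected
	recvPidAck                 // parent -> child: pids received, you may exit
)

func send(c net.Conn, m byte) error {
	_, err := c.Write([]byte{m})
	return err
}

func recv(c net.Conn) (byte, error) {
	buf := make([]byte, 1)
	_, err := io.ReadFull(c, buf)
	return buf[0], err
}

// request sends req and waits for the matching acknowledgement.
func request(c net.Conn, req, ack byte) error {
	if err := send(c, req); err != nil {
		return err
	}
	got, err := recv(c)
	if err != nil {
		return err
	}
	if got != ack {
		return fmt.Errorf("expected ack %d, got %d", ack, got)
	}
	return nil
}

// child plays the intermediate child: ask for mappings, hand over pids, exit.
func child(c net.Conn, errc chan<- error) {
	defer c.Close() // closing the pipe stands in for the child exiting
	if err := request(c, usermapPls, usermapAck); err != nil {
		errc <- err
		return
	}
	errc <- request(c, recvPidPls, recvPidAck)
}

// parent serves requests until the child closes its end of the pipe.
func parent(c net.Conn) ([]string, error) {
	var trace []string
	for {
		m, err := recv(c)
		if err == io.EOF {
			return trace, nil
		}
		if err != nil {
			return trace, err
		}
		switch m {
		case usermapPls:
			trace = append(trace, "write uid_map/gid_map")
			err = send(c, usermapAck)
		case recvPidPls:
			trace = append(trace, "receive pids")
			err = send(c, recvPidAck)
		default:
			err = fmt.Errorf("unexpected message %d", m)
		}
		if err != nil {
			return trace, err
		}
	}
}

// run wires both sides together and returns what the parent did, in order.
func run() ([]string, error) {
	p, c := net.Pipe()
	defer p.Close()
	errc := make(chan error, 1)
	go child(c, errc)
	trace, err := parent(p)
	if err != nil {
		return trace, err
	}
	return trace, <-errc
}

func main() {
	trace, err := run()
	fmt.Println(trace, err)
}
```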

init.go

At this point, nsexec has finished setting up the required namespaces. From now on, the Go runtime runs all the initialization that is necessary to set up a container. During initialization, the create.go process and the init.go process synchronize with each other.


func (l *LinuxFactory) StartInitialization() (err error) {
	// -- elided
	
	i, err := newContainerInit(it, pipe, consoleSocket, fifofd)  //1
	if err != nil {
		return err
	}

	return i.Init()                                              //2
}
  1. newContainerInit reads the config sent by the create.go process through the pipe (messageSockPair). From the config, it creates a linuxStandardInit.

       return &linuxStandardInit{
    			pipe:          pipe,
    			consoleSocket: consoleSocket,
    			parentPid:     unix.Getppid(),
    			config:        config,
    			fifoFd:        fifoFd,
    		}, nil
    	
    
  2. The Init() function does the hard work of setting up the container:

    Inside the init.go process.

    • setup the network - defines configuration for a container’s networking stack
    • setup the network route to create entries in the route table as the container is started
    • prepareRootfs - sets up the devices, mount points, and filesystems for use inside a new mount namespace
    • setup the console
    • setup the hostname
    • apply apparmor profile
    • write sysctl
    • configure readonly paths
    • configure maskpaths
    • set up nonewprivileges
    • sync with parent.

    Inside the create.go process

    • setupRlimit
    • call prestart and CreateRuntime hooks
    • then sync with the child

    We continue in the init.go process:

    • initialize seccomp at this point if NoNewPrivileges is not set (loading a filter is then a privileged operation, so it has to happen before capabilities are dropped)
    • configure the capabilities, apply bounds, set up the user
    • close the pipe to signal the create.go process that we are done.
    • then attempt to write to the FIFO before exec'ing into the user’s process. init.go blocks on the execFifo until another process opens the FIFO for reading (which is what runc start does).
    • if NoNewPrivileges is set, load the seccomp profile now, as close to execve as possible, reducing the number of syscalls that need to be allowed/exempted.
    • call all the StartContainer hooks
    • finally, unix.Exec into the user process.
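
The last step, replacing the init process with the user's program, can be sketched with the standard library alone. execUser is a hypothetical helper; it uses syscall.Exec where the real code uses unix.Exec, and it assumes a Unix-like system. On success the call never returns, because the process image is replaced, so the example deliberately passes a path that should not exist.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// execUser resolves args[0] and replaces the current process image with it.
// It only returns if something went wrong.
func execUser(args []string) error {
	if len(args) == 0 {
		return errors.New("no command given")
	}
	name, err := exec.LookPath(args[0])
	if err != nil {
		return err
	}
	return syscall.Exec(name, args, os.Environ())
}

func main() {
	// A path that should not exist, so the sketch returns an error instead
	// of replacing this process.
	err := execUser([]string{"/nonexistent/user-process"})
	fmt.Println("exec returned an error:", err != nil)
}
```

Because exec keeps the PID, the program named in config.json ends up being the same process that went through all the namespace and filesystem setup above.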

4. run.go

To be continued ..