A Firecracker release comes with two binaries - the firecracker and the jailer programs. The jailer brings even more isolation options to Firecracker by creating and securing a unique execution environment for each VMM.
§what can it do
- check the uniqueness and validity of the VMM id: maximum length of 64 characters, alphanumeric only (the jailer documentation also allows hyphens)
- assign a NUMA node
- check the existence of the exec_file
- run the VMM as a specific user / group
- assign cgroups
- assign the VMM into a dedicated network namespace
- the VMM can be daemonized
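As an illustration of the id rule from the list above, here is a minimal sketch in Python. It is not the jailer's own code. The function name is hypothetical, and accepting hyphens is an assumption based on my reading of the jailer documentation; drop the hyphen from the pattern if you want the stricter alphanumeric-only rule.

```python
import re

# Assumption: an id is 1 to 64 characters long, alphanumeric plus hyphen.
MAX_ID_LEN = 64
_ID_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_vmm_id(vmm_id: str) -> bool:
    """Return True when vmm_id satisfies the length and character rules."""
    if not 0 < len(vmm_id) <= MAX_ID_LEN:
        return False
    # fullmatch makes sure every character belongs to the allowed set
    return _ID_PATTERN.fullmatch(vmm_id) is not None
```

Uniqueness is not covered here: going by the process described in the next section, the jailer detects a duplicate id because the chroot_dir for that id already exists.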
§what does it do
This part comes from the jailer documentation[1]. When the jailer starts, it goes through the following process:
- all paths and the VMM id will be validated
- all open file descriptors based on /proc/<jailer-pid>/fd, except input, output and error, will be closed
- the <chroot_base>/<exec_file_name>/<id>/root directory will be created - this is the chroot_dir
  - exec_file_name is the last path component of exec_file (for example, that would be firecracker for /usr/bin/firecracker)
  - if the path already exists, the jailer will fail to start the VMM because the assumption is that the VMM IDs are unique
  - if exec_file is a link, the jailer will readlink the value and use the name of the file the link points to
- the exec_file will be copied to <chroot_base>/<exec_file_name>/<id>/root/<exec_file_name>
- the cgroups folder structure will be created; right now the jailer uses cgroup v1
  On most systems, this is mounted by default in /sys/fs/cgroup (should be mounted by the user otherwise). The jailer will parse /proc/mounts to detect where each of the controllers required in --cgroup can be found (multiple controllers may share the same path). For each identified location (referred to as <cgroup_base>), the jailer creates the <cgroup_base>/<exec_file_name>/<id> subfolder, and writes the current pid to <cgroup_base>/<exec_file_name>/<id>/tasks. Also, the value passed for each <cgroup_file> is written to the file. If --node is used, the corresponding values are written to the appropriate cpuset.mems and cpuset.cpus files.
- unshare() into a new mount namespace will be called, use pivot_root() to switch the old system root mount point with a new one based in chroot_dir, switch the current working directory to the new root, unmount the old root mount point, and call chroot into the current directory
- /dev/net/tun will be created inside of the jail using mknod
- /dev/kvm will be created inside of the jail using mknod
- the ownership of the chroot_dir, /dev/net/tun and /dev/kvm will be changed using chown based on the provided uid:gid
- if --netns <netns> is present, attempt to join the specified network namespace
- if --daemonize is specified, call setsid() and redirect STDIN, STDOUT, and STDERR to /dev/null
- privileges will be dropped by setting the provided uid:gid
- exec into <exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque> and forward any extra arguments provided to the jailer after --, where:
  - id: (string) - the id argument provided to the jailer
  - opaque: (number) - time calculated by the jailer that it spent doing its work
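The path layout and the final exec call from the steps above can be sketched in a few lines of Python. This only computes strings; it does not create directories, touch cgroups or exec anything, and it is not the jailer's implementation. The function names, the alpine-test id and the cpuset mount path in the usage lines are hypothetical; /srv/jailer is the SDK default chroot_base mentioned in this post.

```python
import posixpath


def jail_layout(chroot_base: str, exec_file: str, vmm_id: str) -> dict:
    """Compute the jail paths described in the documented process."""
    # exec_file_name is the last path component of exec_file
    exec_file_name = posixpath.basename(exec_file)
    # <chroot_base>/<exec_file_name>/<id>/root is the chroot_dir
    chroot_dir = posixpath.join(chroot_base, exec_file_name, vmm_id, "root")
    return {
        "exec_file_name": exec_file_name,
        "chroot_dir": chroot_dir,
        # the binary is copied to <chroot_dir>/<exec_file_name>
        "exec_copy": posixpath.join(chroot_dir, exec_file_name),
    }


def cgroup_dir(cgroup_base: str, exec_file_name: str, vmm_id: str) -> str:
    """Per-VMM subfolder created under each detected cgroup controller."""
    return posixpath.join(cgroup_base, exec_file_name, vmm_id)


def exec_argv(exec_file_name, vmm_id, start_time_us, start_time_cpu_us, extra_args=()):
    """Argument vector the jailer execs into; extra_args are those after `--`."""
    return [
        exec_file_name,
        f"--id={vmm_id}",
        f"--start-time-us={start_time_us}",
        f"--start-time-cpu-us={start_time_cpu_us}",
        *extra_args,
    ]


layout = jail_layout("/srv/jailer", "/usr/bin/firecracker", "alpine-test")
tasks_file = posixpath.join(
    cgroup_dir("/sys/fs/cgroup/cpuset", layout["exec_file_name"], "alpine-test"),
    "tasks",
)
```

Because the id is part of every path, two VMMs with the same id would collide on the chroot_dir, which is why the jailer refuses to start when that directory already exists.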
The jailer seems to be the proper way of running Firecracker VMMs. firectl, which I have discussed previously, has jailer support, and it was pretty easy to convert my existing VMMs. There are a couple of quirks in the firectl configuration, mostly that arguments must be passed explicitly. The Golang SDK supports defaults, like /srv/jailer for the chroot_base, but firectl does not properly use them internally, so just make sure you always pass them.
§how to do it
Here’s how I run my VMM via the jailer:
The above will start the Firecracker VMM via the /usr/bin/jailer binary.
I use readlink because my /usr/bin/firecracker is a link to /usr/bin/firecracker-v0.22.4-x86_64. If I don’t use readlink, the jailer for whatever reason creates <chroot_dir>/firecracker but attempts to launch the VMM from <chroot_dir>/firecracker-v0.22.4-x86_64 directory. readlink avoids that problem in my setup.
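The same idea, resolving the link before handing the path to the jailer, can be sketched in Python. The helper name is hypothetical. Note one difference from plain readlink: os.path.realpath follows the whole chain of links and returns an absolute path, while readlink without flags reads only one level.

```python
import os


def resolve_exec_file(path: str) -> str:
    """Follow symlinks so the exec file is passed under its real name."""
    # A path that is not a link resolves to itself (made absolute).
    return os.path.realpath(path)


def exec_file_name(path: str) -> str:
    """Last path component of the resolved binary, as used in the jail paths."""
    return os.path.basename(resolve_exec_file(path))
```

With the binary resolved up front, the directory name and the name of the binary launched inside the jail are derived from the same file name.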
I have assigned a unique id to my VMM and explicitly passed --chroot-base-dir. Had I not, I would have run into the firectl defaults quirk mentioned earlier. The rest is the standard Firecracker firectl stuff discussed in the previous write-ups.
All omitted arguments are set to their defaults, so things like uid:gid and the NUMA node will all be 0. Good for now.
Here’s what the chroot_dir structure looks like for a VMM with only a root file system:
- the root/alpine-base-root.ext4 is a link to the actual file system
- the root/vmlinux-v5.8 is a link to the actual kernel
§chroot strategy
The file system and the kernel linking is not done by the jailer. It's firectl doing it via the chroot strategy mechanism. The Golang SDK provides a default naive strategy. It's actually called like that, I'm not being cocky. The default strategy can be replaced with custom logic implementing the firecracker.HandlersAdapter interface.
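To make the linking step concrete, here is a minimal Python sketch of what a naive strategy boils down to. It is not the SDK's Go code, and the function name is hypothetical. It assumes hard links, which is my understanding of what the SDK does; a symbolic link pointing outside the jail would stop resolving once the process is chrooted.

```python
import os


def link_into_jail(chroot_dir: str, *host_files: str) -> list:
    """Hard-link each host file into chroot_dir under its base name."""
    linked = []
    for src in host_files:
        dst = os.path.join(chroot_dir, os.path.basename(src))
        # A hard link requires src and dst to live on the same file system.
        os.link(src, dst)
        linked.append(dst)
    return linked
```

A custom strategy could do something else at this point, for example copy the image so every VMM gets its own writable root file system instead of sharing one.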
So in AWS, one selects a base AMI and launches a VM from it. That creates a volume and subsequent VM starts use that volume. This could be a way forward to build something similar for Firecracker.
§closing words
I have subconsciously avoided touching the jailer before as I have seen it as a pretty complex feature. Considering what it gives, I must admit, it was very easy to get it in. I haven’t yet tried launching anything under a specific uid:gid but I do not expect any issues there.