The jailer

Making the Firecracker VMMs even more secure
A Firecracker release comes with two binaries: firecracker and jailer. The jailer adds further isolation to Firecracker by creating and securing a unique execution environment for each VMM.

§what can it do

  • check the uniqueness and validity of the VMM id (at most 64 characters, alphanumeric only)
  • assign a NUMA node
  • check the existence of the exec_file
  • run the VMM as a specific user / group
  • assign cgroups
  • assign the VMM into a dedicated network namespace
  • daemonize the VMM
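
The id rule from the first bullet is easy to check before handing a VMM over to the jailer. The following is a minimal sketch in POSIX shell, not the jailer's own code: is_valid_vmm_id is a hypothetical helper name, and it only covers the length and character checks (uniqueness is left to the jailer).

```shell
#!/bin/sh
# Hypothetical pre-flight check mirroring the id rule listed above:
# 1 to 64 characters, alphanumeric only. Not taken from the jailer source.
is_valid_vmm_id() {
    # Reject empty ids and ids longer than 64 characters.
    [ -n "$1" ] || return 1
    [ "${#1}" -le 64 ] || return 1
    # Reject any character outside A-Z, a-z, 0-9.
    case "$1" in
        *[!A-Za-z0-9]*) return 1 ;;
    esac
    return 0
}

for id in alpine1 "bad id"; do
    if is_valid_vmm_id "$id"; then
        echo "$id: valid"
    else
        echo "$id: invalid"
    fi
done
```

Running the check up front gives a clearer error than letting the launch fail later.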

§what does it do

This part comes from the jailer documentation[1]. When the jailer starts, it goes through the following process:

  • all paths and the VMM id will be validated
  • all open file descriptors based on /proc/<jailer-pid>/fd except input, output and error will be closed
  • the <chroot_base>/<exec_file_name>/<id>/root directory will be created - this is the chroot_dir
    • exec_file_name is the last path component of exec_file (for example, that would be firecracker for /usr/bin/firecracker)
    • if the path already exists, the jailer will fail to start the VMM because the assumption is that the VMM IDs are unique
    • if exec_file is a link, jailer will readlink the value and use the name of the link source
  • the exec_file will be copied to <chroot_base>/<exec_file_name>/<id>/root/<exec_file_name>
  • cgroups folder structure will be created; right now the jailer uses cgroup v1

On most systems, the cgroup hierarchy is mounted by default in /sys/fs/cgroup (otherwise it should be mounted by the user). The jailer will parse /proc/mounts to detect where each of the controllers required in --cgroup can be found (multiple controllers may share the same path). For each identified location (referred to as <cgroup_base>), the jailer creates the <cgroup_base>/<exec_file_name>/<id> subfolder and writes the current pid to <cgroup_base>/<exec_file_name>/<id>/tasks. The value passed for each <cgroup_file> is also written to that file. If --node is used, the corresponding values are written to the appropriate cpuset.mems and cpuset.cpus files.
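
The directory names described so far are all derived from the same three inputs. As a rough sketch (the function names are mine, and the cpuset path in the usage lines is only an assumed example of a <cgroup_base>), the layout can be computed like this:

```shell
#!/bin/sh
# Hypothetical helpers that only compute paths; they create nothing.

# chroot_dir <chroot_base> <exec_file> <id>
chroot_dir() {
    printf '%s/%s/%s/root\n' "$1" "$(basename "$2")" "$3"
}

# cgroup_dir <cgroup_base> <exec_file> <id>
cgroup_dir() {
    printf '%s/%s/%s\n' "$1" "$(basename "$2")" "$3"
}

# list_cgroup_v1_mounts: print the mount point and options of each cgroup v1
# mount in /proc/mounts (fields: device, mount point, type, options).
list_cgroup_v1_mounts() {
    awk '$3 == "cgroup" { print $2, $4 }' /proc/mounts
}

chroot_dir /srv/jailer /usr/bin/firecracker alpine1
# prints /srv/jailer/firecracker/alpine1/root
cgroup_dir /sys/fs/cgroup/cpuset /usr/bin/firecracker alpine1
# prints /sys/fs/cgroup/cpuset/firecracker/alpine1
```

list_cgroup_v1_mounts is not called here because its output depends on the host; on a cgroup v2-only system it prints nothing.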

  • unshare() will be called to enter a new mount namespace; pivot_root() will then swap the old system root mount point with a new one based in chroot_dir, the current working directory will be switched to the new root, the old root mount point will be unmounted, and chroot will be called into the current directory
  • /dev/net/tun will be created inside of the jail using mknod
  • /dev/kvm will be created inside of the jail using mknod
  • the ownership of the chroot_dir, /dev/net/tun and /dev/kvm will be changed using chown based on the provided uid:gid
  • if --netns <netns> is present, attempt to join the specified network namespace
  • if --daemonize is specified, call setsid() and redirect STDIN, STDOUT, and STDERR to /dev/null.
  • privileges will be dropped by setting the provided uid:gid
  • exec into <exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque> and forward any extra arguments provided to the jailer after --, where:
    • id: (string) - the id argument provided to jailer
    • opaque: (number) time calculated by the jailer that it spent doing its work
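
For orientation, this is roughly what a direct jailer invocation looks like when assembled by hand. It is a sketch only: the flag names are the ones I know from the jailer documentation, the set of required flags differs between Firecracker releases (--node in particular), and the uid and gid values as well as the jailer_cmd helper are made up for illustration. The function prints the command instead of running it.

```shell
#!/bin/sh
# jailer_cmd <id> <exec_file> <uid> <gid>
# Prints a jailer command line without executing it.
# Anything placed after a trailing "--" would be forwarded to the VMM.
jailer_cmd() {
    echo "/usr/bin/jailer --id $1 --node 0 --exec-file $2 --uid $3 --gid $4 --chroot-base-dir /srv/jailer"
}

# Illustrative values only; uid 123 and gid 100 are assumptions.
jailer_cmd alpine1 /usr/bin/firecracker-v0.22.4-x86_64 123 100
```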

The jailer seems to be the proper way of running Firecracker VMMs. firectl, which I have discussed previously, supports the jailer, and it was pretty easy to convert my existing VMMs. There are a couple of quirks in the firectl configuration, mostly that arguments must be passed explicitly. The Golang SDK supports defaults, like /srv/jailer for the chroot_base, but firectl does not properly use them internally, so make sure you always pass them.

§how to do it

Here’s how I run my VMM via the jailer:

sudo $GOPATH/bin/firectl \
    --jailer=/usr/bin/jailer \
    --exec-file=$(readlink /usr/bin/firecracker) \
    --id=alpine1 \
    --chroot-base-dir=/srv/jailer \
    --kernel=/firecracker/kernels/vmlinux-v5.8 \
    --root-drive=/firecracker/filesystems/alpine-base-root.ext4 \
    --cni-network=alpine \
    --veth-iface-name=alpine1 \
    --ncpus=1 \
    --memory=128

The above will start the Firecracker VMM via the /usr/bin/jailer binary.

I use readlink because my /usr/bin/firecracker is a link to /usr/bin/firecracker-v0.22.4-x86_64. If I don’t use readlink, the jailer for whatever reason creates <chroot_dir>/firecracker but attempts to launch the VMM from <chroot_dir>/firecracker-v0.22.4-x86_64 directory. readlink avoids that problem in my setup.
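
The difference between the link name and the link target is easy to reproduce without Firecracker. A small sketch using a throwaway directory (the exec_name helper is hypothetical; the file names are borrowed from my setup):

```shell
#!/bin/sh
# exec_name <path>: name of the link target if <path> is a symlink,
# otherwise the name of the file itself.
exec_name() {
    if [ -L "$1" ]; then
        basename "$(readlink "$1")"
    else
        basename "$1"
    fi
}

tmp=$(mktemp -d)
touch "$tmp/firecracker-v0.22.4-x86_64"
ln -s "$tmp/firecracker-v0.22.4-x86_64" "$tmp/firecracker"

basename "$tmp/firecracker"    # prints firecracker
exec_name "$tmp/firecracker"   # prints firecracker-v0.22.4-x86_64

rm -r "$tmp"
```

Note that readlink follows only one level of links, which is enough for a single versioned symlink like mine.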

I have assigned a unique id to my VMM and explicitly passed the --chroot-base-dir. If I had not, this would have happened. The rest is the standard Firecracker firectl stuff discussed in the previous write-ups.

All omitted arguments are set to their defaults, so things like uid:gid and the NUMA node will all be 0. Good for now.

Here’s what the chroot_dir structure looks like for a VMM with only a root file system:

$ sudo tree /srv/jailer/
/srv/jailer/
└── firecracker-v0.22.4-x86_64
    └── alpine1
        └── root
            ├── alpine-base-root.ext4
            ├── dev
            │   ├── kvm
            │   └── net
            │       └── tun
            ├── firecracker-v0.22.4-x86_64
            ├── run
            │   └── firecracker.socket
            └── vmlinux-v5.8
  • the root/alpine-base-root.ext4 is a link to the actual file system
  • the root/vmlinux-v5.8 is a link to the actual kernel

§chroot strategy

The file system and kernel linking is not done by the jailer. firectl does it via the chroot strategy mechanism. The Golang SDK provides a default naive strategy. It really is called that, I'm not being cocky. The default strategy can be replaced with custom logic implementing the firecracker.HandlersAdapter interface.
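
To make the linking less abstract, here is roughly what it amounts to in shell terms. This is my approximation and not the SDK's code: I am assuming hard links (a symlink pointing outside the jail would stop resolving after the chroot), and link_into_jail is a made-up helper. Hard links also mean the source files must live on the same file system as the chroot_base.

```shell
#!/bin/sh
# link_into_jail <chroot_dir> <file>...
# Hard-links each file into the jail root under its own base name.
link_into_jail() {
    jail_root="$1"
    shift
    for src in "$@"; do
        ln "$src" "$jail_root/$(basename "$src")" || return 1
    done
}

# Usage with the paths from the example above (not executed here):
# link_into_jail /srv/jailer/firecracker-v0.22.4-x86_64/alpine1/root \
#     /firecracker/kernels/vmlinux-v5.8 \
#     /firecracker/filesystems/alpine-base-root.ext4
```

A custom strategy could do something else at this step, for example copying the root file system instead of linking it, so that each VMM gets its own writable image.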

So in AWS, one selects a base AMI and launches a VM from it. That creates a volume and subsequent VM starts use that volume. This could be a way forward to build something similar for Firecracker.

§closing words

I have subconsciously avoided touching the jailer before as I have seen it as a pretty complex feature. Considering what it gives, I must admit, it was very easy to get it working. I haven't yet tried launching anything under a specific uid:gid but I do not expect any issues there.