A Firecracker release comes with two binaries: the firecracker and the jailer programs. The jailer brings even more isolation options to Firecracker by creating and securing a unique execution environment for each VMM.
§what can it do
- check the uniqueness and validity of the VMM id (maximum length of 64 characters, alphanumeric characters and hyphens only)
- assign a NUMA node
- check the existence of the exec_file
- run the VMM as a specific user / group
- assign cgroups
- assign the VMM into a dedicated network namespace
- daemonize the VMM
§what does it do
This part comes from the jailer documentation [1]. When the jailer starts, it goes through the following process:
- all paths and the VMM id will be validated
- all open file descriptors listed in /proc/<jailer-pid>/fd, except standard input, output and error, will be closed
- the <chroot_base>/<exec_file_name>/<id>/root directory will be created; this is the chroot_dir. exec_file_name is the last path component of exec_file (for example, that would be firecracker for /usr/bin/firecracker). If the path already exists, the jailer will fail to start the VMM because the assumption is that the VMM IDs are unique
- if exec_file is a link, the jailer will readlink the value and use the name of the file the link points to
- the exec_file will be copied to <chroot_base>/<exec_file_name>/<id>/root/<exec_file_name>
- the cgroups folder structure will be created; right now the jailer uses cgroup v1. On most systems, this is mounted by default in /sys/fs/cgroup (it should be mounted by the user otherwise). The jailer will parse /proc/mounts to detect where each of the controllers required in --cgroup can be found (multiple controllers may share the same path). For each identified location (referred to as <cgroup_base>), the jailer creates the <cgroup_base>/<exec_file_name>/<id> subfolder and writes the current pid to <cgroup_base>/<exec_file_name>/<id>/tasks. Also, the value passed for each <cgroup_file> is written to the file. If --node is used, the corresponding values are written to the appropriate cpuset.mems and cpuset.cpus files
- unshare() into a new mount namespace will be called. Then pivot_root() is used to switch the old system root mount point with a new one based in chroot_dir, the current working directory is switched to the new root, the old root mount point is unmounted, and chroot is called into the current directory
- /dev/net/tun will be created inside of the jail using mknod
- /dev/kvm will be created inside of the jail using mknod
- the ownership of the chroot_dir, /dev/net/tun and /dev/kvm will be changed using chown, based on the provided uid:gid
- if --netns <netns> is present, the jailer attempts to join the specified network namespace
- if --daemonize is specified, the jailer calls setsid() and redirects STDIN, STDOUT, and STDERR to /dev/null
- privileges will be dropped by setting the provided uid:gid
- exec into <exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque> and forward any extra arguments provided to the jailer after --, where:
  - id: (string) the id argument provided to the jailer
  - opaque: (number) time calculated by the jailer that it spent doing its work
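The path arithmetic in the steps above is easy to get wrong when scripting around the jailer, so here is a small Go sketch of it. It only computes strings. newJailLayout and execArgs are hypothetical helpers, /sys/fs/cgroup/cpuset stands in for one detected <cgroup_base>, and the two timing values are placeholders for the opaque numbers the jailer computes. The jailer itself does the real work: creating directories, writing cgroup files and exec'ing.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// jailLayout holds the paths the jailer derives for one VMM.
type jailLayout struct {
	ChrootDir string // <chroot_base>/<exec_file_name>/<id>/root
	ExecPath  string // where exec_file is copied inside the chroot
	CgroupDir string // <cgroup_base>/<exec_file_name>/<id>
	TasksFile string // file the jailer writes its pid into
}

// newJailLayout derives the chroot and cgroup paths from the inputs.
func newJailLayout(chrootBase, cgroupBase, execFile, id string) jailLayout {
	execName := filepath.Base(execFile) // last path component of exec_file
	chroot := filepath.Join(chrootBase, execName, id, "root")
	cg := filepath.Join(cgroupBase, execName, id)
	return jailLayout{
		ChrootDir: chroot,
		ExecPath:  filepath.Join(chroot, execName),
		CgroupDir: cg,
		TasksFile: filepath.Join(cg, "tasks"),
	}
}

// execArgs builds the argv the jailer finally execs into, forwarding any
// extra arguments that were given to the jailer after "--".
func execArgs(execFile, id string, startUs, startCPUUs int64, extra []string) []string {
	args := []string{
		filepath.Base(execFile),
		"--id=" + id,
		"--start-time-us=" + strconv.FormatInt(startUs, 10),
		"--start-time-cpu-us=" + strconv.FormatInt(startCPUUs, 10),
	}
	return append(args, extra...)
}

func main() {
	l := newJailLayout("/srv/jailer", "/sys/fs/cgroup/cpuset", "/usr/bin/firecracker", "alpine-vm1")
	fmt.Println(l.ChrootDir)
	fmt.Println(l.ExecPath)
	fmt.Println(l.TasksFile)
	fmt.Println(execArgs("/usr/bin/firecracker", "alpine-vm1", 1000, 20, []string{"--api-sock", "api.socket"}))
}
```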
The jailer seems to be the proper way of running Firecracker VMMs. firectl, which I have discussed previously, has jailer support, and it was pretty easy to convert my existing VMMs. There are a couple of quirks in the firectl configuration, mostly that arguments must be assigned explicitly. The Golang SDK supports defaults, like /srv/jailer for the chroot_base, but firectl does not use them properly internally, so just make sure you always pass them.
§how to do it
Here’s how I run my VMM via the jailer:
The above will start the Firecracker VMM via the /usr/bin/jailer binary.
I use readlink because my /usr/bin/firecracker is a link to /usr/bin/firecracker-v0.22.4-x86_64. If I don’t use readlink, the jailer, for whatever reason, creates <chroot_dir>/firecracker but attempts to launch the VMM from the <chroot_dir>/firecracker-v0.22.4-x86_64 directory. Using readlink avoids that problem in my setup.
I have assigned a unique id to my VMM and explicitly passed --chroot-base-dir. Had I not, this would have happened. The rest is the standard Firecracker firectl stuff discussed in the previous write-ups.
All omitted arguments are set to their defaults, so things like uid:gid and the NUMA node will all be 0. Good for now.
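To show what "always pass them" looks like at the jailer command line, here is a hypothetical Go helper. It renders every jailer option as an explicit flag and writes /srv/jailer and 0 for uid, gid and NUMA node instead of relying on anyone's defaults. The jailerOpts type is made up for this sketch and is not the SDK's JailerConfig. The flag names are the jailer's own (--id, --exec-file, --uid, --gid, --node, --chroot-base-dir, --netns, --daemonize), and extra arguments for Firecracker go after "--".

```go
package main

import (
	"fmt"
	"strconv"
)

// jailerOpts collects the jailer arguments to pass explicitly (hypothetical).
type jailerOpts struct {
	ID            string
	ExecFile      string
	UID, GID      int
	NumaNode      int
	ChrootBaseDir string
	NetNS         string // optional path to a network namespace
	Daemonize     bool
}

// args renders every option as an explicit flag; an empty chroot base is
// written out as /srv/jailer rather than left for a default to fill in.
func (o jailerOpts) args(extra ...string) []string {
	base := o.ChrootBaseDir
	if base == "" {
		base = "/srv/jailer"
	}
	a := []string{
		"--id", o.ID,
		"--exec-file", o.ExecFile,
		"--uid", strconv.Itoa(o.UID),
		"--gid", strconv.Itoa(o.GID),
		"--node", strconv.Itoa(o.NumaNode),
		"--chroot-base-dir", base,
	}
	if o.NetNS != "" {
		a = append(a, "--netns", o.NetNS)
	}
	if o.Daemonize {
		a = append(a, "--daemonize")
	}
	if len(extra) > 0 {
		// Everything after "--" is forwarded to the Firecracker binary.
		a = append(a, "--")
		a = append(a, extra...)
	}
	return a
}

func main() {
	o := jailerOpts{ID: "alpine-vm1", ExecFile: "/usr/bin/firecracker-v0.22.4-x86_64"}
	fmt.Println(append([]string{"/usr/bin/jailer"}, o.args()...))
}
```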
Here’s what the chroot_dir structure looks like for a VMM with only a root file system:
- root/alpine-base-root.ext4 is a link to the actual file system
- root/vmlinux-v5.8 is a link to the actual kernel
§chroot strategy
The file system and kernel linking is not done by the jailer. firectl does it via the chroot strategy mechanism. The Golang SDK provides a default, naive strategy. It’s actually called that, I’m not being cocky. The default strategy can be replaced with custom logic implementing the firecracker.HandlerAdapter interface.
In AWS, one selects a base AMI and launches a VM from it. That creates a volume, and subsequent starts of that VM use the same volume. A custom chroot strategy could be a way forward to build something similar for Firecracker.
§closing words
I had subconsciously avoided touching the jailer before, as I saw it as a pretty complex feature. Considering what it gives, I must admit it was very easy to get in. I haven’t yet tried launching anything under a specific uid:gid, but I do not expect any issues there.