§Dockerfiles are awesome
There is so much software out there packaged as Docker images. Operating systems, SQL and NoSQL databases, reverse proxies, compilers, everything. Safe to say, most of the software available as Docker containers is built from a common file format: the `Dockerfile`. Dockerfiles are awesome. They are recipes for getting a bit of software functional.
§how have I been building Firecracker VMMs so far
So far, all of my VMMs were built from Docker images using the following steps:
- pull / build a Docker image
- start a container with an additional volume where the host directory is a mounted `ext4` file system
- copy the operating system directories I needed to this other volume (see the sketch after this list)
- stop the container
- use the resulting `ext4` file system as the root VMM volume
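A rough sketch of the container-and-copy steps using the Docker Go SDK; instead of mounting a volume and copying directories, this variant exports the whole container file system as a tar stream. The image name and output path are placeholders and error handling is trimmed:

```go
// Sketch only: create a container from the source image and export its
// file system as a tar archive. Image name and output path are placeholders.
package main

import (
	"context"
	"io"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Create (but do not start) a container from the image we want to convert.
	created, err := cli.ContainerCreate(ctx, &container.Config{Image: "alpine:3.12"}, nil, nil, nil, "")
	if err != nil {
		panic(err)
	}
	defer cli.ContainerRemove(ctx, created.ID, types.ContainerRemoveOptions{Force: true})

	// Export the complete container file system as a tar stream.
	reader, err := cli.ContainerExport(ctx, created.ID)
	if err != nil {
		panic(err)
	}
	defer reader.Close()

	// Write the stream to disk; it can later be unpacked onto the ext4 volume.
	out, err := os.Create("rootfs.tar")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, reader); err != nil {
		panic(err)
	}
}
```

Unpacking `rootfs.tar` onto the mounted `ext4` file system gives essentially the same root volume as the manual copy.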
There isn’t much wrong with this process and it does work surprisingly well for the majority of the containers out there. This is also how Weaveworks Ignite works; it’s what the official Firecracker documentation suggests and what many write-ups on Firecracker describe.
§under the magnifier
The above approach gets us the first 98% of the work done. It’s okay. But certain important details are missing.
The most obvious one: after the conversion by copying, we lose the `ENTRYPOINT` information. The Docker image provides us with two commands: `ENTRYPOINT` and `CMD`. Both instruct the container which program to run and what arguments to pass when the container starts. Without the `ENTRYPOINT` and, optionally, the `CMD`, the resulting VMM will start but it will not execute anything. We have to somehow modify the file system post-copy and add the command to start what we want to start.
In my previous write-ups, I was adding a local service definition to the VMM during the copy stage. That is really cool, but the problem with this approach is that virtually every image out there has its own dedicated configuration. Even if we could assume that 99% of all Docker images use `docker-entrypoint.sh` as a conventional `ENTRYPOINT`, the `CMD` is going to differ.
Then, there are additional parameters affecting the `ENTRYPOINT`. There are the `WORKDIR` and `USER` commands, there is the `SHELL` command, and there are build arguments and environment variables.
By just copying the container file system, yes, we do get the final product. However, we are losing a lot of context and insight into what the result is actually supposed to do when it starts.
Finally, plenty of containers come with additional labels and exposed ports information. All of that is lost if we are not correlating the file system copy against the original `Dockerfile`.
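For illustration, most of that lost context sits right in the image configuration; a minimal sketch of reading it with the Docker Go SDK (the image tag is just an example):

```go
// Sketch only: read the image configuration that a plain file system copy throws away.
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	inspect, _, err := cli.ImageInspectWithRaw(ctx, "consul:1.9.4")
	if err != nil {
		panic(err)
	}

	cfg := inspect.Config
	fmt.Println("entrypoint:", cfg.Entrypoint)
	fmt.Println("cmd:", cfg.Cmd)
	fmt.Println("workdir:", cfg.WorkingDir)
	fmt.Println("user:", cfg.User)
	fmt.Println("env:", cfg.Env)
	fmt.Println("labels:", cfg.Labels)
	for port := range cfg.ExposedPorts {
		fmt.Println("expose:", port)
	}
}
```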
For sure, the manual build is doable but it won’t scale.
§anatomy of a Dockerfile
Back to a `Dockerfile`. Let’s have a look at the official HashiCorp Consul Dockerfile as an example. Well, it ain’t 10 lines of code, but it ain’t rocket science either. If we focus on the structure, it turns out to be fairly easy to understand:
- use base Alpine 3.12
- run some Linux commands
- … and … that’s it, really, sprinkled with some environment variables and labels
The contract is: given a clean installation of Alpine Linux 3.12, after executing the `RUN` commands, one can execute the `ENTRYPOINT` and have Consul up and running.
Not all `Dockerfiles` are that easy. There is a lot of software out there built with multi-stage builds. To keep it simple and easy to track, let’s look at the Kafka Proxy Dockerfile. Or Minio, for that matter.
We can find two `FROM` commands there. The first `FROM` defines the named stage; people often call it `builder`. Docker will basically build an image using every command until the next `FROM` and save it on disk.

The next stage, the one without `as ...`, let’s call it the `main` stage, is created again from the base operating system. In case of Kafka Proxy, it’s Alpine 3.12. In case of Minio, it is Red Hat UBI 8. The `main` stage can copy resources from the previous stages using the `COPY --from=$stage-name` command. When such a command is processed, Docker will reach into the first image it built and copy the selected resources into the `main` stage image. Clever and very effective.
The `builder` stage is essentially a cache. In both cases, it is a golang program that is compiled only once, and the `main` stage can be built quicker, assuming that the compiled output of the `builder` stage hasn’t changed.
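As a side note, the Moby / Docker Go client can build just one named stage of such a file on its own, via the `Target` build option. A rough sketch, with a made-up context path and tag:

```go
// Sketch only: build just the `builder` stage of a multi-stage Dockerfile
// using the Docker Go client. Context path and tag are placeholders.
package main

import (
	"context"
	"io"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
	"github.com/docker/docker/pkg/archive"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// The build context is streamed to the daemon as a tar archive.
	buildContext, err := archive.TarWithOptions("./kafka-proxy", &archive.TarOptions{})
	if err != nil {
		panic(err)
	}

	resp, err := cli.ImageBuild(ctx, buildContext, types.ImageBuildOptions{
		Dockerfile: "Dockerfile",
		Tags:       []string{"kafka-proxy-builder:temp"},
		Target:     "builder", // stop after the named stage
		Remove:     true,
	})
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The build runs while we consume the response stream.
	io.Copy(os.Stdout, resp.Body)
}
```

That is one way to materialize the `builder` stage as a regular image before reaching into it for the `COPY --from` resources.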
§we can build a VMM from a Dockerfile
It’s possible to take a `Dockerfile`, parse it, and apply all the relevant operations on a clean base operating system installation. Single-stage build files are easy. Multi-stage builds can be a little more complex. Let’s consider what the process might look like.
There are two types of artifacts:
- named stages serve as a resource cache
- the `rootfs`, the final build when all previous stages are built

There can be only one unnamed build stage in a `Dockerfile` and it will always be built last.
Named stages:
- given a `Dockerfile`, parse it using the BuildKit dockerfile parser (see the sketch after this list)
- find explicit stages delimited with the respective `FROM` commands
- every build stage with `FROM ... as ...` can be built as a Docker image using the Moby client
  - for such a build, remove the `as ...` part from the `FROM` command and save using a random image name
- build named stages as Docker images, no need to have a container
- for each stage:
  - export the image to a `tar` file
  - search the layers for all resources required by `COPY --from` commands; the layers are just `tar` files embedded in the main `tar` file
  - extract matched resources to a temporary build directory
  - remove the temporary image
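A minimal sketch of the parsing and stage discovery steps using the BuildKit dockerfile parser (the exact `instructions.Parse` signature may differ slightly between BuildKit versions):

```go
// Sketch only: discover build stages in a Dockerfile with the BuildKit parser.
package main

import (
	"fmt"
	"os"

	"github.com/moby/buildkit/frontend/dockerfile/instructions"
	"github.com/moby/buildkit/frontend/dockerfile/parser"
)

func main() {
	f, err := os.Open("Dockerfile")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Parse the raw Dockerfile into an AST...
	result, err := parser.Parse(f)
	if err != nil {
		panic(err)
	}

	// ...and turn the AST into typed stages and meta ARGs.
	stages, metaArgs, err := instructions.Parse(result.AST)
	if err != nil {
		panic(err)
	}

	fmt.Println("meta args:", len(metaArgs))
	for _, stage := range stages {
		// An empty Name means the unnamed, final stage (the rootfs build).
		fmt.Printf("stage %q based on %q with %d commands\n",
			stage.Name, stage.BaseName, len(stage.Commands))
	}
}
```

Each stage carries its list of commands, which is everything needed to build the named stages as regular images and, later, to replay the final stage on a clean root file system.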
Main build stage:
- requires an existing implementation of the underlying OS, think: `alpine:3.12` or `alpine:3.13`
  - this is the only part which has to be built by hand
- execute relevant `RUN` commands in order, pay attention to `ARG` and `ENV` commands such that the `RUN` commands are expanded correctly
- execute `ADD` / `COPY` commands, pay attention to the `--from` flag
- in both cases, keep track of `WORKDIR` and `USER` changes such that added / copied resources are placed under correct paths and commands are executed as the correct users in the correct locations (see the sketch after this list)
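A rough sketch of what that replay loop could look like; the actual execution inside the root file system is only printed here, and error handling is trimmed:

```go
// Sketch only: walk the commands of the final stage while tracking WORKDIR,
// USER and ENV; real execution inside the rootfs is left out and only printed.
package main

import (
	"fmt"
	"os"

	"github.com/moby/buildkit/frontend/dockerfile/instructions"
	"github.com/moby/buildkit/frontend/dockerfile/parser"
)

func applyStage(stage instructions.Stage) {
	workdir, user := "/", "root"
	env := map[string]string{}

	for _, cmd := range stage.Commands {
		switch c := cmd.(type) {
		case *instructions.WorkdirCommand:
			workdir = c.Path
		case *instructions.UserCommand:
			user = c.User
		case *instructions.EnvCommand:
			for _, kv := range c.Env {
				env[kv.Key] = kv.Value
			}
		case *instructions.RunCommand:
			// Would run the expanded command in the rootfs as `user`, in `workdir`.
			fmt.Printf("RUN %v (user=%s, workdir=%s, env=%d vars)\n", c.CmdLine, user, workdir, len(env))
		case *instructions.CopyCommand:
			// c.From is set for COPY --from=...; sources would be resolved
			// against the resources extracted from the named stages.
			fmt.Printf("COPY from=%q into workdir=%s\n", c.From, workdir)
		}
	}
}

func main() {
	f, err := os.Open("Dockerfile")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	result, err := parser.Parse(f)
	if err != nil {
		panic(err)
	}
	stages, _, err := instructions.Parse(result.AST)
	if err != nil {
		panic(err)
	}

	// The unnamed, final stage is the last one in the file.
	applyStage(stages[len(stages)-1])
}
```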
§why
By building the `rootfs` in this way, it is possible to infer additional information which is otherwise lost when copying the file system from a running container. For example:
- correctly set up a local service to automatically start the application on VMM boot (see the sketch after this list)
- start the application using the `uid` / `gid` defined in the `Dockerfile`
- infer a shell from the `SHELL` command
- extract otherwise missing metadata hiding in `LABEL` and `EXPOSE` commands
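To make the first point concrete, here is a rough sketch of rendering a simple OpenRC-style service file from the inferred values; the template, paths and values are illustrative only, not what any particular tool ships:

```go
// Sketch only: render a simple OpenRC-style service file from values inferred
// from the Dockerfile. The template, target path and values are illustrative.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type inferredConfig struct {
	Entrypoint []string
	Cmd        []string
	User       string
	Workdir    string
}

func writeServiceFile(rootfs string, cfg inferredConfig) error {
	command := strings.Join(append(cfg.Entrypoint, cfg.Cmd...), " ")
	service := fmt.Sprintf(`#!/sbin/openrc-run
command="%s"
command_user="%s"
directory="%s"
command_background=true
pidfile="/run/${RC_SVCNAME}.pid"
`, command, cfg.User, cfg.Workdir)

	target := filepath.Join(rootfs, "etc/init.d/app")
	return os.WriteFile(target, []byte(service), 0o755)
}

func main() {
	cfg := inferredConfig{
		Entrypoint: []string{"docker-entrypoint.sh"},
		Cmd:        []string{"agent", "-dev"},
		User:       "consul",
		Workdir:    "/",
	}
	if err := writeServiceFile("/mnt/rootfs", cfg); err != nil {
		panic(err)
	}
}
```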
Sounds doable. Does it make sense? Good question. Is the Docker image the right medium to source the Firecracker VMM root file system from? It gets us the first 98% of the work done, but the devil is in the details. The `Dockerfile` can get us all the way there.