Thoughts on creating VMMs from Docker images

docker firecracker microvm

Dockerfiles are awesome

There is so much software out there packaged as Docker images: operating systems, SQL and NoSQL databases, reverse proxies, compilers, everything. It's safe to say that most of the software available as Docker containers is built from a common file format: the Dockerfile. Dockerfiles are awesome. They are recipes for getting a piece of software up and running.

how I have been building Firecracker VMMs so far

So far, all of my VMMs were built from Docker images using the following steps:

  • pull / build a Docker image
  • start a container with an additional volume, where the host directory backing the volume has an ext4 file system mounted on it
  • copy the operating system directories I needed to this other volume
  • stop the container
  • use the resulting ext4 file system as the root VMM volume

Here’s an example.
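The flow might look roughly like this. This is a hypothetical sketch, not my exact script: the image name, size, and paths are placeholders, and it needs Docker plus root privileges for the mount:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholders -- adjust to taste:
IMAGE=alpine:3.12
ROOTFS=rootfs.ext4
MOUNT=/tmp/rootfs-mount

# 1. Create and format an ext4 file that will become the root VMM volume.
dd if=/dev/zero of="$ROOTFS" bs=1M count=500
mkfs.ext4 "$ROOTFS"
mkdir -p "$MOUNT"
sudo mount "$ROOTFS" "$MOUNT"

# 2. Pull the image and start a container with the mounted ext4 as a volume.
docker pull "$IMAGE"
CONTAINER=$(docker run -d -v "$MOUNT":/rootfs-export "$IMAGE" sleep 300)

# 3. Copy the operating system directories into the volume.
docker exec "$CONTAINER" sh -c \
  'for d in bin etc lib root sbin usr; do tar c "/$d" | tar x -C /rootfs-export; done; \
   for d in dev proc run sys var tmp; do mkdir -p "/rootfs-export/$d"; done'

# 4. Stop the container and unmount; rootfs.ext4 is now the root VMM volume.
docker stop "$CONTAINER" && docker rm "$CONTAINER"
sudo umount "$MOUNT"
```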

There isn't much wrong with this process, and it works surprisingly well for the majority of the containers out there. This is also how Weaveworks Ignite works, it is what the official Firecracker documentation suggests, and it is what many write-ups on Firecracker describe.

under the magnifier

The above approach gets us the first 98% of the work done. It's okay. But certain important details are missing.

The most obvious one: after the conversion by copying, we lose the ENTRYPOINT information. A Docker image provides two commands, ENTRYPOINT and CMD. Together they tell the container which program to run, and with which arguments, when the container starts. Without the ENTRYPOINT, and optionally the CMD, the resulting VMM will start but it will not execute anything. We have to somehow modify the file system post-copy and add a command to start what we want started.
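In Docker's semantics, CMD supplies default arguments to ENTRYPOINT, or becomes the command itself when ENTRYPOINT is absent. The command a VMM would need to launch at boot can be sketched like this (a simplification that covers only the exec form of both instructions, not the shell form):

```python
def boot_command(entrypoint, cmd):
    """Combine exec-form ENTRYPOINT and CMD the way Docker does:
    CMD is the command when ENTRYPOINT is absent, otherwise it is
    appended to ENTRYPOINT as default arguments."""
    return list(entrypoint or []) + list(cmd or [])

# ENTRYPOINT ["docker-entrypoint.sh"] plus CMD ["agent", "-dev"]:
print(boot_command(["docker-entrypoint.sh"], ["agent", "-dev"]))
# CMD only, no ENTRYPOINT:
print(boot_command(None, ["redis-server"]))
```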

In my previous write-ups, I was adding a local service definition to the VMM during the copy stage. That works, but the problem with this approach is that virtually every image out there has its own dedicated configuration. Even if we could assume that 99% of all Docker images use docker-entrypoint.sh as a conventional ENTRYPOINT, the CMD is going to differ.

Then, there are additional parameters affecting the ENTRYPOINT: the WORKDIR and USER commands, the SHELL command, build arguments and environment variables.

By just copying the container file system, yes, we do get the final product. However, we are losing a lot of context and insight into what the result really is when it is to be started.

Finally, plenty of containers come with additional labels and exposed ports information. All that information is lost if we are not correlating the file system copy against the original Dockerfile.

For sure, the manual build is doable but it won’t scale.

anatomy of a Dockerfile

Back to the Dockerfile. Let's have a look at the official HashiCorp Consul Dockerfile as an example. Well, it ain't 10 lines of code, but it ain't rocket science either. If we focus on the structure, it turns out to be fairly easy to understand:

  • use base Alpine 3.12
  • run some Linux commands
  • … and that's it, really, sprinkled with some environment variables and labels

The contract is: given a clean installation of Alpine Linux 3.12, after executing the RUN commands, one can execute the ENTRYPOINT and have Consul up and running.
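Structurally, such a single-stage Dockerfile boils down to something like this. Note this is an abbreviated, hypothetical sketch, not the real Consul file, and the version, URL, and label values are placeholders:

```dockerfile
FROM alpine:3.12

ENV CONSUL_VERSION=1.9.0
LABEL maintainer="..."

# a handful of RUN commands: add a user, fetch and unpack the binary, ...
RUN addgroup consul && adduser -S -G consul consul
RUN apk add --no-cache ca-certificates curl \
    && curl -fsSL ... -o /tmp/consul.zip \
    && unzip /tmp/consul.zip -d /bin

EXPOSE 8500 8600 8600/udp
ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["agent", "-dev"]
```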

Not all Dockerfiles are that easy. A lot of software out there is built with multi-stage builds. To keep it simple and easy to track, let's look at the Kafka Proxy Dockerfile. Or Minio, for that matter.

We can find two FROM commands there. The first FROM defines a named stage; people often call it builder. Docker will basically build an image using every command until the next FROM and save it on disk.

The next stage, the one without as ..., let's call it the main stage, is created again from the base operating system. In the case of Kafka Proxy, it's Alpine 3.12. In the case of Minio, it is Red Hat UBI 8. The main stage can copy resources from the previous stages using the COPY --from=$stage-name command. When such a command is processed, Docker will reach into the first image it built and copy the selected resources into the main stage image. Clever and very effective.

The builder stage is essentially a cache. In both cases, it is a Go program that is compiled only once, so the main stage can be rebuilt quicker, assuming that the compiled output of the builder stage hasn't changed.
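The multi-stage shape, again as a hypothetical sketch rather than either project's real file:

```dockerfile
# Stage 1, the "builder": compile once, cache the result.
FROM golang:1.15-alpine as builder
WORKDIR /src
COPY . .
RUN go build -o /out/kafka-proxy .

# Stage 2, the main stage: start from the base OS again
# and reach into the builder for the compiled binary.
FROM alpine:3.12
COPY --from=builder /out/kafka-proxy /usr/local/bin/kafka-proxy
ENTRYPOINT ["/usr/local/bin/kafka-proxy"]
CMD ["server"]
```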

we can build a VMM from a Dockerfile

It’s possible to take a Dockerfile, parse it and apply all the relevant operations on a clean base operating system installation. The single stage build files are easy. Multi-stage builds can be a little more complex. Let’s consider what the process might look like.

There are two types of artifacts:

  • named stages, which serve as a resource cache
  • the rootfs, the final build produced once all previous stages are built

There can be only one unnamed build stage in a Dockerfile and it will always be built last.

Named stages:

  • given a Dockerfile, parse it using the BuildKit dockerfile parser
  • find explicit stages delimited with the respective FROM commands
  • every build stage with FROM ... as ... can be built as a Docker image using the Moby client
    • for such build, remove the as ... part from the FROM command and save using a random image name
  • build named stages as Docker images, no need to have a container
    • for each stage
      • export the image to a tar file
      • search the layers for all resources required by COPY --from commands; the layers are just tar files embedded in the main tar file
    • extract matched resources to a temporary build directory
    • remove temporary image
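A toy version of the stage discovery might look like this. It uses naive line splitting instead of the BuildKit parser and ignores line continuations, comments, and stage references by index:

```python
def find_stages(dockerfile_text):
    """Return ([named stage names in order], {stage names referenced
    by COPY --from / ADD --from}) from a Dockerfile's text."""
    names, refs = [], set()
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        inst = parts[0].upper()
        if inst == "FROM" and len(parts) >= 4 and parts[2].lower() == "as":
            names.append(parts[3])
        elif inst in ("COPY", "ADD"):
            for flag in parts[1:]:
                if flag.lower().startswith("--from="):
                    refs.add(flag.split("=", 1)[1])
    return names, refs

example = """
FROM golang:1.15-alpine as builder
RUN go build -o /out/app .

FROM alpine:3.12
COPY --from=builder /out/app /usr/local/bin/app
"""
print(find_stages(example))
```

Every name in the second set has to be resolved against a previously built stage image before the main stage can be assembled.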

Main build stage:

  • requires an existing base root file system for the underlying OS, think: alpine:3.12 or alpine:3.13
    • this is the only part which has to be built by hand
  • execute relevant RUN commands in order, pay attention to ARG and ENV commands such that the RUN commands are expanded correctly
  • execute ADD / COPY commands, pay attention to the --from flag
  • in both cases, keep track of WORKDIR and USER changes such that added / copied resources are placed under correct paths and commands are executed as correct users in correct locations
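The bookkeeping for the main stage could be sketched like this. It is a simplification that handles only ${VAR}-style expansion and only the ENV, ARG, WORKDIR, USER and RUN commands:

```python
import re

def plan_run_commands(instructions):
    """Walk (command, value) pairs in Dockerfile order and return the
    RUN commands expanded against the ENV / ARG values seen so far,
    each paired with the WORKDIR and USER in effect at that point."""
    env, workdir, user, plan = {}, "/", "root", []
    for inst, value in instructions:
        if inst in ("ENV", "ARG"):
            key, _, val = value.partition("=")
            env[key] = val
        elif inst == "WORKDIR":
            workdir = value
        elif inst == "USER":
            user = value
        elif inst == "RUN":
            expanded = re.sub(r"\$\{(\w+)\}",
                              lambda m: env.get(m.group(1), ""), value)
            plan.append((user, workdir, expanded))
    return plan

plan = plan_run_commands([
    ("ARG", "VERSION=1.9.0"),
    ("WORKDIR", "/opt"),
    ("USER", "consul"),
    ("RUN", "echo installing consul ${VERSION}"),
])
print(plan)
```

Each tuple in the plan says: as this user, in this directory, run this expanded command inside the base root file system.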

why

By building the rootfs in this way, it is possible to infer additional information which is otherwise lost when copying the file system from a running container. For example:

  • correctly set up a local service to automatically start the application on VMM boot
  • start the application as the user with the uid / gid defined in the Dockerfile
  • infer a shell from the SHELL command
  • extract otherwise missing metadata hiding in LABEL and EXPOSE commands
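For instance, the metadata that a plain file-system copy throws away could be collected along the way. A sketch, again with naive line splitting where a real implementation would use the BuildKit parser:

```python
def collect_metadata(dockerfile_text):
    """Pick out the commands that matter for running the result as a
    VMM: boot command, user, shell, exposed ports, labels."""
    meta = {"expose": [], "labels": {}}
    for line in dockerfile_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        inst, rest = parts[0].upper(), parts[1]
        if inst == "EXPOSE":
            meta["expose"].extend(rest.split())
        elif inst == "LABEL":
            key, _, val = rest.partition("=")
            meta["labels"][key] = val.strip('"')
        elif inst in ("USER", "SHELL", "ENTRYPOINT", "CMD"):
            meta[inst.lower()] = rest
    return meta

meta = collect_metadata("""
EXPOSE 8500 8600/udp
LABEL maintainer="example@example.com"
USER consul
ENTRYPOINT ["docker-entrypoint.sh"]
""")
print(meta)
```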

Sounds doable. Does it make sense? Good question. Is the Docker image the right medium to source the Firecracker VMM root file system from? Copying the file system gets us the first 98% of the work done, but the devil is in the details. The Dockerfile can get us all the way there.