Bringing NetBSD to Zig's Continuous Integration

Hello! If this is your first time reading this: I'm luna, one half of LavaTech, and we provide continuous integration (CI) services for the Zig programming language.

We started doing this because Zig could no longer use its pre-existing CI service for FreeBSD builds, sr.ht: the compiler became very RAM-intensive and hit the memory limits of their CI. We stepped in, and we now provide a self-hosted Sourcehut instance where FreeBSD builds can keep happening while the self-hosted compiler isn't finished yet. (Genuinely, I have no idea what will happen once the memory usage is brought down, but for now I'm happy to provide the service and wish to continue.)

On a sunny day with nothing much to do, I decided to bring better NetBSD support to Zig, and one of the areas where I could help is the one I already operate in: CI.

This blog post outlines the things I had to do to bring NetBSD to Zig CI.

Some background, or, How does CI work

Sourcehut is a “software forge”: a collection of tools to manage codebases, either publicly or privately. sr.ht is the main instance of Sourcehut, operated by its original creator, Drew DeVault.

builds.sr.ht, the CI service of Sourcehut, works by providing QEMU/KVM virtual machines that run specifically-crafted operating system images; your project is built and the resulting artifacts are tested inside the virtual machine, all driven over SSH into the VM. Once all the commands have run without errors, the build has “passed”. An example of builds.sr.ht on the main instance can be seen here.
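To give a feel for what a job looks like, here is a minimal build manifest of the kind builds.sr.ht consumes. This is an illustrative sketch, not Zig's actual manifest: the image name, package list, repository URL and commands are placeholders.

    image: netbsd/latest
    packages:
      - cmake
    sources:
      - https://git.sr.ht/~someuser/some-project
    tasks:
      - build: |
          cd some-project
          mkdir build && cd build
          cmake ..
          make
      - test: |
          cd some-project/build
          ctest

Each task runs over SSH inside the freshly booted VM, and the job passes once every task exits successfully.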

An outline of the available VM images can be seen on the compatibility page of builds.sr.ht; there you can find Alpine Linux, Arch Linux, FreeBSD, among others (hell, even 9front!).

NetBSD on sr.ht

Back when I was developing support for it, the compatibility page did not mention NetBSD; thanks to Michael Forney, it now does!

The history for it is as follows:

– The initial versions of the NetBSD image were created by Drew DeVault; however, they weren't really finished.
– I began working on finishing it.
– Michael Forney also began working on it, in parallel. Their patch has been successfully upstreamed!
– Even though we worked in parallel, our changes are mostly equivalent, so there isn't a need to upstream mine:
  – Replacing anita (the automated NetBSD installer) with directly downloading the binary sets and partitioning the virtual disk. This removes the QEMU requirement from the build script, because I am building the image inside a NetBSD VM itself and wasn't sure how to get nested virtualization working there (see the sketch after this list).
  – Replacing pkgsrc source builds with downloading binary packages via pkgin.
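Very roughly, the set-based approach boils down to something like the sketch below. This is not the actual image build script; the mirror URL, release version, set list and package names are all illustrative.

    # unpack the NetBSD binary sets onto the (already partitioned and
    # mounted) virtual disk instead of driving the installer through anita
    MIRROR=https://cdn.netbsd.org/pub/NetBSD/NetBSD-9.2/amd64/binary/sets
    TARGET=/mnt   # the mounted target disk

    for set in base etc comp kern-GENERIC; do
        ftp -o - "$MIRROR/$set.tar.xz" | xz -d | tar -xpf - -C "$TARGET"
    done

    # with pkgin bootstrapped inside the target (e.g. via pkg_add), pull
    # binary packages instead of compiling them from pkgsrc
    chroot "$TARGET" pkgin -y install cmake ninja-build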

Here is a successful test build with the image!

Preparing the CI

Zig CI does not depend on the upstream platforms' LLVM builds. Instead, a self-built LLVM is created via the zig-bootstrap project. With it, you can go from a C compiler to a fully functioning Zig compiler for any architecture. It achieves this via four large compile steps (with the relevant names I will use for them throughout this blog post):

– Compiling LLVM for the host system (“llvm-host”).
– Compiling Zig for the host system (“zig-host”).
– Compiling LLVM for the target system (“llvm-crosscompiled”).
– Compiling Zig for the target system (“zig-crosscompiled”).
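In practice, running zig-bootstrap looks roughly like this; I'm assuming its usual ./build entry point, and the exact arguments and output paths may differ between versions:

    # clone zig-bootstrap and build a full toolchain for a target;
    # one invocation runs all four steps (llvm-host, zig-host,
    # llvm-crosscompiled, zig-crosscompiled)
    git clone https://github.com/ziglang/zig-bootstrap
    cd zig-bootstrap
    ./build aarch64-linux-musl baseline

    # the cross-compiled zig toolchain ends up under out/
    ls out/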

With, for example, the zig-bootstrap result for aarch64-linux-musl, you can run an ARM64 Linux CI where zig-crosscompiled compiles new commits of Zig. This is possible because Zig is a C compiler as well. Here's an example of such artifacts being used in a CI script: the Zig Linux CI script on Azure (see $CACHE_BASENAME).
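The shape of that usage is roughly the following. This is a hedged sketch, not the actual Azure script: the tarball name, URL and CMake flags are illustrative.

    # fetch the prebuilt zig-bootstrap artifact for the target
    CACHE_BASENAME=zig+llvm+lld+clang-aarch64-linux-musl-example   # hypothetical name
    curl -L -O "https://ziglang.org/deps/$CACHE_BASENAME.tar.xz"
    tar -xJf "$CACHE_BASENAME.tar.xz"
    PREFIX="$PWD/$CACHE_BASENAME"

    # build the freshly pushed Zig commit, using the prebuilt zig as the
    # C/C++ compiler for anything that needs one
    mkdir build && cd build
    cmake .. \
      -DCMAKE_PREFIX_PATH="$PREFIX" \
      -DCMAKE_C_COMPILER="$PREFIX/bin/zig;cc" \
      -DCMAKE_CXX_COMPILER="$PREFIX/bin/zig;c++"
    make install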

For NetBSD, the process shouldn't be any different (FreeBSD follows it as well), so what we needed to do was get zig-bootstrap running on it. This took a couple of days, with the help of andrewrk and washbear, and long, arduous stretches of cycling through “start the build, find something to do, mentally switch back to fixing the build”. It all paid off, and I generated fantastic quickfixes in the end! Here they all are!

The submitted PR is more of a place to discuss what can be done about my workarounds than an attempt to put all of them into zig-bootstrap (because I am pretty sure some of those fixes would break... every other target, lol).

Spinning up new infrastructure for CI

builds.sr.ht has the concept of build runners for jobs, so you can dynamically add more capacity for CI as needed. Each runner can have N workers, and those workers are what actually run the QEMU VMs.

When we started with FreeBSD, we learned that a single worker requires 16GB of RAM (the compile itself needs approximately 8GB, but there are a lot of build artifacts, and the VM's filesystem is composed of both the image and a growing temporary ramdisk). As time passed, we also learned that 4 build workers can handle the throughput Zig needs for FreeBSD (true as of July 2021).

To add NetBSD support to the mix, we would have to add 4 more build workers to the network, which means finding 64GB of available RAM somewhere in our infrastructure. LavaTech operates in a hybrid cloud model: there are VPSes in various cloud providers, but most of our services run on our own colocated hardware, shared with generalprogramming. We are able to do so because of Hurricane Electric's free colo deal. Also, peer with us!

In general, our rack has two power supply units, one big momma computer (which we kindly hostnamed “laserjet”; it was close to being “HP-LaserJet-Enterprise-MFP”), and a bunch of blades.

There is no guarantee that this list (or the infrastructure mentioned in this article) will be kept up to date as time passes.

And this is the build runner allocation before NetBSD work:

– runner1: Laserjet VM (2 workers)
– runner2: Laserjet VM (2 workers)

After talking to our friends at generalprogramming, we were able to allocate two more runners:

– runner3: GP Blade (2 workers)
– runner4: GP Blade (2 workers)

However, they became unstable as time passed, and we decided to rent a server which would take care of both FreeBSD and NetBSD:

– bigrunner: Hetzner Rented Server (8 workers)

runner1 and runner2 were decommissioned to decrease load on Laserjet. runner3 and runner4 were decommissioned because of their instability.

One infrastructure note to keep in mind: we aren't using all of the available blades in the rack because of power constraints. Each PSU can handle at most 20A, and we have two of them so that we can stay redundant. Even though that gives a theoretical maximum of 40A, we must only draw 20A so that a single PSU can carry the whole load, and we were already at that limit. Foreshadowing: a recent failure in the datacenter brought down one of the PSUs, and our router was not plugged into both, causing an outage.

A collection of operational problems that happened in the build network while getting NetBSD CI up

Accidental Ramdisk in Production (July 12th 2021)

The issue was first identified when jobs finished as successful with “Connection to localhost closed by remote host”. That is bad (they should really fail), but we knew it was related to the OOM killer, as we had seen it before.

We noticed that RAM usage on the affected runners sat at 50% even though no CI jobs were running on them. Before deciding to reboot, I first suspected some cursed I/O caching eating RAM (checking free output showed it wasn't), then checked df and found that the rootfs was a tmpfs, not something like ext4.
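Roughly the checks that gave it away; the numbers below are illustrative, not the real output:

    free -m     # buffers/cache were small, so it wasn't the page cache
    df -h /     # the root filesystem showed up as tmpfs instead of a real disk
    # Filesystem    Size  Used Avail Use% Mounted on
    # tmpfs         7.8G  3.9G  3.9G  50% /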

Turns out I had accidentally left those runners running off the Alpine live CD ramdisk: I never ran setup-alpine. The motd you get when entering a shell tells you to run it; I don't know how I missed it.

After backing up the important data (build logs), I was able to recreate the runners properly. Since I knew more runners were bound to be created in the future, I wrote a pyinfra script to set everything up beforehand, and became a fangirl of pyinfra afterwards. Oh well, it happens.

Network Timeouts (July 13th 2021 and beyond)

Sometimes connectivity to the builds service simply times out with curl: (7) Failed to connect to builds.hut.lavatech.top port 443: Host is unreachable.

There is high packet loss, possibly caused by one of our network upstreams, but I haven't been able to dig deeper into this issue. Hopefully it stays infrequent.

Docker networking mishaps (July 13th 2021)

Jobs simply didn't start, failing with the following error:

    docker: Error response from daemon: Conflict. The container name "/builds_job_unknown_1626160482" is already in use by container "90c8a6c874840755e9b75db5b3d9b223fbc00ba0a540b60f43d78378efc4376a". You have to remove (or rename) that container to be able to reuse that name.

The name of the Docker container running QEMU is decided in the control script; here's the relevant snippet of code:

        --name "builds_job_${BUILD_JOB_ID:-unknown_$(date +"%s")}" \

The script that actually boots the VM containing a CI job is the control script, usually found at /var/lib/images/control (when using the Alpine Linux Sourcehut packages, at least). The Sourcehut documentation says that the user running the worker should be locked down to only being able to run that script, so that vulnerabilities in the worker process don't lead to privilege escalation. That is done via the doas utility, an alternative to sudo.

A missing option in the doas.conf file made the control script run without any environment variables; since BUILD_JOB_ID was missing, the name fell back to date +"%s".
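For reference, the fix amounts to letting the environment through in doas.conf. This is a hedged sketch; the user name and exact rule are illustrative, the important part is keepenv (or an explicit setenv block) so that BUILD_JOB_ID survives:

    # /etc/doas.conf on the runner (illustrative user name)
    permit nopass keepenv builds as root cmd /var/lib/images/control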

This issue didn't arise until NetBSD builds were merged into Zig. Before that, we only spawned a single job per PR: the FreeBSD one. After adding NetBSD, we started spawning two jobs per PR, and since they started within the same second, the fallback container names collided: one of them would work and the other would crash with the aforementioned error.

VM Settling times out (July 28th 2021)

Some VMs were timing out during their settling phase, but only when the machine was under load (such as after a merge spree in the Zig project). When running a CI job, a new VM is created from the specific OS image, and then a “settling” task is run against it: it attempts to SSH into the virtual machine and run echo "hello world", and if that works, the rest of the commands are sent (cloning the repo, running the scripts declared in the YAML manifest, etc.).
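Conceptually, settling amounts to something like the loop below. This is a simplified sketch rather than the actual worker code; the SSH port, user and retry counts are illustrative.

    # keep trying to SSH into the freshly booted VM until it answers
    for attempt in $(seq 1 60); do
        if ssh -p 8022 -o ConnectTimeout=5 build@localhost 'echo "hello world"'; then
            break   # VM is up; the real job commands get sent next
        fi
        sleep 5
    done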

We kept missing the window: by the time I connected to the runner, it was already fine and stable again. But on the date mentioned in the heading we managed to catch it in real time, and saw load numbers close to 16, even though the runner has 8 cores in total.

Turns out each CI VM has 2 cores allocated to it, and that's hardcoded in the control script.

Current solution: edit the control script to make the VMs single-core.
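The edit itself is tiny. The snippet below is a hedged illustration rather than the real control script (which has many more flags); the relevant part is the hardcoded core count handed to QEMU:

    # was: -smp cpus=2
    exec qemu-system-x86_64 \
        -m 16384 \
        -smp cpus=1 \
        -drive file="$disk_image",format=qcow2,snapshot=on \
        -nographic

With 8 workers each pinned to a single core, the box no longer ends up with a load of 16 on 8 cores.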

Possible future solution: buy more servers.

What next?