Post-mortem: missing edge packages on mirrors

On 2024-10-25 at 18:55 UTC we were notified via IRC that package repositories were missing for the edge branch. Investigation confirmed that the repositories for all architectures except riscv64 were missing. The repositories for all stable releases were still present.

screenshot of edge repository, showing only riscv64

Checking the dl-master mirror, the repositories there were missing as well. The next step was to check whether the builders themselves still contained all the packages, which fortunately was the case.

To give us time to investigate what happened and prevent any more potential issues, all edge builders were stopped.

Mirror infrastructure

To understand what happened, let us explain how the mirror infrastructure for Alpine Linux works.

Each architecture and each release has a dedicated builder. A builder keeps a complete repository of all the packages it built. After it completes building a repository, it synchronizes the local repository with dl-master.alpinelinux.org, a single server which is tier 0 in our mirror infrastructure.

Next we have three geographically distributed tier 1 servers, which synchronize with dl-master. These tier 1 servers also act as a backend for dl-cdn.alpinelinux.org and rsync.alpinelinux.org, from which all other mirrors synchronize our repositories.

diagram showing the various components involved in the mirror infrastructure

So once files are added to or removed from dl-master, that change automatically and quickly replicates out to the other mirrors.

The culprit

Since this affected just a single release (edge), we could quickly rule out a damaged file system. Checking dmesg output on the server did not reveal anything concerning either.

A clue to what had happened was quickly found in our #alpine-commits IRC channel, which logs updates to the aports git repository as well as updates from the builders when they finish building a repository.

Shortly before 18:45 UTC, these updates were reported:

2024-10-25 18:44:19     algitbot        edge/main/: uploaded
2024-10-25 18:44:36     algitbot        edge/community/: uploaded
2024-10-25 18:44:49     algitbot        edge/testing/: uploaded

If you compare these messages to a regular upload message, you'll notice the difference:

2024-10-25 18:37:10     algitbot        edge/community/x86_64: uploaded

The architecture was missing.

The backbone of all builders is a script called aports-build, which is responsible for making sure all packages and releases are built and uploaded. The script is also what sends messages to MQTT, which then get logged to IRC:

mosquitto_pub -h $mqtt_broker -t rsync/$upload_host/$rel/$arch -m "$rel/releases/$arch"

The fact that the architecture was missing in the message from algitbot means that the variable $arch was empty.

This variable is only assigned once in the script, at the start:

arch=$(abuild -A)
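
An assignment of this form does not stop a script by itself when the command fails. The following is a minimal sketch, not code from aports-build: failing_cmd is a hypothetical stand-in for a failing abuild -A, and it assumes the script neither runs under set -e nor checks the exit status of the assignment.

```shell
#!/bin/sh
# Hypothetical stand-in for a failing `abuild -A`: no output, non-zero exit.
failing_cmd() {
    return 1
}

status=0
# The assignment takes the exit status of the command substitution,
# so the failure is only visible if that status is inspected.
arch=$(failing_cmd) || status=$?

# Execution continues here with an empty variable.
echo "arch='$arch' status=$status"
```

In this sketch the variable ends up empty and the script carries on, so every later use of $arch silently expands to nothing.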

So $arch being empty means that the call to abuild -A must have failed for some reason. Why did this cause all the repositories to be removed? The $arch variable is used in the rsync command to upload all packages for a repository:

rsync --recursive \
    --update \
    --itemize-changes \
    --delete-delay \
    --delay-updates \
    --mkpath \
    $rsync_opts \
    $repo/$arch $i/$repo/ > /tmp/upload-$repo

In this case, $arch determines the source directory to be uploaded. This is what the directory structure on a builder looks like:

main
└── x86_64

On dl-master, it looks like this:

main
├── aarch64
├── armhf
├── armv7
├── loongarch64
├── ppc64le
├── riscv64
├── s390x
├── x86
└── x86_64

So under normal circumstances, a builder uploads, for example, its local main/x86_64 to main/x86_64 on dl-master.

When $arch is empty, the source becomes main/ instead, so rsync synchronizes main/ on the builder with main/ on dl-master. Because the command deletes files on the receiving side that do not exist on the sender (--delete-delay), all directories except the one for the builder's own architecture are removed.
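
The change in paths can be reproduced without running rsync. This is a minimal sketch: the function name rsync_paths and the destination value dest are hypothetical, and only the $repo/$arch and $i/$repo/ shapes are taken from the command above.

```shell
#!/bin/sh
# Hypothetical helper: prints the source and destination arguments that
# the rsync call would receive for a given repo, arch and upload target.
rsync_paths() {
    repo=$1
    arch=$2
    i=$3
    printf '%s -> %s\n' "$repo/$arch" "$i/$repo/"
}

rsync_paths main x86_64 dest   # prints: main/x86_64 -> dest/main/
rsync_paths main ""     dest   # prints: main/ -> dest/main/
```

With a trailing slash on the source, rsync transfers the contents of that directory rather than the directory itself, so the second form makes the whole of main/ on the destination subject to deletion.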

At this point in time, we do not know exactly what caused $arch to be empty.

Restoring

Since all the repositories were still present on the builders, restoring them was trivial. Once we had an idea of what had caused the issue, we started synchronizing all the repositories from the builders to dl-master again, after which the t1 servers could start synchronizing all repositories.

At 05:45 UTC on 2024-10-26, all repositories were available again on our t1 mirrors.

Mitigation

To prevent this specific issue from happening in the future, we hardened aports-build to stop and log the event when $arch is empty.
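
A guard of this kind could look roughly like the sketch below. It is not the actual change made to aports-build: the function name require_arch and the log message are hypothetical, and the commented usage assumes the assignment shown earlier.

```shell
#!/bin/sh
# Hypothetical guard: logs to stderr and returns non-zero when the
# architecture string is empty, so the caller can stop before uploading.
require_arch() {
    if [ -z "$1" ]; then
        echo "aports-build: arch is empty, refusing to continue" >&2
        return 1
    fi
    return 0
}

# Assumed usage at the start of the script:
#   arch=$(abuild -A) || exit 1
#   require_arch "$arch" || exit 1
```

Checking both the exit status of the command and the resulting value covers the case where the command succeeds but prints nothing.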

We decided not to make any further changes at this time, to avoid introducing new issues.

Although the current build infrastructure has served us well for years, it also has its limitations. That’s why we already had ideas about redesigning the build architecture. For now, we do not have any concrete plans, and in the meantime we will not be making any larger changes to the current infrastructure.