Post-mortem: missing edge packages on mirrors
On 2024-10-25 at 18:55 UTC we were notified via IRC that package repositories were missing for the edge branch. Investigation confirmed that the repositories for every architecture except riscv64 were missing. The repositories for all stable releases were still present.
A check of the dl-master mirror showed that the repositories were missing there as well. The next step was to check whether the builders themselves still contained all the packages, which fortunately was the case.
To give us time to investigate what happened and prevent any more potential issues, all edge builders were stopped.
Mirror infrastructure
To understand what happened, let us explain how the mirror infrastructure for Alpine Linux works.
Each architecture and each release has a dedicated builder. A builder keeps a complete repository of all the packages it built. After it completes building a repository, it synchronizes the local repository with dl-master.alpinelinux.org, a single server which is tier 0 in our mirror infrastructure.
Next, we have three geographically distributed tier 1 servers, which synchronize with dl-master. These tier 1 servers also act as the backend for dl-cdn.alpinelinux.org and rsync.alpinelinux.org, which all other mirrors use to synchronize our repositories.
So once files are added to or removed from dl-master, that change automatically and quickly replicates out to the other mirrors.
The culprit
Since this affected just a single release (edge), we could quickly rule out a damaged file system. Checking the dmesg output on the server did not reveal anything concerning either.
A clue to what had happened was quickly found in our #alpine-commits IRC channel, where updates to the aports git repository are logged, as well as updates from the builders when they finish building a repository.
Shortly before 18:45 UTC, these updates were reported:
2024-10-25 18:44:19 algitbot edge/main/: uploaded
2024-10-25 18:44:36 algitbot edge/community/: uploaded
2024-10-25 18:44:49 algitbot edge/testing/: uploaded
If you compare these messages to other messages, you'll notice the difference:
2024-10-25 18:37:10 algitbot edge/community/x86_64: uploaded
The architecture was missing.
The backbone of all builders is a script called aports-build, which is responsible for making sure all packages and releases are built and uploaded. The script is also what sends messages to MQTT, which are then logged to IRC:
mosquitto_pub -h $mqtt_broker -t rsync/$upload_host/$rel/$arch -m "$rel/releases/$arch"
The fact that the architecture was missing in the message from algitbot means that the variable $arch was empty.
This variable is assigned only once in the script, at startup:
arch=$(abuild -A)
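To illustrate how an assignment like this can leave a variable empty, here is a minimal sketch. It is an illustration only, not an explanation of what actually happened on the builder: failing_abuild is a hypothetical stand-in for a failing abuild -A call. When a command substitution fails without printing anything, the variable ends up empty, and the failure is only visible through the exit status of the assignment.

```shell
#!/bin/sh
# Hypothetical stand-in for a failing "abuild -A": prints nothing, exits 1.
failing_abuild() {
    return 1
}

status=0
# The assignment takes the exit status of the command substitution.
# A script that does not check it simply continues with an empty $arch.
arch=$(failing_abuild) || status=$?

echo "status=$status arch='$arch'"   # prints: status=1 arch=''
```

Unless the script checks that status or the resulting value, every later use of the variable silently expands to an empty string.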
So $arch being empty means that the call to abuild -A must have failed for some reason.
Why did this cause all the repositories to be removed? The $arch variable is used in the rsync command that uploads all packages for a repository:
rsync --recursive \
--update \
--itemize-changes \
--delete-delay \
--delay-updates \
--mkpath \
$rsync_opts \
$repo/$arch $i/$repo/ > /tmp/upload-$repo
Here, $arch determines the source directory to be uploaded. This is what the directory structure on a builder looks like:
└── main
    └── x86_64
On dl-master, it looks like this:
main
├── aarch64
├── armhf
├── armv7
├── loongarch64
├── ppc64le
├── riscv64
├── s390x
├── x86
└── x86_64
So under normal circumstances, it would upload main/x86_64, for example, from the builder to main/x86_64 on dl-master.
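When $arch is empty, however, the source becomes main/, so rsync synchronizes main/ on the builder with main/ on dl-master. Because of --delete-delay, every directory that does not exist on the builder is then removed from dl-master, leaving only the directory for the builder's own architecture.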
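The effect of an empty $arch on the source path can be shown with a minimal sketch. The helper upload_source is hypothetical and only mirrors how the rsync command above joins $repo and $arch. It does not call rsync.

```shell
#!/bin/sh
# Hypothetical helper: builds the rsync source path the same way the
# upload command does, by joining $repo and $arch with a slash.
upload_source() {
    printf '%s\n' "$1/$2"
}

upload_source main x86_64   # prints: main/x86_64
upload_source main ""       # prints: main/
```

For rsync, a source path with a trailing slash means "the contents of this directory", so main/ covers every architecture directory on the receiving side instead of a single one.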
At this point in time, we do not know exactly what caused $arch to be empty.
Restoring
Since all the repositories were still present on the builders, restoring them was trivial. Once we had an idea of what caused the issue, we started synchronizing all the repositories from the builders to dl-master again, after which the tier 1 servers could start synchronizing all repositories.
At 05:45 UTC on 2024-10-26, all repositories were available again on our tier 1 mirrors.
Mitigation
To prevent this specific issue from happening in the future, we hardened aports-build to stop when $arch is empty and to log when this situation happens.
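The actual change to aports-build is not reproduced here. A guard of this kind might look roughly like the following sketch, in which the function name require_arch and the log message are hypothetical.

```shell
#!/bin/sh
# Hypothetical guard, not the actual aports-build change.
# Logs to stderr and returns non-zero when the given architecture is empty.
require_arch() {
    if [ -z "$1" ]; then
        echo "aports-build: architecture is empty, refusing to continue" >&2
        return 1
    fi
}

# Possible use near the start of a build script:
#   arch=$(abuild -A) || exit 1
#   require_arch "$arch" || exit 1
```

Checking both the exit status of the command substitution and the resulting value covers the case where the command fails as well as the case where it succeeds but prints nothing.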
We decided not to make any further changes at this time, to avoid introducing new issues.
Although the current build infrastructure has served us well for years, it also has its limitations. That’s why we already had ideas about redesigning the build architecture. We do not have any concrete plans yet, but in the meantime we will not be making any larger changes to the current infrastructure.