
Hi everyone,

As you may have noticed, yesterday was marked by rather consistent CI failures. I believe I have now fixed the root cause, but do let me know if you see any further issues.

Below I have included a brief post-mortem describing the incident and what we plan to do to prevent it from recurring in the future.

Cheers,

- Ben

# Post-mortem

Early Friday morning our storage provider experienced a hiccup which rendered the volume backing our GitLab repositories unavailable for a few minutes. The interruption was long enough that the filesystem remounted as read-only, causing a small amount of filesystem damage affecting a handful of objects in the ghc/perf-notes repository. This resulted in CI failures when attempts to `git push` to that repository failed.

To address this I started by ensuring we had an up-to-date backup of our data and began sorting through the various observed errors. Once I had established that the storage volume had been interrupted, I went looking for additional corruption. An integrity check of all GitLab repositories revealed no further damage. This isn't terribly surprising: the perf-notes repository sees the most commit traffic of all repositories hosted on GitLab, and was therefore the repository most likely to have been mid-write when the volume disappeared.

While it would likely be possible to recover the corrupted perf-notes objects from clones on the CI builders, I deemed this not worth the effort: these commits hold replaceable performance-metric data, and the last good commit appears to have been produced mere hours before the corrupted HEAD commit. Consequently, I instead reverted the perf-notes repository to the last-known-good commit (8154013bfdce86fedf2863cb96ccbb723f1144f8).

# Planned changes for future mitigation

While this incident didn't result in any significant data loss, it was nevertheless a significant headache and left CI failing for the better part of a day. Moreover, this isn't the first time that the network block storage backing GitLab has let us down, and at this point I have lost confidence that it can be trusted. For this reason, we will eliminate this failure mode by adjusting our deployment to avoid relying on network block storage for any deployment-critical data. We hope to carry out this change in the coming weeks.
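[Editor's note: as an illustration of the integrity check described in the post-mortem, here is a minimal sketch of how one might run `git fsck` over a collection of bare repositories. The storage path, the flat on-disk layout, and the script itself are assumptions for illustration only, not the actual procedure used on gitlab.haskell.org.]

```python
#!/usr/bin/env python3
"""Sketch: run `git fsck` over a directory of bare Git repositories.

The storage root below is a hypothetical example; a real GitLab deployment
may use a different path and a hashed on-disk layout.
"""
import subprocess
from pathlib import Path

# Hypothetical repository storage root (assumption, not the actual path).
REPO_ROOT = Path("/var/opt/gitlab/git-data/repositories")


def fsck(repo: Path) -> bool:
    """Run `git fsck --full` against a bare repository; return True if clean."""
    result = subprocess.run(
        ["git", "--git-dir", str(repo), "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"CORRUPT: {repo}")
        print(result.stderr.strip())
        return False
    return True


def main() -> None:
    repos = sorted(REPO_ROOT.glob("**/*.git"))
    bad = [repo for repo in repos if not fsck(repo)]
    print(f"checked {len(repos)} repositories, {len(bad)} corrupted")


if __name__ == "__main__":
    main()
```

Reverting the damaged repository to the last-known-good commit could then be done from an intact clone, e.g. `git reset --hard 8154013bfdce86fedf2863cb96ccbb723f1144f8` followed by a force push of the affected branch.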

Ben Gamari
Hi all,

Unfortunately it seems that our provider's block storage issues have once again reared their ugly head. I will be moving GitLab to a new server in the morning. Until then, expect gitlab.haskell.org to be down. Sorry for any inconvenience.

Cheers,

- Ben

Ben Gamari
A brief update: GitLab is now up again. Moreover, it has been upgraded to GitLab 13.0.6, which should hopefully bring some load-time performance improvements.

However, do be aware that the Docker images are still being transferred, so CI jobs may fail for the next few hours. Also, the Darwin CI runners need to be upgraded and will be down until this happens; I am hopeful that this will happen tomorrow.

Cheers,

- Ben