
Hi everyone,

As you may have noticed, yesterday was marked by rather consistent CI failures. I believe I have now fixed the root cause, but do let me know if you see any further issues.

Below I have included a brief post-mortem describing the incident and what we plan to do to prevent it from recurring in the future.

Cheers,

- Ben

# Post-mortem

Early Friday morning our storage provider experienced a hiccup which rendered the volume backing our GitLab repositories unavailable for a few minutes. The interruption was long enough that the filesystem remounted as read-only, causing a small amount of filesystem damage affecting a handful of objects in the ghc/perf-notes repository. This resulted in CI failures when attempts to `git push` to that repository failed.

To address this I started by ensuring we had an up-to-date backup of our data and began sorting through the various observed errors. Once I had established that the storage volume had been interrupted, I went looking for additional corruption. An integrity check of all GitLab repositories revealed no further damage. This isn't terribly surprising: the perf-notes repository sees the most commit traffic of all repositories hosted on GitLab, and was therefore the repository most likely to have been mid-write when the volume disappeared.

While it would likely be possible to recover the corrupted perf-notes objects from clones on the CI builders, I deemed this not worth the effort: these commits hold replaceable performance-metric data, and the last good commit appears to have been produced mere hours before the corrupted HEAD commit. Consequently, I instead reverted the perf-notes repository to the last-known-good commit (8154013bfdce86fedf2863cb96ccbb723f1144f8).

# Planned changes for future mitigation

While this incident didn't result in any significant data loss, it was nevertheless a significant headache and left CI failing for the better part of a day. Moreover, this isn't the first time that the network block storage backing GitLab has let us down, and at this point I have lost confidence that it can be trusted. For this reason, we will eliminate this failure mode by adjusting our deployment to avoid relying on network block storage for any deployment-critical data. We hope to carry out this change in the coming weeks.
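[Editor's note: as an illustration of the integrity check described in the post-mortem, here is a minimal sketch of how one might run `git fsck` over a collection of bare repositories. The storage path, the flat on-disk layout, and the script itself are assumptions for illustration only, not the actual procedure used on gitlab.haskell.org.]

```python
#!/usr/bin/env python3
"""Sketch: run `git fsck` over a directory of bare Git repositories.

The storage root below is a hypothetical example; a real GitLab deployment
may use a different path and a hashed on-disk layout.
"""
import subprocess
from pathlib import Path

# Hypothetical repository storage root (assumption, not the actual path).
REPO_ROOT = Path("/var/opt/gitlab/git-data/repositories")


def fsck(repo: Path) -> bool:
    """Run `git fsck --full` against a bare repository; return True if clean."""
    result = subprocess.run(
        ["git", "--git-dir", str(repo), "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"CORRUPT: {repo}")
        print(result.stderr.strip())
        return False
    return True


def main() -> None:
    repos = sorted(REPO_ROOT.glob("**/*.git"))
    bad = [repo for repo in repos if not fsck(repo)]
    print(f"checked {len(repos)} repositories, {len(bad)} corrupted")


if __name__ == "__main__":
    main()
```

Reverting the damaged repository to the last-known-good commit could then be done from an intact clone, e.g. `git reset --hard 8154013bfdce86fedf2863cb96ccbb723f1144f8` followed by a force push of the affected branch.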

Ben Gamari
Hi all,

Unfortunately it seems that our provider's block storage issues have once again reared their ugly head. I will be moving GitLab to a new server in the morning. Until then, expect gitlab.haskell.org to be down. Sorry for any inconvenience.

Cheers,

- Ben

Ben Gamari
A brief update: GitLab is now up again. Moreover, it has been upgraded to GitLab 13.0.6, which should hopefully bring some load-time performance improvements.

However, do be aware that the Docker images are still being transferred, so CI jobs may fail for the next few hours. Also, the Darwin CI runners need to be upgraded and will be down until this happens; I am hopeful that this will happen tomorrow.

Cheers,

- Ben