GitLab partial outage - attempting to mitigate - ghc-devs


GitLab partial outage - attempting to mitigate

Bryan Richter

20 Mar 2023

12:41 p.m.

I am seeing a few different problems with the GitLab server right now. I am gonna try to mitigate the issues, so the server might be unavailable for a few short periods.


I eventually resorted to a server reboot, which cleared up all the problems I was seeing. I think we're back in business.

Symptoms were:

* No new data coming into https://grafana.gitlab.haskell.org/d/iiCppweMz/marge-bot?orgId=2&from=now-7d&to=now&refresh=30m
* High-frequency repetition of the system log message "systemd-journald[1622008]: Failed to open runtime journal: Device or resource busy"
* ~50% failure rate connecting to the server with ssh

None of those are happening anymore.

-Bryan

On Mon, 20 Mar 2023 at 14:41, Bryan Richter wrote:

...

I am seeing a few different problems with the GitLab server right now. I am gonna try to mitigate the issues, so the server might be unavailable for a few short periods.
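
A symptom like the ~50% ssh failure rate listed above can be roughly quantified by repeating a connection attempt and counting the failures. The following is a minimal sketch, not the method used in this thread: `tcp_connect` and `failure_rate` are hypothetical helpers, the host name in the usage comment is a placeholder, and a bare TCP connect to port 22 only approximates a full ssh login.

```python
import socket
from typing import Callable


def tcp_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def failure_rate(attempt: Callable[[], bool], attempts: int = 20) -> float:
    """Call `attempt` repeatedly and return the fraction of calls that failed."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = sum(1 for _ in range(attempts) if not attempt())
    return failures / attempts


# Hypothetical usage (placeholder host, performs real network connections):
#   rate = failure_rate(lambda: tcp_connect("gitlab.example.org"))
#   print(f"{rate:.0%} of connection attempts failed")
```

Passing the attempt in as a callable keeps the counting logic separate from the network call, so it can be exercised without touching a real server.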

Ben Gamari

1:25 p.m.

Bryan Richter via ghc-devs writes:

...

I eventually resorted to a server reboot, which cleared up all the problems I was seeing. I think we're back in business.

The root partition was close to running out of disk space yesterday. The problem appears to be that /nix is located on the small system drive. We should really address this although moving /nix is sadly not easy and will certainly require downtime.

Cheers,

- Ben
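
A root partition that is close to full can be caught earlier with a periodic usage check. The sketch below is only an illustration under stated assumptions: `percent_used` and `needs_attention` are hypothetical names, the 70% default threshold is an arbitrary choice, and the percentage may differ slightly from what `df` reports because of reserved blocks.

```python
import shutil


def percent_used(path: str = "/") -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def needs_attention(percent: float, threshold: float = 70.0) -> bool:
    """Return True when usage is at or above the (assumed) alert threshold."""
    return percent >= threshold


if __name__ == "__main__":
    pct = percent_used("/")
    status = "needs attention" if needs_attention(pct) else "ok"
    print(f"/ is {pct:.1f}% full ({status})")
```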

Brandon Allbery

1:40 p.m.

Isn't it just "move /nix out of the way, bind mount a new one from a larger drive, use rsync to move the data"?

On Mon, Mar 20, 2023 at 9:25 AM Ben Gamari wrote:

...

Bryan Richter via ghc-devs writes:

...
I eventually resorted to a server reboot, which cleared up all the problems I was seeing. I think we're back in business.

The root partition was close to running out of disk space yesterday. The problem appears to be that /nix is located on the small system drive. We should really address this although moving /nix is sadly not easy and will certainly require downtime.

Cheers,

- Ben


-- brandon s allbery kf8nh allbery.b@gmail.com

Ben Gamari

3:32 p.m.

Brandon Allbery writes:

...

Isn't it just "move /nix out of the way, bind mount a new one from a larger drive, use rsync to move the data"?

Something like that, yes [1].

Cheers,

- Ben

[1] https://nixos.wiki/wiki/Storage_optimization#Moving_the_store
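
The rsync and bind-mount steps discussed above can be written down as a dry-run plan before anyone touches the server. This is a hedged sketch, not the procedure from the linked wiki page or the one eventually used: `plan_store_move` is a hypothetical helper, `/mnt/big/nix` is a made-up target path, and the plan copies first and mounts second (a reordering of the quoted steps) so that /nix stays populated until the mount is in place. It only builds and prints commands; it does not run them.

```python
import shlex
from pathlib import PurePosixPath


def plan_store_move(new_store: str, store: str = "/nix") -> list[list[str]]:
    """Build (but do not run) the commands for relocating `store`."""
    src = PurePosixPath(store)
    dst = PurePosixPath(new_store)
    if not dst.is_absolute():
        raise ValueError("new_store must be an absolute path")
    if dst == src or src in dst.parents:
        raise ValueError("new_store must live outside the current store")
    return [
        # Create the target directory on the larger drive.
        ["mkdir", "-p", str(dst)],
        # Copy everything, preserving hard links (-H), ACLs (-A), xattrs (-X).
        ["rsync", "-aHAX", f"{src}/", f"{dst}/"],
        # Make the copy visible at the original location.
        ["mount", "--bind", str(dst), str(src)],
    ]


if __name__ == "__main__":
    for cmd in plan_store_move("/mnt/big/nix"):
        print(shlex.join(cmd))
```

The plan leaves out stopping anything that writes to the store during the copy, making the mount persistent across reboots, and reclaiming the old data, all of which a real migration would need to handle during the downtime mentioned earlier in the thread.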

Ben Gamari

1:44 p.m.

Ben Gamari writes:

...

Bryan Richter via ghc-devs writes:

...
I eventually resorted to a server reboot, which cleared up all the problems I was seeing. I think we're back in business.

The root partition was close to running out of disk space yesterday. The problem appears to be that /nix is located on the small system drive. We should really address this although moving /nix is sadly not easy and will certainly require downtime.

In the meantime, I have significantly reduced the number of snapshots retained in the root dataset. This brought disk usage down from 70% to 12%, which should keep us afloat for a long while.

Cheers,

- Ben
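
Reducing the number of retained snapshots amounts to a retention rule: keep the newest few and expire the rest. The message does not say which snapshot tool is in use or how many snapshots were kept, so the following is only a generic sketch with hypothetical names (`Snapshot`, `split_by_retention`) and made-up snapshot names.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class Snapshot(NamedTuple):
    name: str
    created: datetime


def split_by_retention(
    snapshots: Iterable[Snapshot], keep: int
) -> tuple[list[Snapshot], list[Snapshot]]:
    """Return (kept, expired): the `keep` newest snapshots and all older ones."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(snapshots, key=lambda s: s.created, reverse=True)
    return ordered[:keep], ordered[keep:]


if __name__ == "__main__":
    # Made-up daily snapshots for illustration only.
    snaps = [Snapshot(f"root@day{d:02d}", datetime(2023, 3, d)) for d in range(1, 8)]
    kept, expired = split_by_retention(snaps, keep=2)
    print("keep:  ", [s.name for s in kept])
    print("expire:", [s.name for s in expired])
```

Only the selection is shown here; actually deleting the expired snapshots depends on the storage system and is deliberately left out.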

