Windows testsuite failures

Hi all,
Currently Windows CI is a bit flaky due to some unfortunately rather elusive testsuite driver bugs. Progress in resolving this has been a bit slow due to travel over the last week, but I will be back home tomorrow and should be able to resolve the issue soon thereafter.
Cheers,
- Ben

Hi Ben,
Can we please disable Windows CI? I've spent more time fighting the CI than doing useful work this week; it's really frustrating.
Since we have no idea how to fix it, maybe we should test Windows only before a release, manually (and use bisect in case of regressions).
Ömer
On Tue, 14 Jan 2020, 14:30, Ben Gamari wrote:
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Sure, because only testing once every 6 months is a very, very good idea...
On Fri, Jan 17, 2020, 06:03 Ömer Sinan Ağacan wrote:

We release more often than once in 6 months.
We clearly have no idea how to test on Windows. If you know how to do it, then feel free to submit an MR. Otherwise, blocking every MR indefinitely is worse than testing Windows less frequently.
Ömer
On Fri, 17 Jan 2020, 09:10, Phyx wrote:

Oh, I spent a non-insignificant amount of time back in the Phabricator days making the CI stable. But because people were committing to master directly without going through CI, it was always a cat-and-mouse game, and I eventually gave up.
Now we have rewritten the CI and it's pointing out actual issues in the compiler. And your suggestion is: well, let's just ignore it.
How about you use some of that energy to help instead of taking the easy way? And I bet you're going to say you don't care about Windows, to which I would say I don't care about the non-threaded runtime and wish we would get rid of it. But you can't always get what you want.
And saying we'll actually fix things before the release doesn't align with what I've seen so far, which has had me scrambling at the last minute to ensure we could release for Windows instead of making releases without it.
Quite frankly, I don't need you to tell me to submit MRs to fix it, since that's what I have, again, spent a lot of time doing. Or maybe you would like to pay my salary so I can spend more than a considerable amount of my free time on it.
Kind regards,
Tamar
On Fri, Jan 17, 2020, 06:17 Ömer Sinan Ağacan wrote:

On Fri, 17 Jan 2020, 09:49, Phyx wrote:
> Now we have rewritten the CI and it's pointing out actual issues in the compiler. And your suggestion is: well, let's just ignore it.
When was the last time Windows CI caught an actual bug? All I see are random system failures [1, 2, 3]. It must be catching *some* bugs, but that's a rare event in my experience. Sure, I don't write Windows-specific code (e.g. the IO manager or library code), but then why am I fighting the Windows CI literally every day? It makes no sense. Give me an option to skip Windows CI for my patches.
> How about you use some of that energy to help instead of taking the easy way? And I bet you're going to say you don't care about Windows, to which I would say I don't care about the non-threaded runtime and wish we would get rid of it. But you can't always get what you want.
I'm not suggesting we release buggy GHCs for Windows or stop Windows support.
> And saying we'll actually fix things before the release doesn't align with what I've seen so far, which has had me scrambling at the last minute to ensure we could release for Windows instead of making releases without it.
Are you saying we skip a platform we support when it's buggy? That makes no sense. I don't know when Windows became a first-tier platform, but since it is one now, we should be releasing Windows binaries just like the Linux and OSX binaries. It's not uncommon to do some testing for every patch and more comprehensive testing before releases. We did this many times in other projects in the past, and I know some other compilers do this today.
> Quite frankly, I don't need you to tell me to submit MRs to fix it, since that's what I have, again, spent a lot of time doing. Or maybe you would like to pay my salary so I can spend more than a considerable amount of my free time on it.
I wish someone paid me for the time I've wasted, because I'm only paid for the time I spend productively. I'd be happier waiting for the CI then.
Ömer
[1]: https://gitlab.haskell.org/ghc/ghc/-/jobs/237457
[2]: https://gitlab.haskell.org/osa1/ghc/-/jobs/238236
[3]: https://gitlab.haskell.org/osa1/ghc/-/jobs/237279

On Fri, Jan 17, 2020 at 7:02 AM Ömer Sinan Ağacan wrote:
>> Now we have rewritten the CI and it's pointing out actual issues in the compiler. And your suggestion is: well, let's just ignore it.
> When was the last time Windows CI caught an actual bug? All I see are random system failures [1, 2, 3].
[1]: Symbolic link privileges are missing for the CI user; something has gone wrong with the permissions on that runner. There's code in the testsuite to either symlink or copy. We should fix the permissions, add permission detection to the Python code, or switch to copying.
[2]: A git checkout error; the disk is probably full. Testsuite runs tend to create a lot of temporary files which aren't cleaned up. Over time the disk fills up and you get errors such as these. There's a cron job to clean these up periodically, but of course that is prone to a race condition. This can be made more reliable by using OS event triggers instead of a cron job, i.e. monitoring for "disk 80% full" events and running the cleanup then.
[3]: It's either trying to execute a non-executable file, or something it executed loaded a shared library built for a different architecture. It's hard to tell which from that output alone; we'll need more logs.
Now, to answer your question, [4] and [5] are issues the CI caught that were quite important:
[4] https://gitlab.haskell.org/ghc/ghc/issues/17480
[5] https://gitlab.haskell.org/ghc/ghc/issues/17691
And that's just to name two. I can go on: plugin test failures, which revealed that someone had submitted a patch tested only on ELF that broke loading plugins as non-shared objects, etc. The list is quite long.
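Two of the diagnoses above lend themselves to a small sketch. The Python below (Python because the testsuite driver is written in it) is a hypothetical illustration, not code from GHC's CI. `cleanup_if_needed` approximates the "clean up at 80% full" idea by polling `shutil.disk_usage` rather than reacting to a real OS event, and it only removes temp entries older than an assumed age, which sidesteps the race with a testsuite run that is still using its files. `pe_machine` reads the Machine field of a PE header, which is one way to check whether a DLL involved in failure [3] was built for a different architecture. The threshold, the six-hour age limit and all function names are assumptions.

```python
import os
import shutil
import struct
import time
from typing import List, Optional

# Hypothetical threshold: clean up once the disk is 80% full, mirroring the
# "80% full" trigger suggested above.
DISK_FULL_THRESHOLD = 0.80

# Hypothetical safety margin: only delete temp entries untouched for this long,
# so a cleanup pass does not delete files a running testsuite still needs.
MIN_AGE_SECONDS = 6 * 60 * 60

# PE/COFF "Machine" field values for the two Windows targets discussed in this thread.
PE_MACHINES = {0x014C: "i386", 0x8664: "x86-64"}


def disk_usage_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def stale_entries(tmp_root: str, now: float, min_age: float = MIN_AGE_SECONDS) -> List[str]:
    """Entries directly under `tmp_root` not modified for at least `min_age` seconds."""
    stale = []
    with os.scandir(tmp_root) as entries:
        for entry in entries:
            if now - entry.stat(follow_symlinks=False).st_mtime >= min_age:
                stale.append(entry.path)
    return sorted(stale)


def cleanup_if_needed(tmp_root: str, threshold: float = DISK_FULL_THRESHOLD) -> List[str]:
    """Delete stale entries under `tmp_root`, but only once disk usage reaches `threshold`."""
    if disk_usage_fraction(tmp_root) < threshold:
        return []
    removed = []
    for path in stale_entries(tmp_root, time.time()):
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.unlink(path)
        except OSError:
            continue  # e.g. a file still held open by a running test on Windows
        removed.append(path)
    return removed


def pe_machine(data: bytes) -> Optional[str]:
    """Architecture recorded in a PE image's COFF header, or None if `data` is not PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None  # no DOS header, so not a Windows EXE or DLL
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew
    if len(data) < pe_offset + 6 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return None
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return PE_MACHINES.get(machine, hex(machine))
```

For example, calling `pe_machine` on the full bytes of a DLL from a failing job would report "i386" or "x86-64" for the two targets above, or None if the file is not a PE image at all, which covers the "non-executable file" branch of [3].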
> It must be catching *some* bugs, but that's a rare event in my experience.
Sure, if you go "ahh, it's just Windows that's broken" and don't look at the underlying issues.
> Sure, I don't write Windows-specific code (e.g. the IO manager or library code), but then why am I fighting the Windows CI literally every day? It makes no sense. Give me an option to skip Windows CI for my patches.
>> How about you use some of that energy to help instead of taking the easy way? And I bet you're going to say you don't care about Windows, to which I would say I don't care about the non-threaded runtime and wish we would get rid of it. But you can't always get what you want.
> I'm not suggesting we release buggy GHCs for Windows or stop Windows support.
I'm sorry, how is disabling the Windows CI not exactly that? Disabling the CI just means you test it even less. Testing it even less means that by the time you get to testing it, there are too many issues to fix. Over time you just stop trying and stop releasing it. So sorry, how *exactly* is your suggestion not exactly that?
>> And saying we'll actually fix things before the release doesn't align with what I've seen so far, which has had me scrambling at the last minute to ensure we could release for Windows instead of making releases without it.
> Are you saying we skip a platform we support when it's buggy? That makes no sense. I don't know when Windows became a first-tier platform, but since it is one now, we should be releasing Windows binaries just like the Linux and OSX binaries.
It's *always* been a tier-one platform as far as I can tell. It's certainly been one for the past 6 years.
> It's not uncommon to do some testing for every patch and more comprehensive testing before releases. We did this many times in other projects in the past, and I know some other compilers do this today.
Yes, but if a project doesn't test a tier-one platform during development, which is what you want to do, then it's not tier one. Which means you won't fix it for the release.
>> Quite frankly, I don't need you to tell me to submit MRs to fix it, since that's what I have, again, spent a lot of time doing. Or maybe you would like to pay my salary so I can spend more than a considerable amount of my free time on it.
> I wish someone paid me for the time I've wasted, because I'm only paid for the time I spend productively. I'd be happier waiting for the CI then.
Yeah, not waiting for CI is how we got into this mess in the first place.
Tamar

Ömer Sinan Ağacan writes:
>> Now we have rewritten the CI and it's pointing out actual issues in the compiler. And your suggestion is: well, let's just ignore it.
> When was the last time Windows CI caught an actual bug? All I see are random system failures [1, 2, 3].
> It must be catching *some* bugs, but that's a rare event in my experience.
It's unfortunately not nearly as rare as you would hope. However, these cases are likely not widely seen, since I have been marking broken tests as expect_broken over the last few months. Only recently (for roughly a month now, IIRC) has Windows been a mandatory-green platform, and it took a significant amount of effort and several false starts to get to this point.
Cheers,
- Ben

Both Tamar and Omer are right.
* Not doing CI on Windows is bad. It means that bugs get introduced and are not discovered until later. This is Tamar’s point, and it is valid.
* Holding up MRs because of failures in Windows CI that are unrelated to the MR is also bad. It is a frustrating waste of time, and discourages all authors. (In contrast, holding up an MR because it introduces a bug on Windows is fine, indeed desirable.) This is Omer’s point, and it is valid.
The obvious solution is: let’s fix Windows CI, so that it doesn’t fail except when the MR genuinely introduces a bug.
How hard would it be to do that? Do we even know what the problem is?
Simon

Simon Peyton Jones via ghc-devs writes:
> The obvious solution is: let’s fix Windows CI, so that it doesn’t fail except when the MR genuinely introduces a bug.
> How hard would it be to do that? Do we even know what the problem is?
This latest issue was quite tricky since the root cause was in an unexpected place (it seems that some of the Windows GitLab runners somehow no longer had symlink permission, perhaps due to an operating system update; I had expected the problem to be in the testsuite driver due to previous issues in that area [1]). Given the relative scarcity of Windows CI capacity and the difficulty of hitting the issue to begin with, it took quite a while to realize what the problem was. However, this morning I identified the issue and, as a workaround, temporarily disabled forced usage of symlinks on Windows CI. I have also opened #17706, which should allow us to use symlinks without fear of this potential breakage.
Cheers,
- Ben
[1] https://gitlab.haskell.org/ghc/ghc/commit/e35fe8d58f18bd179efdc848c617dc9edd...
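Tamar's suggestion of adding permission detection to the Python code, together with Ben's workaround of no longer forcing symlinks, can be sketched as a probe-then-fallback pattern. This is a hypothetical Python illustration, not the change made for #17706 (whose contents are not described in this thread): `can_symlink` tries to create a throwaway symlink once per run, and `link_or_copy` only symlinks when that probe succeeded, still falling back to a copy if a later `os.symlink` call fails. The function names and probe file names are assumptions.

```python
import os
import shutil
import tempfile


def can_symlink(scratch_dir: str) -> bool:
    """Probe whether this process may create symlinks inside `scratch_dir`.

    On Windows, os.symlink raises OSError when the account lacks the
    "Create symbolic links" privilege (and Developer Mode is not enabled),
    which matches the lost-permission situation described above.
    """
    target = os.path.join(scratch_dir, "symlink-probe-target")  # hypothetical name
    link = os.path.join(scratch_dir, "symlink-probe-link")  # hypothetical name
    with open(target, "w") as f:
        f.write("probe")
    try:
        os.symlink(target, link)
    except (OSError, NotImplementedError):
        return False
    else:
        os.unlink(link)
        return True
    finally:
        os.unlink(target)  # leave the scratch directory as we found it


def link_or_copy(src: str, dst: str, use_symlinks: bool) -> str:
    """Make `src` available at `dst`; returns "symlink" or "copy" for logging."""
    if use_symlinks:
        try:
            os.symlink(src, dst, target_is_directory=os.path.isdir(src))
            return "symlink"
        except OSError:
            pass  # the privilege may have disappeared after the probe; copy instead
    if os.path.isdir(src):
        shutil.copytree(src, dst)
    else:
        shutil.copy2(src, dst)
    return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("symlinks available:", can_symlink(scratch))
```

Probing once at startup and passing the result to every `link_or_copy` call keeps a permission change on a runner from surfacing as scattered, hard-to-reproduce test failures, while the per-call fallback still covers a privilege lost after the probe.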

Ömer Sinan Ağacan writes:
> Hi Ben,
> Can we please disable Windows CI? I've spent more time fighting the CI than doing useful work this week; it's really frustrating.
Yes, this recent spate of issues indeed took too long to solve. Unfortunately, this particular issue took quite a while to isolate, since I had trouble reproducing it: it depended upon which CI builder the job ran on. However, I eventually found the root cause and pushed a patch to fix it this morning.
> Since we have no idea how to fix it, maybe we should test Windows only before a release, manually (and use bisect in case of regressions).
As pointed out by Tamar, this really is not a viable option. In truth, the pain we are experiencing now is precisely because we neglected Windows support for so long. I am now making a concerted effort to fix this and, while it has been painful, we are gradually approaching a half-way decent story for Windows. x86-64 is, as of this morning, believed to be mostly stable; I'm now working on i386.
Cheers,
- Ben
participants (5)
- Ben Gamari
- Ben Gamari
- Phyx
- Simon Peyton Jones
- Ömer Sinan Ağacan