[GHC] #8591: Concurrent executions of ghc-pkg can cause inconstant package.cache files

#8591: Concurrent executions of ghc-pkg can cause inconstant package.cache files
------------------------------------+----------------------------
        Reporter:  janm             |            Owner:
            Type:  bug              |           Status:  new
        Priority:  normal           |        Milestone:
       Component:  Package system   |          Version:  7.6.3
        Keywords:  ghc-pkg race     | Operating System:  FreeBSD
    Architecture:  Unknown/Multiple |  Type of failure:  Other
      Difficulty:  Unknown          |        Test Case:
      Blocked By:                   |         Blocking:
 Related Tickets:                   |
------------------------------------+----------------------------

I am doing 24-way parallel builds of system images, including all packages on a system. This includes ghc and multiple ghc packages. I am seeing intermittent dependency failures from the ghc packaging system.

From examining Main.hs in ghc-pkg, I see that the function withFileAtomic writes to a temporary file in package.conf.d and then atomically renames it on top of a target file, package.cache in this case. With parallel execution the last rename would win, leading to lost entries in package.cache.

In my case, the following things happened ("Building" indicates a start, "Built" indicates completion, "Installing" is setup in a separate chroot'd environment and is isolated).

The FreeBSD ports system is used to drive the Haskell build system. The process works single-threaded and fails intermittently when done in parallel.
    Building: devel/hs-data-default-instances-base
    Building: devel/hs-data-default-instances-containers
    Building: devel/hs-data-default-instances-old-locale
    Built: devel/hs-dlist
    Building: devel/hs-data-default-instances-dlist
    Built: devel/hs-temporary
    Built: jail-image-full
    Installing: system-image__jail-image-full
    Built: devel/hs-base64-bytestring
    Built: archivers/hs-zlib
    Building: security/hs-digest
    Built: devel/hs-syb
    Building: textproc/hs-hs-bibutils
    Building: textproc/hs-pandoc-types
    Built: devel/hs-utf8-string
    Built: devel/hs-data-default-instances-old-locale
    Built: devel/hs-data-default-instances-containers
    Built: devel/hs-data-default-instances-base
    Built: devel/hs-data-default-instances-dlist
    Building: devel/hs-data-default
    Built: devel/hs-random
    Installed: system-image__lang/ghc
    Installing: system-image__archivers/hs-zlib
    Installing: system-image__devel/hs-utf8-string
    Installing: system-image__devel/hs-syb
    Installing: system-image__devel/hs-base64-bytestring
    Installing: system-image__devel/hs-data-default-class
    Installing: system-image__devel/hs-dlist
    Installing: system-image__devel/hs-random
    Installing: system-image__devel/hs-temporary
    Installing: system-image__devel/hs-extensible-exceptions
    Built: devel/hs-data-default FAILED

The error from the Haskell data-default build was:

    setup: At least the following dependencies are missing:
    data-default-instances-base -any

Looking in the package.conf.d directory shows that the data-default-instances-base-0.0.1-7bdf8678f0d8637e096e397e7910f82a.conf file was present, but running "ghc-pkg list" did not show data-default-instances-base.

Running /usr/local/lib/cabal/ghc-7.6.3/data-default-instances-base-0.0.1/register.sh (which was also present) caused ghc-pkg to now show data-default-instances-base.

To me this looks like a race condition between multiple instances of ghc-pkg causing the cache to become inconsistent.
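
The lost update described above can be reproduced in miniature. The sketch below is a simplified Python model, not ghc-pkg's Haskell code: `write_file_atomic` and `read_cache` are hypothetical helpers, the cache is a plain space-separated list rather than ghc-pkg's real format, and the two concurrent processes are simulated by interleaving their steps by hand so that the outcome is deterministic.

```python
import os
import tempfile


def write_file_atomic(target, text):
    """Write text to a temporary file next to target, then rename it over target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target))
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, target)  # atomic on POSIX, but the last rename wins


def read_cache(target):
    """Return the list of package names stored in the toy cache file."""
    with open(target) as f:
        return f.read().split()


db = tempfile.mkdtemp()
cache = os.path.join(db, "package.cache")
write_file_atomic(cache, "base")

# Both "processes" take their snapshot before either one has written.
seen_by_a = read_cache(cache)
seen_by_b = read_cache(cache)

# Each adds its own package to the snapshot it read and writes the result.
write_file_atomic(cache, " ".join(seen_by_a + ["data-default-instances-base"]))
write_file_atomic(cache, " ".join(seen_by_b + ["dlist"]))

final_cache = read_cache(cache)
print(final_cache)  # ['base', 'dlist']
```

Each rename is atomic, yet the entry written first is gone. That would be consistent with a .conf file being present while "ghc-pkg list" does not show the package.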
-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

Comment (by thomie):

The comment [https://github.com/ghc/ghc/blob/74a6a8a979837d1344fc3236ad6fc4ca76ea49a7/uti... /ghc-pkg/Main.hs#L1984 "Big fat hairy race condition"] might have something to do with this.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:1

Comment (by janm):

Replying to [comment:1 thomie]:
> The comment [https://github.com/ghc/ghc/blob/74a6a8a979837d1344fc3236ad6fc4ca76ea49a7/uti... /ghc-pkg/Main.hs#L1984 "Big fat hairy race condition"] might have something to do with this.

That comment is in the MINGW32 host or target branch of the #if, dealing with the case where renameFile can't rename over an existing file on the Win32 platform. The real race is depending on renameFile at all, as in the #else case of that #if, which is the case on the platform I'm using, FreeBSD.

The "atomic property" referred to in the comment is for single-process execution; the race I'm talking about is the race between two processes to call renameFile.

A better approach would be for the process to open the targetFile with O_EXCL (and create it with O_CREAT if it doesn't exist), and hold it open while doing the atomic rename on top of the target. That way the targetFile is used as a lock to synchronise updates between concurrent processes.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:2
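
A note on the flags named in comment 2: on POSIX systems, O_CREAT together with O_EXCL makes file creation exclusive (the open fails if the file already exists); it does not lock an existing file. The Python sketch below shows that behaviour and contrasts it with an advisory lock taken with `flock`. Reading the proposal as "take an exclusive advisory lock" (as `flock`, or the BSD open flag O_EXLOCK, would provide) is an assumption, and the file name is only illustrative. The `fcntl` module is Unix-only.

```python
import fcntl
import os
import tempfile

target = os.path.join(tempfile.mkdtemp(), "package.cache")

# The first exclusive-create open succeeds and creates the file.
fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
os.close(fd)

# A second O_CREAT | O_EXCL open fails outright because the file now exists,
# so this flag pair cannot be used to wait on an existing package.cache.
try:
    os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    second_open_failed = False
except FileExistsError:
    second_open_failed = True

# An advisory lock, by contrast, can be taken on a file that already exists.
fd = os.open(target, os.O_RDWR | os.O_CREAT, 0o644)
fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # nobody else holds it, so no blocking
lock_taken = True
os.close(fd)  # closing the descriptor releases the lock
```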

Changes (by pgj):

 * owner:  => pgj

Comment:

Wow, I did not even know that such a ticket exists. Sorry for missing this.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:3

Comment (by pgj):

Replying to [comment:2 janm]:
> A better approach would be for the process to open the targetFile with O_EXCL (and create it with O_CREAT if it doesn't exist), and hold it open while doing the atomic rename on top of the target. That way the targetFile is used as a lock to synchronise updates between concurrent processes.

Does the same problem persist with GHC 7.8.x? Unfortunately, I do not own such beefy hardware, although I may be able to get some help in that respect from the FreeBSD Project.

What is the version of the operating system (that is, ideally {{{OSVERSION}}})? I guess the architecture is {{{amd64}}}.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:4

Comment (by janm):

I am running a test build now, but code inspection shows that the same race is present. I expect it to fail.

We resolved the issue by single-threading all package installation in the build process. We are running on 9.3-p5 at the moment -- it was probably 9.2-RELEASE or 9-STABLE at the time of the original bug report.

I also see that the FreeBSD port for lang/ghc limits concurrency to 4 jobs, added in SVN rev 348842. Assuming you're also pgj@freebsd you committed the change, so probably know! This was in response to FreeBSD bug ports/186829 (see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=186829 ). The same problem seems to be the underlying cause. The error trace in the bug report shows cabal failing to install a package, just like I saw in my system build processes. Resolving the problem in ghc-pkg should resolve this problem and allow lang/ghc builds to run with full parallelism.

In principle, the withFileAtomic function should (in order):

 1. take a lock on the target file (creating it if it is not present)
 2. create the temporary file
 3. write to the temporary file
 4. close the temporary file
 5. rename the temporary file on top of the locked target file
 6. close the original target file, which is now no longer present in the file system.

Concurrent execution will now be serialised around the lock on the target file.

This doesn't resolve the "big hairy race condition" on Windows, but it should resolve the big hairy race condition I'm hitting that isn't mentioned in the code comments.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:5
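
The six steps listed in comment 5 can be sketched as follows. This is a Python illustration under stated assumptions, not a patch for ghc-pkg: `with_file_atomic_locked` is a hypothetical name, and `fcntl.flock` (Unix-only) is assumed as the locking primitive because the ticket does not name one.

```python
import fcntl
import os
import tempfile


def with_file_atomic_locked(target, text):
    """Replace target with text, following the six steps from comment 5."""
    # 1. take a lock on the target file (creating it if it is not present)
    lock_fd = os.open(target, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(lock_fd, fcntl.LOCK_EX)  # blocks while another process holds it
        # 2. create the temporary file in the same directory as the target
        tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(target) or ".")
        # 3. write to the temporary file, 4. close it (end of the with block)
        with os.fdopen(tmp_fd, "w") as tmp:
            tmp.write(text)
        # 5. rename the temporary file on top of the locked target file
        os.replace(tmp_path, target)
    finally:
        # 6. close the original target file, which is no longer in the file system
        os.close(lock_fd)


demo_dir = tempfile.mkdtemp()
demo_target = os.path.join(demo_dir, "package.cache")
with_file_atomic_locked(demo_target, "first")
with_file_atomic_locked(demo_target, "second")
with open(demo_target) as f:
    demo_result = f.read()
```

One caveat with this design: an `flock` lock belongs to the open file, not to the path. After step 5, a process that was waiting on the old file acquires a lock on a file that is no longer in the directory, while a newcomer can lock the new one at the same time. Locking a separate file that is never renamed would avoid that.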

Comment (by thomie):

Not reading this thread closely, but since there's also a comment that says "copied from Cabal's Distribution.Simple.Utils", there might be a function in there now that does exactly what you need:
https://github.com/haskell/cabal/blob/master/Cabal/Distribution/Simple/Utils...

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:6

Comment (by janm):

Thanks for the reply. A quick code inspection of the renameFile function shows the same race. The function is atomic for single-threaded execution (i.e. all or nothing for the updated file), but updates can be lost when there is concurrent execution of two processes.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:7

Comment (by pgj):

Replying to [comment:5 janm]:
> I am running a test build now but code inspection shows that the same race is present. I expect it to fail.

All right, excellent, thanks.

> I also see that the FreeBSD port for lang/ghc limits concurrency to 4 jobs, added in SVN rev 348842. Assuming you're also pgj@freebsd you committed the change, so probably know! This was in response to FreeBSD bug ports/186829.

Well, to be honest, I did not realize that it was a related issue. In the past, the GHC build system already had problems with parallel builds on FreeBSD due to e.g. low-resolution timestamps in the file system.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:8

Comment (by pgj):

Replying to [comment:6 thomie]:
> Not reading this thread closely, but since there's also a comment that says "copied from Cabal's Distribution.Simple.Utils", there might be a function in there now that does exactly what you need:
> https://github.com/haskell/cabal/blob/master/Cabal/Distribution/Simple/Utils...

Yeah, I know that -- this is the issue that has also been causing me headaches on my Windows builders recently.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:9

Comment (by janm):

Replying to [comment:8 pgj]:
> Replying to [comment:5 janm]:
> > I am running a test build now but code inspection shows that the same race is present. I expect it to fail.
>
> All right, excellent, thanks.

Interestingly enough, it didn't fail. I suspect that this is because we have switched to pkg from the old-style packaging system on FreeBSD. We have local modifications to retry database access due to contention during package installation. This may be serialising execution of the Haskell installation processes.

I will remove the JOBS limit in lang/ghc and rerun -- that should fail because there is no pkg database.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:10

Changes (by thomie):

 * os:  FreeBSD => Unknown/Multiple
 * related:  => #10205

Comment:

janm: are you installing the packages with cabal or manually?

The GHC build system (in ghc.mk) contains the following comment:
{{{
# register the boot packages in strict sequence, because running
# multiple ghc-pkgs in parallel doesn't work (registrations may get
# lost).
}}}

Presumably `cabal -j` also registers the packages sequentially, or there'd be many more reported issues.

Changing just `withFileAtomic` as you propose in comment:2 and comment:5 wouldn't solve the problem. Two concurrent ghc-pkg processes still wouldn't know about each other's modifications to the package database, so the last one to call `withFileAtomic` would still win.

-- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8591#comment:11
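
The objection in comment 11 suggests that any lock would have to cover the whole read-modify-write cycle, not only the final write. The following Python sketch shows one way that could look. It is a toy model under stated assumptions: `database_lock`, `register` and the lock file name `registration.lock` are hypothetical, the cache is a plain text list, `fcntl.flock` is Unix-only, and threads stand in for separate ghc-pkg processes (each thread opens its own descriptor, so the locks still exclude one another).

```python
import fcntl
import os
import tempfile
import threading
from contextlib import contextmanager


@contextmanager
def database_lock(db_dir):
    """Hold an exclusive advisory lock on a dedicated, never-renamed lock file."""
    # "registration.lock" is a hypothetical name chosen for this sketch.
    fd = os.open(os.path.join(db_dir, "registration.lock"),
                 os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def register(db_dir, package):
    """Add package to the toy cache; read, modify and write all happen under the lock."""
    cache = os.path.join(db_dir, "package.cache")
    with database_lock(db_dir):
        try:
            with open(cache) as f:
                packages = f.read().split()
        except FileNotFoundError:
            packages = []
        packages.append(package)
        fd, tmp = tempfile.mkstemp(dir=db_dir)
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(packages))
        os.replace(tmp, cache)


db_dir = tempfile.mkdtemp()
names = [f"pkg-{i}" for i in range(8)]
workers = [threading.Thread(target=register, args=(db_dir, n)) for n in names]
for w in workers:
    w.start()
for w in workers:
    w.join()

with open(os.path.join(db_dir, "package.cache")) as f:
    registered = sorted(f.read().split())
```

Because each registration reads the cache only after it holds the lock, every writer sees the entries added by the writers before it, and all eight names end up in the file regardless of scheduling.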