[GHC] #9266: getDirectoryContents blow its stack in a huge directory

#9266: getDirectoryContents blow its stack in a huge directory
-------------------------------------------+-------------------------------
Reporter:      joeyhess                    | Owner:
Type:          bug                         | Status:            new
Priority:      normal                      | Milestone:
Component:     libraries/directory         | Version:           7.8.2
Keywords:                                  | Operating System:  Linux
Architecture:  Unknown/Multiple            | Type of failure:   None/Unknown
Difficulty:    Easy (less than 1 hour)     | Test Case:
Blocked By:                                | Blocking:
Related Tickets:                           |
-------------------------------------------+-------------------------------
Once a directory has around 2 million files in it, a lack of an
accumulator in getDirectoryContents (unix version only; windows already
has an acc) causes it to blow the stack:

{{{
joey@darkstar:~/src/git-annex>cat test.hs
import System.Directory

main = do
        l <- getDirectoryContents "/tmp/big"
        print (null l)
joey@darkstar:~/src/git-annex>ghc --make -O2 test
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
joey@darkstar:~/src/git-annex>./test
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
}}}

I suggest the attached patch.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
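The attached patch is not reproduced in this archive. As a rough sketch of what "adding an accumulator" could look like, assuming the `unix` package's `System.Posix.Directory` API (where `readDirStream` returns an empty string at the end of the stream), something along these lines keeps the loop tail-recursive. The name `listDirAcc` is hypothetical and is not the function from the patch.

```haskell
import Control.Exception (bracket)
import System.Posix.Directory (DirStream, closeDirStream, openDirStream, readDirStream)

-- Hypothetical helper, not the actual patch from the ticket.
-- Reads every entry of a directory using an explicit accumulator, so the
-- recursive call is in tail position and the stack does not grow with the
-- number of entries.
listDirAcc :: FilePath -> IO [FilePath]
listDirAcc path = bracket (openDirStream path) closeDirStream (go [])
  where
    go :: [FilePath] -> DirStream -> IO [FilePath]
    go acc ds = do
      entry <- readDirStream ds
      if null entry          -- empty string signals end of the stream
        then return acc      -- entries come out in reverse read order
        else go (entry : acc) ds
```

The whole list is still held in memory; only the stack usage changes. The entries are returned in the reverse of the order in which they were read.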

Comment (by joeyhess):

While my patch avoids the stack overflow, getDirectoryContents still
seems to be using more memory than I would expect if it's lazily
generating the list of files. The test program uses around 600 MB.

I think it also needs to use unsafeInterleaveIO, but I have not been able
to get that to work.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:1

Comment (by joeyhess):

I've gotten it to work with unsafePerformIO. I can provide a patch, but I
don't think you'll find this approach acceptable. The problem is that any
errors would be deferred until the list is consumed, i.e. the readFile
problem all over again.

With that said, it does work; the test program now uses only 1776 kB, and
a variant that gets the length of the list behaves similarly well.

I wonder if it would be better to add an

    overDirectoryContents :: FilePath -> (FilePath -> IO ()) -> IO ()

or something like that? This would work in my particular use case, at
least.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:2
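A minimal sketch of how the proposed `overDirectoryContents` might be written, assuming the `unix` package's `System.Posix.Directory` API. The ticket only gives the type signature; the body below is an illustration, not code from the thread.

```haskell
import Control.Exception (bracket)
import Control.Monad (unless)
import System.Posix.Directory (DirStream, closeDirStream, openDirStream, readDirStream)

-- Illustrative body for the signature proposed in the comment above.
-- Runs the action on each entry as it is read, so no list of entries is
-- ever built; errors surface while the call is still running.
overDirectoryContents :: FilePath -> (FilePath -> IO ()) -> IO ()
overDirectoryContents path action =
  bracket (openDirStream path) closeDirStream loop
  where
    loop :: DirStream -> IO ()
    loop ds = do
      entry <- readDirStream ds   -- "" marks the end of the stream
      unless (null entry) $ do
        action entry
        loop ds
```

For example, `overDirectoryContents "/tmp/big" putStrLn` would print each entry name (including `.` and `..` on typical systems) without buffering the directory listing.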

Comment (by nomeata):

This seems to be an instance of the problem described at
http://www.joachim-breitner.de/blog/archives/620-Constructing-a-list-in-a-Monad.html
(for which I have not found a suitable solution yet).

Have you tried passing a dlist in the accumulator? OTOH, it shouldn’t
perform better than your first patch, just result in a different order...

You say you use 7.8. Does this mean that the rumors that the stack would
be unlimited in 7.8 are not true?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:3

Changes (by joeyhess):

 * difficulty:  Easy (less than 1 hour) => Unknown
 * version:  7.8.2 => 7.6.2

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:4

Comment (by joeyhess):

Sorry, I had the wrong ghc version entered before.

It might indeed save some memory to deepseq the accumulator as described
in the blog post. I have not tried, since in my use case I want to avoid
buffering the whole list in memory at all.

I'm currently using a getDirectoryContents' that uses unsafeInterleaveIO.
To avoid exceptions being thrown after the call has succeeded, when the
returned list is traversed, I made it catch and discard exceptions from
Posix.readDirStream etc., so in an exceptional condition the list may not
contain all items in the directory. That was ok in my use case, but I
dunno if it would be acceptable for the real getDirectoryContents.

It would probably be fine to just fix it to not blow the stack, and
perhaps add a note to its documentation that the list of directory
contents is not streamed lazily. (Although note that, e.g.,
removeDirectoryRecursive uses getDirectoryContents and so can also
unexpectedly use large amounts of memory.)

I do wonder if conduit has a better way to handle this.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:5
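The getDirectoryContents' mentioned above is not shown in the thread. The following is only a sketch of the described technique (lazy generation via `unsafeInterleaveIO`, with exceptions from the read caught and discarded), assuming the `unix` package's `System.Posix.Directory`. The name `lazyDirectoryContents` is hypothetical.

```haskell
import Control.Exception (SomeException, try)
import System.IO.Unsafe (unsafeInterleaveIO)
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)

-- Hypothetical reconstruction of the idea, not the author's code.
-- Each cons cell is produced only when the list is demanded. Any exception
-- while reading ends the list early instead of propagating, so the result
-- may silently omit entries.
lazyDirectoryContents :: FilePath -> IO [FilePath]
lazyDirectoryContents path = do
  ds <- openDirStream path        -- errors opening the directory still throw here
  let loop = unsafeInterleaveIO $ do
        r <- try (readDirStream ds) :: IO (Either SomeException FilePath)
        case r of
          Right entry | not (null entry) -> (entry :) <$> loop
          _ -> do
            -- End of stream or a read error: close and terminate the list.
            _ <- try (closeDirStream ds) :: IO (Either SomeException ())
            return []
  loop
```

One caveat of this sketch: the directory stream is closed only when the list is consumed to its end, so a partially consumed list leaks the handle. Catching `SomeException` also swallows asynchronous exceptions, which a production version would likely want to avoid.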

Comment (by joeyhess):

One other idea:

{{{
openDirectory :: FilePath -> IO DirHandle
readDirectory :: DirHandle -> IO (Maybe FilePath) -- closes DirHandle automatically at end
}}}

If directory included that interface, then code that needs to stream the
directory could easily do so. removeDirectoryRecursive could use that
instead of getDirectoryContents and avoid ever using a lot of memory. It
could also probably be used to make a conduit, etc.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:6
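One way the two signatures above could be realised, sketched on top of the `unix` package's `System.Posix.Directory`. The representation of `DirHandle` as an `IORef` is an assumption made here for illustration; the ticket does not specify it.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.Posix.Directory (DirStream, closeDirStream, openDirStream, readDirStream)

-- Assumed representation: Nothing once the underlying stream has been closed.
newtype DirHandle = DirHandle (IORef (Maybe DirStream))

openDirectory :: FilePath -> IO DirHandle
openDirectory path = do
  ds <- openDirStream path
  DirHandle <$> newIORef (Just ds)

-- Returns Nothing at the end of the directory and closes the stream then;
-- later calls keep returning Nothing rather than touching a closed stream.
readDirectory :: DirHandle -> IO (Maybe FilePath)
readDirectory (DirHandle ref) = do
  st <- readIORef ref
  case st of
    Nothing -> return Nothing
    Just ds -> do
      entry <- readDirStream ds   -- "" marks the end of the stream
      if null entry
        then do
          closeDirStream ds
          writeIORef ref Nothing
          return Nothing
        else return (Just entry)
```

This sketch is not thread-safe, and a handle that is abandoned before the end of the directory is never closed; a real interface would probably also export an explicit close operation.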

Comment (by nomeata):

I was under the impression that people tend to think badly about lazy IO
these days, so I’m not sure if it makes sense to make
`getDirectoryContents` lazy. Unless of course most people expect it to be.

I think a way forward is to make `getDirectoryContents` non-lazy, but not
blowing the stack (if that still is a problem on 7.8), and alternatively
think about an interface for people who need more control over that. Maybe
http://hackage.haskell.org/package/filesystem-conduit-1.0.0.2/docs/Data-Conduit-Filesystem.html
is already that solution.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:7

Comment (by joeyhess):

Data.Conduit.Filesystem uses listDirectory to generate a list, and loops
over it instead.

listDirectory comes from system-fileio, and on Windows just uses
getDirectoryContents. On unix, it essentially re-implements
getDirectoryContents (unsure why).

So, it does not avoid buffering [FilePath] in memory. But, looking at
Data.Conduit.Filesystem, it could certainly be changed to use the
openDirectory/readDirectory interface prototyped above and avoid that
problem. Essentially, it would:

    liftIO (readDirectory h) >>= yield

In fact, system-fileio has internally a similar openDir/readDir/closeDir,
although that interface is not exported.

So, I think adding that low-level interface to directory or somewhere and
using it in conduit etc. would be a good plan.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:8

#9266: getDirectoryContents blow its stack in a huge directory
-------------------------------------+-------------------------------------
Reporter:          joeyhess          | Owner:            ekmett
Type:              bug               | Status:           upstream
Priority:          normal            | Milestone:
Component:         Core Libraries    | Version:          7.6.2
Resolution:                          | Keywords:
Operating System:  Linux             | Architecture:     Unknown/Multiple
Type of failure:   None/Unknown      | Difficulty:       Unknown
Test Case:                           | Blocked By:
Blocking:                            | Related Tickets:
Differential Revisions:              |
-------------------------------------+-------------------------------------
Changes (by thomie):

 * cc: core-libraries-committee@… (added)
 * status:  new => upstream

Comment:

 joeyhess: you might have more luck submitting a pull request to
 http://github.com/haskell/directory, and poking someone from the
 [http://www.haskell.org/haskellwiki/Core_Libraries_Committee corelibcom]
 to look at it.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:10

Comment (by snoyberg):

For the record, recent versions of conduit-extra provide a proper
streaming solution, e.g.:

http://hackage.haskell.org/package/conduit-extra-1.1.6.2/docs/src/Data-Conduit-Filesystem.html#sourceDirectory

This is based on non-conduit-specific code available in
streaming-commons:

http://hackage.haskell.org/package/streaming-commons-0.1.9.1/docs/Data-Streaming-Filesystem.html

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:11
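As an illustration of the streaming-commons interface mentioned above, here is a small sketch that counts directory entries in constant memory. It assumes `Data.Streaming.Filesystem` exports `openDirStream`, `readDirStream` (returning `Maybe FilePath`, with `Nothing` at the end) and `closeDirStream`; check the linked documentation for the exact signatures in your version. The helper name `countEntries` is made up for this example.

```haskell
import Control.Exception (bracket)
import Data.Streaming.Filesystem (DirStream, closeDirStream, openDirStream, readDirStream)

-- Hypothetical helper: counts the entries of a directory one at a time,
-- never holding more than a single entry name in memory.
countEntries :: FilePath -> IO Int
countEntries path = bracket (openDirStream path) closeDirStream (go 0)
  where
    go :: Int -> DirStream -> IO Int
    go n ds = do
      next <- readDirStream ds
      case next of
        Nothing -> return n
        -- Force the counter so no chain of (+1) thunks builds up.
        Just _  -> let n' = n + 1 in n' `seq` go n' ds
```

This corresponds to the "variant that gets the length of the list" discussed earlier in the thread, but without building a list at all.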

Changes (by snoyberg):

 * owner:  ekmett => snoyberg

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:12

Comment (by snoyberg):

Reviewing your code, it looks incredibly similar to the code that I wrote
in streaming-commons. I think it does make sense to include such a patch
in directory, but that should likely be handled via a pull request there.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:13

Changes (by snoyberg):

 * status:  upstream => closed
 * resolution:  => fixed

Comment:

 I've implemented a fix for the stack overflow at:

 https://github.com/haskell/directory/pull/17

 For the streaming interface: please send a pull request with your
 changes. I'm going to close this ticket as resolved.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/9266#comment:14