HEAD: Deterioration in ByteString I/O

While trying out HEAD (specifically, ghc-6.13.20100831-src.tar.bz2 built with 6.12.3) to investigate an issue with the text package, I found that I/O of ByteStrings has become significantly slower (on my machine at least):

$ uname -a
Linux linux-mkk1 2.6.27.48-0.2-pae #1 SMP 2010-07-29 20:06:52 +0200 i686 i686 i386 GNU/Linux

(Pentium 4, 3.06GHz)

Timings for reading and outputting a 74.3MB file:

cat:
$ time cat bigfile > /dev/null
0.00user 0.04system 0:00.06elapsed 83%CPU

ghc-6.12.3:
$ time ./nbench lazyBSNull bigfile a b > /dev/null
0.01user 0.09system 0:00.10elapsed 100%CPU

ghc-6.13.20100831:
$ time ./hdbench lazyBSNull bigfile a b > /dev/null
0.07user 0.10system 0:00.18elapsed 96%CPU

In addition to the slowdown, the allocation behaviour has become quite bad:

ghc-6.12.3:
     89,330,672 bytes allocated in the heap
         15,092 bytes copied during GC
         35,980 bytes maximum residency (1 sample(s))
         29,556 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)

ghc-6.13.20100831:
    475,305,720 bytes allocated in the heap
         89,272 bytes copied during GC
         68,860 bytes maximum residency (1 sample(s))
         29,444 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)

daniel.is.fischer:
While trying out HEAD (specifically, ghc-6.13.20100831-src.tar.bz2 built with 6.12.3) to investigate an issue with the text package, I found that I/O of ByteStrings has become significantly slower (on my machine at least:
$ uname -a
Linux linux-mkk1 2.6.27.48-0.2-pae #1 SMP 2010-07-29 20:06:52 +0200 i686 i686 i386 GNU/Linux
Pentium 4, 3.06GHz).
Timings for reading and outputting a 74.3MB file:
cat:
$ time cat bigfile > /dev/null
0.00user 0.04system 0:00.06elapsed 83%CPU

ghc-6.12.3:
$ time ./nbench lazyBSNull bigfile a b > /dev/null
0.01user 0.09system 0:00.10elapsed 100%CPU

ghc-6.13.20100831:
$ time ./hdbench lazyBSNull bigfile a b > /dev/null
0.07user 0.10system 0:00.18elapsed 96%CPU
In addition to the slowdown, the allocation behaviour has become quite bad:
ghc-6.12.3:
     89,330,672 bytes allocated in the heap
         15,092 bytes copied during GC
         35,980 bytes maximum residency (1 sample(s))
         29,556 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)

ghc-6.13.20100831:
    475,305,720 bytes allocated in the heap
         89,272 bytes copied during GC
         68,860 bytes maximum residency (1 sample(s))
         29,444 bytes maximum slop
              2 MB total memory in use (0 MB lost due to fragmentation)
Can you put your benchmark code somewhere? Likely a GHC regression. -- Don

On Wednesday 08 September 2010 18:10:26, Don Stewart wrote:
Can you put your benchmark code somewhere?
Boiled down to the bare minimum:

module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
    (file : _) <- getArgs
    L.readFile file >>= L.putStr

Then all you need is a file of nontrivial size (a few tens of KB is enough to show it).
Likely a GHC regression.
That's what I think.
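For reference, a minimal sketch of how to build and time this reproduction (the binary name bsput is a placeholder, not one of the thread's nbench/hdbench benchmarks):

$ ghc -O2 Main.hs -o bsput
$ time ./bsput bigfile > /dev/null

Building the same source with each compiler and timing both binaries allows comparing against the figures quoted at the top of the thread.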

| > ghc-6.12.3:
| >      89,330,672 bytes allocated in the heap
| >          15,092 bytes copied during GC
| >          35,980 bytes maximum residency (1 sample(s))
| >          29,556 bytes maximum slop
| >               2 MB total memory in use (0 MB lost due to fragmentation)
| >
| > ghc-6.13.20100831:
| >     475,305,720 bytes allocated in the heap
| >          89,272 bytes copied during GC
| >          68,860 bytes maximum residency (1 sample(s))
| >          29,444 bytes maximum slop
| >               2 MB total memory in use (0 MB lost due to fragmentation)
|
| Can you put your benchmark code somewhere? Likely a GHC regression.

Indeed bad. If someone could characterise the regression more precisely (e.g. "fusion isn't happening here") that would be jolly helpful.

Simon

simonpj:
| > ghc-6.12.3:
| >      89,330,672 bytes allocated in the heap
| >          15,092 bytes copied during GC
| >          35,980 bytes maximum residency (1 sample(s))
| >          29,556 bytes maximum slop
| >               2 MB total memory in use (0 MB lost due to fragmentation)
| >
| > ghc-6.13.20100831:
| >     475,305,720 bytes allocated in the heap
| >          89,272 bytes copied during GC
| >          68,860 bytes maximum residency (1 sample(s))
| >          29,444 bytes maximum slop
| >               2 MB total memory in use (0 MB lost due to fragmentation)
|
| Can you put your benchmark code somewhere? Likely a GHC regression.
Indeed bad. If someone could characterise the regression more precisely (e.g. "fusion isn't happening here") that would be jolly helpful.
Shouldn't be fusion. Is this a straight IO function? Something to do with buffering/encoding?

On Wednesday 08 September 2010 23:55:35, Don Stewart wrote:
simonpj:
| > ghc-6.12.3:
| >      89,330,672 bytes allocated in the heap
| >          15,092 bytes copied during GC
| >          35,980 bytes maximum residency (1 sample(s))
| >          29,556 bytes maximum slop
| >               2 MB total memory in use (0 MB lost due to fragmentation)
| >
| > ghc-6.13.20100831:
| >     475,305,720 bytes allocated in the heap
| >          89,272 bytes copied during GC
| >          68,860 bytes maximum residency (1 sample(s))
| >          29,444 bytes maximum slop
| >               2 MB total memory in use (0 MB lost due to fragmentation)
|
| Can you put your benchmark code somewhere? Likely a GHC
| regression.
Indeed bad. If someone could characterise the regression more precisely (e.g. "fusion isn't happening here") that would be jolly helpful.
Shouldn't be fusion. Is this a straight IO function? Something to do with buffering/encoding?
Maybe the following observation helps:

ghc-6.13.20100831 reads lazy ByteStrings in chunks of 8192 bytes.

If I understand correctly, that means (since defaultChunkSize = 32760):

- bytestring allocates a 32K buffer to be filled and asks ghc for 32760 bytes in that buffer
- ghc asks the OS for 8192 bytes (and usually gets them)
- upon receiving fewer bytes than requested, bytestring copies them to a new smaller buffer
- since the number of bytes received is a multiple of ghc's allocation block size (which I believe is 4K), there's no space for the bookkeeping overhead, hence the new buffer takes up 12K instead of 8, resulting in 44K allocation for 8K bytes

That factor of 5.5 corresponds pretty well with the allocation figures above, and the extra copying explains the approximate doubling of I/O time.

Trying to find out why ghc asks the OS for only 8192 bytes instead of 32760 hasn't brought enlightenment yet.

Cheers,
Daniel

Excerpt of strace log:

read(3, "%!PS-Adobe-2.0\n%%Title: nbench\n%"..., 8192) = 8192
open("/usr/lib/gconv/UTF-32.so", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\4\0\0004\0\0\0"..., 512) = 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=9672, ...}) = 0
mmap2(NULL, 12328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0xb7852000
fadvise64(4, 0, 12328, POSIX_FADV_WILLNEED) = 0
mmap2(0xb7854000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1) = 0xb7854000
close(4) = 0
mprotect(0xb7854000, 4096, PROT_READ) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbff37cb0) = -1 ENOTTY (Inappropriate ioctl for device)
select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0})
write(1, "", 0) = 0
select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0})
write(1, "%!PS-Adobe-2.0\n%%Title: nbench\n%"..., 8192) = 8192
read(3, " 20.000000 lineto\n121.153524 20."..., 8192) = 8192
select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0})
write(1, "", 0) = 0
select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0})
write(1, " 20.000000 lineto\n121.153524 20."..., 8192) = 8192
read(3, "30.542315 21.394403 lineto\n125.3"..., 8192) = 8192
select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0})
write(1, "", 0) = 0
select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0})
write(1, "30.542315 21.394403 lineto\n125.3"..., 8192) = 8192
read(3, "neto\n308.929337 21.969871 lineto"..., 8192) = 8192
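The arithmetic behind that factor of 5.5 can be made concrete. A minimal sketch, assuming Daniel's figures (a 4K allocation block and a 12K copy buffer) rather than values measured from the RTS:

-- Back-of-the-envelope check of the per-chunk allocation factor
-- described above, using the assumed figures from the discussion.
main :: IO ()
main = do
    let requested, received, copyBuffer, perChunk :: Int
        requested  = 32 * 1024   -- buffer bytestring allocates up front (32K)
        received   = 8192        -- bytes each OS read actually returns
        copyBuffer = 12 * 1024   -- 8K of data plus header rounds up to 12K
        perChunk   = requested + copyBuffer   -- 44K allocated per 8K read
        factor     = fromIntegral perChunk / fromIntegral received :: Double
    putStrLn ("bytes allocated per 8K of input: " ++ show perChunk)
    putStrLn ("overhead factor: " ++ show factor)   -- prints 5.5

A factor of 5.5 on 89 MB of useful allocation lands close to the 475 MB figure reported for HEAD above.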

On Thursday 09 September 2010 01:28:04, Daniel Fischer wrote:
Maybe the following observation helps:
ghc-6.13.20100831 reads lazy ByteStrings in chunks of 8192 bytes.
If I understand correctly, that means (since defaultChunkSize = 32760):
- bytestring allocates a 32K buffer to be filled and asks ghc for 32760 bytes in that buffer
- ghc asks the OS for 8192 bytes (and usually gets them)
- upon receiving fewer bytes than requested, bytestring copies them to a new smaller buffer
- since the number of bytes received is a multiple of ghc's allocation block size (which I believe is 4K), there's no space for the bookkeeping overhead, hence the new buffer takes up 12K instead of 8, resulting in 44K allocation for 8K bytes
That factor of 5.5 corresponds pretty well with the allocation figures above,
That seems to be correct, but probably not the whole story. I've played with defaultChunkSize. Setting it to (64K - overhead), ghc still reads in 8192-byte chunks and the allocation figures are nearly double those for (32K - overhead). Setting it to (8K - overhead), ghc reads in 8184-byte chunks and the allocation figures go down to approximately 1.4 times those of 6.12.3. Can a factor of 1.4 be explained by the smaller chunk size, or is something else going on?
and the extra copying explains the approximate doubling of I/O time.
Apparently not. With the small chunk size, which should avoid the copying, the I/O didn't get faster.
Trying to find out why ghc asks the OS for only 8192 bytes instead of 32760 hasn't brought enlightenment yet.
No progress on that front.
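For reference when repeating that experiment: defaultChunkSize lives in Data.ByteString.Lazy.Internal and is defined along these lines (a sketch from memory of the bytestring package; check the installed source for the authoritative version):

import Foreign.Storable (sizeOf)

-- Chunk size bytestring requests, minus bookkeeping overhead, so that a
-- chunk plus its header fits the allocator's size classes exactly.
defaultChunkSize :: Int
defaultChunkSize = 32 * k - chunkOverhead
  where k = 1024

-- Space reserved for the GHC memory manager's per-object overhead.
chunkOverhead :: Int
chunkOverhead = 2 * sizeOf (undefined :: Int)

main :: IO ()
main = print defaultChunkSize   -- 32760 on a 32-bit machine

Daniel's (64K - overhead) and (8K - overhead) variants amount to replacing the 32 above with 64 or 8.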

Hello Daniel,

Thursday, September 9, 2010, 3:28:04 AM, you wrote:
- bytestring allocates a 32K buffer to be filled and asks ghc for 32760 bytes in that buffer - ghc asks the OS for 8192 bytes (and usually gets them)
By the way, we did some benchmarking which showed that the most efficient read/write chunk size on Windows Vista is 32-64 KB.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com

On 09/09/2010 00:28, Daniel Fischer wrote:
On Wednesday 08 September 2010 23:55:35, Don Stewart wrote:
simonpj:
|> ghc-6.12.3:
|>      89,330,672 bytes allocated in the heap
|>          15,092 bytes copied during GC
|>          35,980 bytes maximum residency (1 sample(s))
|>          29,556 bytes maximum slop
|>               2 MB total memory in use (0 MB lost due to fragmentation)
|>
|> ghc-6.13.20100831:
|>     475,305,720 bytes allocated in the heap
|>          89,272 bytes copied during GC
|>          68,860 bytes maximum residency (1 sample(s))
|>          29,444 bytes maximum slop
|>               2 MB total memory in use (0 MB lost due to fragmentation)
|
| Can you put your benchmark code somewhere? Likely a GHC
| regression.
Indeed bad. If someone could characterise the regression more precisely (e.g. "fusion isn't happening here") that would be jolly helpful.
Shouldn't be fusion. Is this a straight IO function? Something to do with buffering/encoding?
Maybe the following observation helps:
ghc-6.13.20100831 reads lazy ByteStrings in chunks of 8192 bytes.
If I understand correctly, that means (since defaultChunkSize = 32760):
- bytestring allocates a 32K buffer to be filled and asks ghc for 32760 bytes in that buffer
- ghc asks the OS for 8192 bytes (and usually gets them)
- upon receiving fewer bytes than requested, bytestring copies them to a new smaller buffer
- since the number of bytes received is a multiple of ghc's allocation block size (which I believe is 4K), there's no space for the bookkeeping overhead, hence the new buffer takes up 12K instead of 8, resulting in 44K allocation for 8K bytes
That factor of 5.5 corresponds pretty well with the allocation figures above, and the extra copying explains the approximate doubling of I/O time.
Trying to find out why ghc asks the OS for only 8192 bytes instead of 32760 hasn't brought enlightenment yet.
I think I've found the problem, GHC.IO.Handle.Text:

bufReadNBEmpty :: Handle__ -> Buffer Word8 -> Ptr Word8 -> Int -> Int -> IO Int
bufReadNBEmpty h_@Handle__{..}
               buf@Buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
               ptr so_far count
  | count > sz, False,
    Just fd <- cast haDevice = do
       m <- RawIO.readNonBlocking (fd::FD) ptr count
       case m of
         Nothing -> return so_far
         Just n  -> return (so_far + n)

See if you can spot it. I must have made this change for debugging, and forgot to revert it. d'oh!

Cheers,
Simon
Cheers, Daniel
Excerpt of strace log:
read(3, "%!PS-Adobe-2.0\n%%Title: nbench\n%"..., 8192) = 8192 open("/usr/lib/gconv/UTF-32.so", O_RDONLY) = 4 read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\4\0\0004\0\0\0"..., 512) = 512 fstat64(4, {st_mode=S_IFREG|0755, st_size=9672, ...}) = 0 mmap2(NULL, 12328, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0xb7852000 fadvise64(4, 0, 12328, POSIX_FADV_WILLNEED) = 0 mmap2(0xb7854000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED| MAP_DENYWRITE, 4, 0x1) = 0xb7854000 close(4) = 0 mprotect(0xb7854000, 4096, PROT_READ) = 0 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbff37cb0) = -1 ENOTTY (Inappropriate ioctl for device) select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0}) write(1, "", 0) = 0 select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0}) write(1, "%!PS-Adobe-2.0\n%%Title: nbench\n%"..., 8192) = 8192 read(3, " 20.000000 lineto\n121.153524 20."..., 8192) = 8192 select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0}) write(1, "", 0) = 0 select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0}) write(1, " 20.000000 lineto\n121.153524 20."..., 8192) = 8192 read(3, "30.542315 21.394403 lineto\n125.3"..., 8192) = 8192 select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0}) write(1, "", 0) = 0 select(2, [], [1], NULL, {0, 0}) = 1 (out [1], left {0, 0}) write(1, "30.542315 21.394403 lineto\n125.3"..., 8192) = 8192 read(3, "neto\n308.929337 21.969871 lineto"..., 8192) = 8192 _______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

On Thursday 09 September 2010 13:19:23, Simon Marlow wrote:
I think I've found the problem, GHC.IO.Handle.Text:
bufReadNBEmpty :: Handle__ -> Buffer Word8 -> Ptr Word8 -> Int -> Int -> IO Int
bufReadNBEmpty h_@Handle__{..}
               buf@Buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
               ptr so_far count
  | count > sz, False,
    Just fd <- cast haDevice = do
       m <- RawIO.readNonBlocking (fd::FD) ptr count
       case m of
         Nothing -> return so_far
         Just n  -> return (so_far + n)
See if you can spot it.
Yes, that's it. Removing the literal False to make that branch reachable more or less reinstates old behaviour.

For I/O of (lazy) ByteStrings only, the allocation figures of HEAD are consistently slightly higher than those of 6.12.3, but the difference is less than 1%, well within the normal fluctuation due to changes in implementation. Timings seem to be identical.

When performing work on the ByteStrings (finding/replacing substrings), however, things change a bit. The allocation figures observed so far range from almost identical (< 1% difference) to above 15% higher (90,146,508 bytes allocated vs. 106,237,456); most of the differences I observed so far are between 5% and 10%. The wall clock time (elapsed, per +RTS -s or time) seems to be identical (very stable timings for multiple runs of the same benchmark), but the MUT times reported by +RTS -s differ for some benchmarks (between 10% less for HEAD and 20% more observed, but identical for most).

That might be worthy of examination, though it's not alarming.

Cheers,
Daniel

On 09/09/2010 15:08, Daniel Fischer wrote:
On Thursday 09 September 2010 13:19:23, Simon Marlow wrote:
I think I've found the problem, GHC.IO.Handle.Text:
bufReadNBEmpty :: Handle__ -> Buffer Word8 -> Ptr Word8 -> Int -> Int -> IO Int
bufReadNBEmpty h_@Handle__{..}
               buf@Buffer{ bufRaw=raw, bufR=w, bufL=r, bufSize=sz }
               ptr so_far count
  | count > sz, False,
    Just fd <- cast haDevice = do
       m <- RawIO.readNonBlocking (fd::FD) ptr count
       case m of
         Nothing -> return so_far
         Just n  -> return (so_far + n)
See if you can spot it.
Yes, that's it. Removing the literal False to make that branch reachable more or less reinstates old behaviour.
For I/O of (lazy) ByteStrings only, the allocation figures of HEAD are consistently slightly higher than those of 6.12.3, but the difference is less than 1%, well within the normal fluctuation due to changes in implementation. Timings seem to be identical.
When performing work on the ByteStrings (finding/replacing substrings), however, things change a bit. The allocation figures observed so far range from almost identical (< 1% difference) to above 15% higher (90,146,508 bytes allocated vs. 106,237,456), most of the differences I observed so far are between 5% and 10%. The wall clock time (elapsed, per +RTS -s or time) seems to be identical (very stable timings for multiple runs of the same benchmark), but the MUT times reported by +RTS -s differ for some benchmarks (between 10% less for HEAD and 20% more observed, but identical for most).
That might be worthy of examination, though it's not alarming.
If you can find an example that consistently allocates significantly more than with 6.12 (10%+), then I'd say it would be worth looking into.

Cheers,
Simon
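For anyone hunting for such an example, the allocation and MUT figures quoted in this thread come from GHC's RTS statistics. A minimal sketch of gathering them (Bench.hs and bigfile are placeholder names, and newer GHCs may require compiling with -rtsopts before +RTS flags are accepted):

$ ghc -O2 Bench.hs -o bench
$ ./bench bigfile +RTS -s > /dev/null

The statistics go to stderr, so they remain visible despite the stdout redirect; comparing the "bytes allocated in the heap" and MUT lines between compilers gives figures like those discussed above.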
participants (5)

- Bulat Ziganshin
- Daniel Fischer
- Don Stewart
- Simon Marlow
- Simon Peyton-Jones