First call to putStr takes far more time than subsequent calls

This is a simplified version of an issue I'm running into when timing some functions. In this example, the first putStr inside the replicateM_ takes far longer than subsequent putStrs. If I add in a putStr at the top (the commented line), each iteration takes the same. (Though I presume I just shifted the long execution to the putStr on top). The same thing happens on the first iterations of a function when I need to evaluate some state, or deepseq some value. putStr was just a simpler version that still showed the same issue. Relevant code: http://lpaste.net/136250 What's going on here? Thanks! import Control.Monadimport Data.Time.Clock main :: IO ()main = do -- putStr "" replicateM_ 5 $ do t1 <- getCurrentTime putStr "" t2 <- getCurrentTime print . round $ 1000 * 1000 * diffUTCTime t2 t1 {- Sample output: 49 1 1 1 2 -}

On Thu, Jul 9, 2015 at 4:17 PM, Ben Gunton
In this example, the first putStr inside the replicateM_ takes far longer than subsequent putStrs. If I add in a putStr at the top (the commented line), each iteration takes the same. (Though I presume I just shifted the long execution to the putStr on top).
If I had to guess, delayed setup of the Handle associated with stdout. This will include setting up buffering (including querying the OS to find out whether output is a tty-like device or not, so as to enable or disable block buffering; note that this may be especially expensive on Windows, where there is IIRC no simple way to determine whether running in e.g. a cmd.exe window or not, as the OS has no concept of pseudo-ttys); also, setting up output encoding and such. Probably also other details. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

The thing is, if I switch the "putStr """ in the replicateM to "evaluate $
force t1", the first iteration still takes way longer. Are there things to
setup even with that? Is there a way to pre-setup all the things that need
to be setup?
This is also on Linux, with GHC 7.10.1, and no optimizations.
On Thu, Jul 9, 2015 at 2:24 PM, Brandon Allbery
On Thu, Jul 9, 2015 at 4:17 PM, Ben Gunton
wrote: In this example, the first putStr inside the replicateM_ takes far longer than subsequent putStrs. If I add in a putStr at the top (the commented line), each iteration takes the same. (Though I presume I just shifted the long execution to the putStr on top).
If I had to guess, delayed setup of the Handle associated with stdout. This will include setting up buffering (including querying the OS to find out whether output is a tty-like device or not, so as to enable or disable block buffering; note that this may be especially expensive on Windows, where there is IIRC no simple way to determine whether running in e.g. a cmd.exe window or not, as the OS has no concept of pseudo-ttys); also, setting up output encoding and such. Probably also other details.
-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

Here is the same thing with no console usage, just forcing the time value.
The first-iteration-delay seems to happen at the first IO-ish thing being
evaluated.
http://lpaste.net/136254
On Thu, Jul 9, 2015 at 2:26 PM, Ben Gunton
The thing is, if I switch the "putStr """ in the replicateM to "evaluate $ force t1", the first iteration still takes way longer. Are there things to setup even with that? Is there a way to pre-setup all the things that need to be setup?
This is also on Linux, with GHC 7.10.1, and no optimizations.
On Thu, Jul 9, 2015 at 2:24 PM, Brandon Allbery
wrote: On Thu, Jul 9, 2015 at 4:17 PM, Ben Gunton
wrote: In this example, the first putStr inside the replicateM_ takes far longer than subsequent putStrs. If I add in a putStr at the top (the commented line), each iteration takes the same. (Though I presume I just shifted the long execution to the putStr on top).
If I had to guess, delayed setup of the Handle associated with stdout. This will include setting up buffering (including querying the OS to find out whether output is a tty-like device or not, so as to enable or disable block buffering; note that this may be especially expensive on Windows, where there is IIRC no simple way to determine whether running in e.g. a cmd.exe window or not, as the OS has no concept of pseudo-ttys); also, setting up output encoding and such. Probably also other details.
-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

This is a nice puzzle. However, I don't think it's related to IO. So far I've traced this to Data.Fixed methods. If you add evaluate $ floor (2/1 :: Fixed E12) before the loop, you'll find that the delay decreases significantly (although doesn't vanish entirely). How is this related to forcing a time value? Well, UTCTime represents the time of day as Fixed E12, and when you're evaluating the result of getCurrentTime, you're calling those methods, (/) and floor. Using other methods of Fixed E12 before the loop, or replacing E12 with E9, kills or at least hurts the effect. There might be some thunks referenced directly or indirectly by those methods, possibly involving specialization, but I don't see it so far. I'll let someone else to take it from here. On 10/07/15 00:07, Ben Gunton wrote:
Here is the same thing with no console usage, just forcing the time value. The first-iteration-delay seems to happen at the first IO-ish thing being evaluated.

When a module is imported, or used for the first time, are there things
that are not fully evaluated?
I see the same thing you do with the Data.Fixed methods (nice work by the
way!), but I also see similar (albeit less drastic) behaviour with other
datatypes, including some custom ones. It just always seems like the first
call to a modules method is always slower than subsequent calls. But I
can't for the life of me figure out why...
On Thu, Jul 9, 2015 at 5:45 PM, Roman Cheplyaka
This is a nice puzzle. However, I don't think it's related to IO.
So far I've traced this to Data.Fixed methods. If you add
evaluate $ floor (2/1 :: Fixed E12)
before the loop, you'll find that the delay decreases significantly (although doesn't vanish entirely).
How is this related to forcing a time value? Well, UTCTime represents the time of day as Fixed E12, and when you're evaluating the result of getCurrentTime, you're calling those methods, (/) and floor.
Using other methods of Fixed E12 before the loop, or replacing E12 with E9, kills or at least hurts the effect.
There might be some thunks referenced directly or indirectly by those methods, possibly involving specialization, but I don't see it so far.
I'll let someone else to take it from here.
On 10/07/15 00:07, Ben Gunton wrote:
Here is the same thing with no console usage, just forcing the time value. The first-iteration-delay seems to happen at the first IO-ish thing being evaluated.

On 10/07/15 04:34, Ben Gunton wrote:
When a module is imported, or used for the first time, are there things that are not fully evaluated?
Evaluation isn't tied to modules in any way. Modules are namespaces; they are strictly compile-time entities. Imagine that in your module you define at the top level fac10 :: Integer fac10 = product [1..10] This is known as CAF, a constant applicative form. If you never demand the value of fac10 in your program, it will never get evaluated. If you do demand the value of fac10, it will be evaluated the first time, and then fac10 will be modified to point to the actual value instead of a thunk, so all the subsequent calls will be much faster. Now, I suspect that there may be something like fac10 in Data.Fixed or perhaps one of its dependencies. It doesn't even have to be defined in the code: during optimization, a lot of things become CAFs when ghc finds something that looks like a constant and floats it out to the top level. Roman

If you look at the Fixed data type, it is implemented using the Integer type, which is again implemented using libgmp on my system. The first call to any function involving an Integer type thus needs to initialize libgmp first. Using ltrace it seems like libgmp first has to determine the processor it is running on and maybe set up other bookkeeping information. It could also be the case that libgmp itself has to be loaded at this point. Using strace it looks like it is mmapped into the memory, maybe linux is lazy and only reads the file on the disk at the moment it is accessed the first time (i.e. at the first call)? At least linux uses a similar approach for requested memory by a program. On 07/10/2015 01:45 AM, Roman Cheplyaka wrote:
This is a nice puzzle. However, I don't think it's related to IO.
So far I've traced this to Data.Fixed methods. If you add
evaluate $ floor (2/1 :: Fixed E12)
before the loop, you'll find that the delay decreases significantly (although doesn't vanish entirely).
How is this related to forcing a time value? Well, UTCTime represents the time of day as Fixed E12, and when you're evaluating the result of getCurrentTime, you're calling those methods, (/) and floor.
Using other methods of Fixed E12 before the loop, or replacing E12 with E9, kills or at least hurts the effect.
There might be some thunks referenced directly or indirectly by those methods, possibly involving specialization, but I don't see it so far.
I'll let someone else to take it from here.
On 10/07/15 00:07, Ben Gunton wrote:
Here is the same thing with no console usage, just forcing the time value. The first-iteration-delay seems to happen at the first IO-ish thing being evaluated.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

I considered that, but it doesn't seem to hold up. If you replace calling the Fixed functions with some operations on Integers, you won't reproduce the effect. On 10/07/15 10:17, Jonas Scholl wrote:
If you look at the Fixed data type, it is implemented using the Integer type, which is again implemented using libgmp on my system. The first call to any function involving an Integer type thus needs to initialize libgmp first. Using ltrace it seems like libgmp first has to determine the processor it is running on and maybe set up other bookkeeping information.
It could also be the case that libgmp itself has to be loaded at this point. Using strace it looks like it is mmapped into the memory, maybe linux is lazy and only reads the file on the disk at the moment it is accessed the first time (i.e. at the first call)? At least linux uses a similar approach for requested memory by a program.
On 07/10/2015 01:45 AM, Roman Cheplyaka wrote:
This is a nice puzzle. However, I don't think it's related to IO.
So far I've traced this to Data.Fixed methods. If you add
evaluate $ floor (2/1 :: Fixed E12)
before the loop, you'll find that the delay decreases significantly (although doesn't vanish entirely).
How is this related to forcing a time value? Well, UTCTime represents the time of day as Fixed E12, and when you're evaluating the result of getCurrentTime, you're calling those methods, (/) and floor.
Using other methods of Fixed E12 before the loop, or replacing E12 with E9, kills or at least hurts the effect.
There might be some thunks referenced directly or indirectly by those methods, possibly involving specialization, but I don't see it so far.
I'll let someone else to take it from here.
On 10/07/15 00:07, Ben Gunton wrote:
Here is the same thing with no console usage, just forcing the time value. The first-iteration-delay seems to happen at the first IO-ish thing being evaluated.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

the C allocation functions (malloc etc) spend significantly more time in
the first invocation. That may be related with this behaviour in libgmp, IO
etc
2015-07-10 9:31 GMT+02:00 Roman Cheplyaka
I considered that, but it doesn't seem to hold up. If you replace calling the Fixed functions with some operations on Integers, you won't reproduce the effect.
On 10/07/15 10:17, Jonas Scholl wrote:
If you look at the Fixed data type, it is implemented using the Integer type, which is again implemented using libgmp on my system. The first call to any function involving an Integer type thus needs to initialize libgmp first. Using ltrace it seems like libgmp first has to determine the processor it is running on and maybe set up other bookkeeping information.
It could also be the case that libgmp itself has to be loaded at this point. Using strace it looks like it is mmapped into the memory, maybe linux is lazy and only reads the file on the disk at the moment it is accessed the first time (i.e. at the first call)? At least linux uses a similar approach for requested memory by a program.
On 07/10/2015 01:45 AM, Roman Cheplyaka wrote:
This is a nice puzzle. However, I don't think it's related to IO.
So far I've traced this to Data.Fixed methods. If you add
evaluate $ floor (2/1 :: Fixed E12)
before the loop, you'll find that the delay decreases significantly (although doesn't vanish entirely).
How is this related to forcing a time value? Well, UTCTime represents the time of day as Fixed E12, and when you're evaluating the result of getCurrentTime, you're calling those methods, (/) and floor.
Using other methods of Fixed E12 before the loop, or replacing E12 with E9, kills or at least hurts the effect.
There might be some thunks referenced directly or indirectly by those methods, possibly involving specialization, but I don't see it so far.
I'll let someone else to take it from here.
On 10/07/15 00:07, Ben Gunton wrote:
Here is the same thing with no console usage, just forcing the time value. The first-iteration-delay seems to happen at the first IO-ish thing being evaluated.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.

On Fri, Jul 10, 2015 at 10:17 AM, Jonas Scholl
It could also be the case that libgmp itself has to be loaded at this point. Using strace it looks like it is mmapped into the memory, maybe linux is lazy and only reads the file on the disk at the moment it is accessed the first time (i.e. at the first call)? At least linux uses a similar approach for requested memory by a program.
I can confirm Linux is lazy in loading program's code. It maps it in memory space, but doesn't load yet. That's also true for mmap.
participants (6)
-
Alberto G. Corona
-
Alexey Shmalko
-
Ben Gunton
-
Brandon Allbery
-
Jonas Scholl
-
Roman Cheplyaka