Need help - my haskell code is over 50 times slower than equivalent perl implementation

Dear cafe, I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length. I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect. Regards, Kashyap

use mapM
you've got
mapMI :: (a -> IO b) -> [a] -> IO [b]
mapMI _ [] = return []
mapMI f (x:xs) = do y <- SIU.unsafeInterleaveIO (f x) ; ys <-
SIU.unsafeInterleaveIO (mapMI f xs) ; return (y:ys)
in your code, and thats going to make you sad
On Tue, Jun 24, 2014 at 12:08 PM, C K Kashyap
Dear cafe,
I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro
Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.
I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect.
Regards, Kashyap
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Thanks Bryan,
I used unsafeInterleaveIO after I ran into "too many open file handles"
error.
Again, I doubt about String since even if I change the number of files to
12000 - the perl program finishes in less than a second.
Regards,
Kashyap
On Tue, Jun 24, 2014 at 9:49 PM, Bryan O'Sullivan
On Tue, Jun 24, 2014 at 9:08 AM, C K Kashyap
wrote: Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.
Don't use String, and don't use unsafeInterleaveIO unless you know what you're doing.

On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap
I used unsafeInterleaveIO after I ran into "too many open file handles" error.
That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

That is what is particularly frustrating...its a tiny program and somewhat
trivial thing that I am trying to achieve.
If I use mapM as suggested by others, I quickly run into -
openFile: resource exhausted (Too many open files)
Regards,
Kashyap
Regards,
Kashyap
On Tue, Jun 24, 2014 at 9:59 PM, Brandon Allbery
On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap
wrote: I used unsafeInterleaveIO after I ran into "too many open file handles" error.
That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up.
-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

It's one of those quirks arising from historical decisions that makes
Haskell's non-strict idioms pathological where resources are limited. (A
very real and common case.) You need very strict, possibly explicit, file
handle management here.
Look to APIs that deal with this problem in a way that suits your situation
better: System.IO.Strict may be an option.
Cheers,
Darren
On Jun 24, 2014 9:33 AM, "C K Kashyap"
That is what is particularly frustrating...its a tiny program and somewhat trivial thing that I am trying to achieve.
If I use mapM as suggested by others, I quickly run into -
openFile: resource exhausted (Too many open files)
Regards,
Kashyap
Regards, Kashyap
On Tue, Jun 24, 2014 at 9:59 PM, Brandon Allbery
wrote: On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap
wrote: I used unsafeInterleaveIO after I ran into "too many open file handles" error.
That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up.
-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, 2014-06-24 at 22:02 +0530, C K Kashyap wrote:
If I use mapM as suggested by others, I quickly run into -
openFile: resource exhausted (Too many open files)
You should `seq` the calculated length before returning the value, otherwise the file descriptor needs to be kept open until the processing result is used (when printing the result list), and you exhaust your resources. Anyway, for giggles I added concurrency support using `async` and IO handling using `pipes`/`pipes-bytestring` to my fork in https://github.com/NicolasT/haskell-perf-repro/commit/86639daa3b577487e75d38... Using the sample dataset created by your Makefile, this version is somewhat slower than the previous (and your Perl script), but in case the dataset is huge (and depending on the block-device you're using, the layout of the files on it,... the lot) it might help (or might not). Nicolas

Thanks Nicholas :) ... and if I understand right doing `seq` would be
equivalent to Strict IO that Darren mentioned. right?
Regards,
Kashyap
On Tue, Jun 24, 2014 at 10:35 PM, Nicolas Trangez
On Tue, 2014-06-24 at 22:02 +0530, C K Kashyap wrote:
If I use mapM as suggested by others, I quickly run into -
openFile: resource exhausted (Too many open files)
You should `seq` the calculated length before returning the value, otherwise the file descriptor needs to be kept open until the processing result is used (when printing the result list), and you exhaust your resources.
Anyway, for giggles I added concurrency support using `async` and IO handling using `pipes`/`pipes-bytestring` to my fork in
https://github.com/NicolasT/haskell-perf-repro/commit/86639daa3b577487e75d38... Using the sample dataset created by your Makefile, this version is somewhat slower than the previous (and your Perl script), but in case the dataset is huge (and depending on the block-device you're using, the layout of the files on it,... the lot) it might help (or might not).
Nicolas

On Tue, Jun 24, 2014 at 10:37:35PM +0530, C K Kashyap wrote:
Thanks Nicholas :) ... and if I understand right doing `seq` would be equivalent to Strict IO that Darren mentioned. right?
It's probably best to think of them as distinct. Strict IO roughly means the whole file will be read at once so the file handle can be released. On the other hand `seq`ing the `length` forces the calculation of the length so the file is read completely and the file handle can be released. They have similar outcomes but the means of achieving them is different. Tom

Thank you Tom, Niklas, Nicolas and Stuart, Calculating the length was just an illustration - what I really need to do is parse the file. Anyway, it is clear that String for IO is a bad idea - I wonder if examples of String based readFile etc for beginners is a good idea or not. Regards, Kashyap On Tue, Jun 24, 2014 at 10:44 PM, Tom Ellis < tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:
On Tue, Jun 24, 2014 at 10:37:35PM +0530, C K Kashyap wrote:
Thanks Nicholas :) ... and if I understand right doing `seq` would be equivalent to Strict IO that Darren mentioned. right?
It's probably best to think of them as distinct.
Strict IO roughly means the whole file will be read at once so the file handle can be released. On the other hand `seq`ing the `length` forces the calculation of the length so the file is read completely and the file handle can be released. They have similar outcomes but the means of achieving them is different.
Tom _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi Kashyap, I'm responding only now because I just saw this thread on reddit. You wrote:
it is clear that String for IO is a bad idea - I wonder if examples of String based readFile etc for beginners is a good idea
It's not always a bad idea. Especially for beginners. Conceptually, String is the simplest, and often the performance isn't so bad. For serious applications, though, you should use the text library. One thing you should never use for text is ByteString - that is for binary data, not text. Lazy IO is also OK in simple cases. But as you have seen, for real applications you need to be aware of its limitations. When there is interleaving of IO operations between multiple resources, the simplicity of lazy IO disappears very quickly. In those cases, you should consider alternative IO libraries, such as conduit and pipes. Regards, Yitz

On Tue, Jun 24, 2014 at 12:29:38PM -0400, Brandon Allbery wrote:
On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap
wrote: I used unsafeInterleaveIO after I ran into "too many open file handles" error.
That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up.
Yes, it's doing lazy IO.

On Tue, Jun 24, 2014 at 09:55:21PM +0530, C K Kashyap wrote:
I used unsafeInterleaveIO after I ran into "too many open file handles" error. Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.
This pull request switching to ByteString brings performance in line with the Perl on my system: https://github.com/ckkashyap/haskell-perf-repro/pull/1 Tom

Indeed it works. Thank you so much Tom. Sorry about the false alarm. Regards, Kashyap On Tue, Jun 24, 2014 at 10:02 PM, Tom Ellis < tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:
On Tue, Jun 24, 2014 at 09:55:21PM +0530, C K Kashyap wrote:
I used unsafeInterleaveIO after I ran into "too many open file handles" error. Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.
This pull request switching to ByteString brings performance in line with the Perl on my system:
https://github.com/ckkashyap/haskell-perf-repro/pull/1
Tom _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, Jun 24, 2014 at 10:07:32PM +0530, C K Kashyap wrote:
Indeed it works. Thank you so much Tom. Sorry about the false alarm.
Interestingly with `mapMI` it is twice as fast as with `mapM`. I haven't dug futher into that but at least you're getting reasonable speeds yet so you can decide how important that is to you. Tom

Does mapM work for you? Don't you run into too many files open error? Regards, Kashyap On Tue, Jun 24, 2014 at 10:11 PM, Tom Ellis < tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:
On Tue, Jun 24, 2014 at 10:07:32PM +0530, C K Kashyap wrote:
Indeed it works. Thank you so much Tom. Sorry about the false alarm.
Interestingly with `mapMI` it is twice as fast as with `mapM`. I haven't dug futher into that but at least you're getting reasonable speeds yet so you can decide how important that is to you.
Tom _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, Jun 24, 2014 at 10:16:03PM +0530, C K Kashyap wrote:
Does mapM work for you? Don't you run into too many files open error?
It works. No error. But is slower that your mapMI. Undoubtedly there is a method that is better than `unsafeInterleaveIO` and `mapMI` in terms of both performance and conceptual coherence.

On Tue, Jun 24, 2014 at 05:53:29PM +0100, Tom Ellis wrote:
On Tue, Jun 24, 2014 at 10:16:03PM +0530, C K Kashyap wrote:
Does mapM work for you? Don't you run into too many files open error?
It works. No error. [...]
Oh by the way, you should compile with `-O2` so that strictness analysis kicks in. Nicolas Trangez approaches this issue with an explicit `seq` instead. Tom

Might it be more efficient to bypass counting altogether, and just use the Haskell version of stat? getFileSize :: FilePath -> IO Int64 getFileSize path = do stat <- getFileStatus path let COff fs = fileSize stat return fs Peace, Stu

On Tue, 2014-06-24 at 12:08 -0500, Stuart A. Kurtz wrote:
Might it be more efficient to bypass counting altogether, and just use the Haskell version of stat?
getFileSize :: FilePath -> IO Int64 getFileSize path = do stat <- getFileStatus path let COff fs = fileSize stat return fs
Obviously. I assumed the original author would do something else while processing the file, not just count bytes... Nicolas

On Tue, 24 Jun 2014 21:55:21 +0530, C K Kashyap
Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.
There are just so many things wrong with String and the String API, these are all colliding: 1. String is based on Char, which is a horribly inefficient representation of characters (a tagged machine word, possibly 64 bit, representing the UCS code point) 2. String is based on [], which are lazy linked lists - For each single character, you have a heap-allocated pointed to a heap-allocated Char and the heap-allocated rest of the list; traversing the String to compute its length involves going through the entire heap, dereferencing after every single step, and pattern matching on the (:) constructors (which also requires a branch), garbage collecting the unused intermediate structures along the way. 3. readFile is based on lazy IO, which means it opens the Handle and leaves it dangling, then returns some values that, when forced, will dynamically cause it to read some more bytes from the disk, allocate heap objects for them, write them in there, and leave the Handle dangling again. Only when you reach the end of the file does the Handle eventually automatically get garbage collected and the file therefore closed. (In theory; no practical guarantees on when it actually gets closed) 4. All the while, your program is churning through the files like mad, creating more and more dangling pointers that are just precariously hanging around all over the place. Contrast this to what the simple change to B.readFile and B.length does: ByteString is basically a pointer to some array of bytes in memory, together with its length. all B.readFile has to do is read in the bytes 1:1 without allocating anything other than a single fixed sized buffer to hold them all, the size of which is given by the file size. All B.length has to do is return the already-known size of the buffer. After the handle is opened and the contents read in, it's immediately closed. No garbage collection necessary other than eventually freeing up the buffer used to store its contents - a single operation. (In addition to the pointer itself, which is stored in the heap) As you can see, there is a complete world of difference between the two approaches on many levels - runtimes differing by one or even many orders of magnitude are no surprise. Some would go as far as to say that the String based file I/O APIs should be removed completely. Either way, you could make your operation even more efficient by not opening the files at all and just consulting their length through filesystem APIs.

On Tue, 2014-06-24 at 21:38 +0530, C K Kashyap wrote:
Dear cafe,
I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro
Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.
I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect.
That's simply because your Haskell program is doing something completely different than your Perl code. Using `String` is the biggest offense. You're creating huge lists of `Char`s instead of working on arrays of bytes, like your Perl code does. I made some trivial changes to bring the Haskell code on par with your Perl code speed-wise at https://github.com/NicolasT/haskell-perf-repro/compare/ckkashyap:master...ma... Check it out! Nicolas

Thanks Nicholas ... you've gotten rid of unsafeInterleaveIO ...
It feels much better :)
Regards,
Kashyap
On Tue, Jun 24, 2014 at 10:08 PM, Nicolas Trangez
On Tue, 2014-06-24 at 21:38 +0530, C K Kashyap wrote:
Dear cafe,
I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro
Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.
I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect.
That's simply because your Haskell program is doing something completely different than your Perl code.
Using `String` is the biggest offense. You're creating huge lists of `Char`s instead of working on arrays of bytes, like your Perl code does.
I made some trivial changes to bring the Haskell code on par with your Perl code speed-wise at
https://github.com/NicolasT/haskell-perf-repro/compare/ckkashyap:master...ma... Check it out!
Nicolas
participants (10)
-
Brandon Allbery
-
Bryan O'Sullivan
-
C K Kashyap
-
Carter Schonwald
-
Darren Grant
-
Nicolas Trangez
-
Niklas Haas
-
Stuart A. Kurtz
-
Tom Ellis
-
Yitzchak Gale