Need help - my haskell code is over 50 times slower than equivalent perl implementation - Haskell-Cafe

newer
Haskell Wiki user account creation

Need help - my haskell code is over 50 times slower than equivalent perl implementation

C K Kashyap

24 Jun 2014 24 Jun '14

4:08 p.m.

Dear cafe, I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length. I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect. Regards, Kashyap

Attachments:

attachment.html (text/html — 725 bytes)

Show replies by date

Carter Schonwald

24 Jun 24 Jun

4:19 p.m.

use mapM you've got mapMI :: (a -> IO b) -> [a] -> IO [b] mapMI _ [] = return [] mapMI f (x:xs) = do y <- SIU.unsafeInterleaveIO (f x) ; ys <- SIU.unsafeInterleaveIO (mapMI f xs) ; return (y:ys) in your code, and thats going to make you sad On Tue, Jun 24, 2014 at 12:08 PM, C K Kashyap wrote:

...

Dear cafe,

I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro

Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.

I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect.

Regards, Kashyap

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Bryan O'Sullivan

4:19 p.m.

On Tue, Jun 24, 2014 at 9:08 AM, C K Kashyap wrote:

...

Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.

Don't use String, and don't use unsafeInterleaveIO unless you know what you're doing.

C K Kashyap

4:25 p.m.

Thanks Bryan, I used unsafeInterleaveIO after I ran into "too many open file handles" error. Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second. Regards, Kashyap On Tue, Jun 24, 2014 at 9:49 PM, Bryan O'Sullivan wrote:

...

On Tue, Jun 24, 2014 at 9:08 AM, C K Kashyap wrote:

...
Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.

Don't use String, and don't use unsafeInterleaveIO unless you know what you're doing.

Brandon Allbery

4:29 p.m.

On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap wrote:

...

I used unsafeInterleaveIO after I ran into "too many open file handles" error.

That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

C K Kashyap

4:32 p.m.

That is what is particularly frustrating...its a tiny program and somewhat trivial thing that I am trying to achieve. If I use mapM as suggested by others, I quickly run into - openFile: resource exhausted (Too many open files) Regards, Kashyap Regards, Kashyap On Tue, Jun 24, 2014 at 9:59 PM, Brandon Allbery wrote:

...

On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap wrote:

...
I used unsafeInterleaveIO after I ran into "too many open file handles" error.

That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up.

-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

Darren Grant

4:46 p.m.

It's one of those quirks arising from historical decisions that makes Haskell's non-strict idioms pathological where resources are limited. (A very real and common case.) You need very strict, possibly explicit, file handle management here. Look to APIs that deal with this problem in a way that suits your situation better: System.IO.Strict may be an option. Cheers, Darren On Jun 24, 2014 9:33 AM, "C K Kashyap" wrote:

...

That is what is particularly frustrating...its a tiny program and somewhat trivial thing that I am trying to achieve.

If I use mapM as suggested by others, I quickly run into -

openFile: resource exhausted (Too many open files)

Regards,

Kashyap

Regards, Kashyap

On Tue, Jun 24, 2014 at 9:59 PM, Brandon Allbery wrote:

...
On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap wrote:

...
I used unsafeInterleaveIO after I ran into "too many open file handles" error.

That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up.

-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Nicolas Trangez

5:05 p.m.

On Tue, 2014-06-24 at 22:02 +0530, C K Kashyap wrote:

...

If I use mapM as suggested by others, I quickly run into -

openFile: resource exhausted (Too many open files)

You should `seq` the calculated length before returning the value, otherwise the file descriptor needs to be kept open until the processing result is used (when printing the result list), and you exhaust your resources. Anyway, for giggles I added concurrency support using `async` and IO handling using `pipes`/`pipes-bytestring` to my fork in https://github.com/NicolasT/haskell-perf-repro/commit/86639daa3b577487e75d38... Using the sample dataset created by your Makefile, this version is somewhat slower than the previous (and your Perl script), but in case the dataset is huge (and depending on the block-device you're using, the layout of the files on it,... the lot) it might help (or might not). Nicolas

C K Kashyap

5:07 p.m.

Thanks Nicholas :) ... and if I understand right doing `seq` would be equivalent to Strict IO that Darren mentioned. right? Regards, Kashyap On Tue, Jun 24, 2014 at 10:35 PM, Nicolas Trangez wrote:

...

On Tue, 2014-06-24 at 22:02 +0530, C K Kashyap wrote:

...
If I use mapM as suggested by others, I quickly run into -

openFile: resource exhausted (Too many open files)

You should `seq` the calculated length before returning the value, otherwise the file descriptor needs to be kept open until the processing result is used (when printing the result list), and you exhaust your resources.

Anyway, for giggles I added concurrency support using `async` and IO handling using `pipes`/`pipes-bytestring` to my fork in

https://github.com/NicolasT/haskell-perf-repro/commit/86639daa3b577487e75d38... Using the sample dataset created by your Makefile, this version is somewhat slower than the previous (and your Perl script), but in case the dataset is huge (and depending on the block-device you're using, the layout of the files on it,... the lot) it might help (or might not).

Nicolas

Tom Ellis

5:14 p.m.

On Tue, Jun 24, 2014 at 10:37:35PM +0530, C K Kashyap wrote:

...

Thanks Nicholas :) ... and if I understand right doing `seq` would be equivalent to Strict IO that Darren mentioned. right?

It's probably best to think of them as distinct. Strict IO roughly means the whole file will be read at once so the file handle can be released. On the other hand `seq`ing the `length` forces the calculation of the length so the file is read completely and the file handle can be released. They have similar outcomes but the means of achieving them is different. Tom

C K Kashyap

25 Jun 25 Jun

1:56 a.m.

Thank you Tom, Niklas, Nicolas and Stuart, Calculating the length was just an illustration - what I really need to do is parse the file. Anyway, it is clear that String for IO is a bad idea - I wonder if examples of String based readFile etc for beginners is a good idea or not. Regards, Kashyap On Tue, Jun 24, 2014 at 10:44 PM, Tom Ellis < tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:

...

On Tue, Jun 24, 2014 at 10:37:35PM +0530, C K Kashyap wrote:

...
Thanks Nicholas :) ... and if I understand right doing `seq` would be equivalent to Strict IO that Darren mentioned. right?

It's probably best to think of them as distinct.

Strict IO roughly means the whole file will be read at once so the file handle can be released. On the other hand `seq`ing the `length` forces the calculation of the length so the file is read completely and the file handle can be released. They have similar outcomes but the means of achieving them is different.

Tom _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Yitzchak Gale

1 Jul 1 Jul

12:18 p.m.

Hi Kashyap, I'm responding only now because I just saw this thread on reddit. You wrote:

...

it is clear that String for IO is a bad idea - I wonder if examples of String based readFile etc for beginners is a good idea

It's not always a bad idea. Especially for beginners. Conceptually, String is the simplest, and often the performance isn't so bad. For serious applications, though, you should use the text library. One thing you should never use for text is ByteString - that is for binary data, not text. Lazy IO is also OK in simple cases. But as you have seen, for real applications you need to be aware of its limitations. When there is interleaving of IO operations between multiple resources, the simplicity of lazy IO disappears very quickly. In those cases, you should consider alternative IO libraries, such as conduit and pipes. Regards, Yitz

Tom Ellis

24 Jun 24 Jun

4:33 p.m.

On Tue, Jun 24, 2014 at 12:29:38PM -0400, Brandon Allbery wrote:

...

On Tue, Jun 24, 2014 at 12:25 PM, C K Kashyap wrote:

...
I used unsafeInterleaveIO after I ran into "too many open file handles" error.

That all by itself makes me think your problem is elsewhere and unsafeInterleaveIO is just covering it up.

Yes, it's doing lazy IO.

Tom Ellis

4:32 p.m.

On Tue, Jun 24, 2014 at 09:55:21PM +0530, C K Kashyap wrote:

...

I used unsafeInterleaveIO after I ran into "too many open file handles" error. Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.

This pull request switching to ByteString brings performance in line with the Perl on my system: https://github.com/ckkashyap/haskell-perf-repro/pull/1 Tom

C K Kashyap

4:37 p.m.

Indeed it works. Thank you so much Tom. Sorry about the false alarm. Regards, Kashyap On Tue, Jun 24, 2014 at 10:02 PM, Tom Ellis < tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:

...

On Tue, Jun 24, 2014 at 09:55:21PM +0530, C K Kashyap wrote:

...
I used unsafeInterleaveIO after I ran into "too many open file handles" error. Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.

This pull request switching to ByteString brings performance in line with the Perl on my system:

https://github.com/ckkashyap/haskell-perf-repro/pull/1

Tom _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Tom Ellis

4:41 p.m.

On Tue, Jun 24, 2014 at 10:07:32PM +0530, C K Kashyap wrote:

...

Indeed it works. Thank you so much Tom. Sorry about the false alarm.

Interestingly with `mapMI` it is twice as fast as with `mapM`. I haven't dug futher into that but at least you're getting reasonable speeds yet so you can decide how important that is to you. Tom

C K Kashyap

4:46 p.m.

Does mapM work for you? Don't you run into too many files open error? Regards, Kashyap On Tue, Jun 24, 2014 at 10:11 PM, Tom Ellis < tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:

...

On Tue, Jun 24, 2014 at 10:07:32PM +0530, C K Kashyap wrote:

...
Indeed it works. Thank you so much Tom. Sorry about the false alarm.

Interestingly with `mapMI` it is twice as fast as with `mapM`. I haven't dug futher into that but at least you're getting reasonable speeds yet so you can decide how important that is to you.

Tom _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Tom Ellis

4:53 p.m.

On Tue, Jun 24, 2014 at 10:16:03PM +0530, C K Kashyap wrote:

...

Does mapM work for you? Don't you run into too many files open error?

It works. No error. But is slower that your mapMI. Undoubtedly there is a method that is better than `unsafeInterleaveIO` and `mapMI` in terms of both performance and conceptual coherence.

Tom Ellis

5:07 p.m.

On Tue, Jun 24, 2014 at 05:53:29PM +0100, Tom Ellis wrote:

...

On Tue, Jun 24, 2014 at 10:16:03PM +0530, C K Kashyap wrote:

...
Does mapM work for you? Don't you run into too many files open error?

It works. No error. [...]

Oh by the way, you should compile with `-O2` so that strictness analysis kicks in. Nicolas Trangez approaches this issue with an explicit `seq` instead. Tom

Stuart A. Kurtz

5:08 p.m.

New subject: Need help - my haskell code is over 50 times slower than equivalent perl implementation

Might it be more efficient to bypass counting altogether, and just use the Haskell version of stat? getFileSize :: FilePath -> IO Int64 getFileSize path = do stat <- getFileStatus path let COff fs = fileSize stat return fs Peace, Stu

Nicolas Trangez

5:10 p.m.

On Tue, 2014-06-24 at 12:08 -0500, Stuart A. Kurtz wrote:

...

Might it be more efficient to bypass counting altogether, and just use the Haskell version of stat?

getFileSize :: FilePath -> IO Int64 getFileSize path = do stat <- getFileStatus path let COff fs = fileSize stat return fs

Obviously. I assumed the original author would do something else while processing the file, not just count bytes... Nicolas

Niklas Haas

5:13 p.m.

On Tue, 24 Jun 2014 21:55:21 +0530, C K Kashyap wrote:

...

Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.

There are just so many things wrong with String and the String API, these are all colliding: 1. String is based on Char, which is a horribly inefficient representation of characters (a tagged machine word, possibly 64 bit, representing the UCS code point) 2. String is based on [], which are lazy linked lists - For each single character, you have a heap-allocated pointed to a heap-allocated Char and the heap-allocated rest of the list; traversing the String to compute its length involves going through the entire heap, dereferencing after every single step, and pattern matching on the (:) constructors (which also requires a branch), garbage collecting the unused intermediate structures along the way. 3. readFile is based on lazy IO, which means it opens the Handle and leaves it dangling, then returns some values that, when forced, will dynamically cause it to read some more bytes from the disk, allocate heap objects for them, write them in there, and leave the Handle dangling again. Only when you reach the end of the file does the Handle eventually automatically get garbage collected and the file therefore closed. (In theory; no practical guarantees on when it actually gets closed) 4. All the while, your program is churning through the files like mad, creating more and more dangling pointers that are just precariously hanging around all over the place. Contrast this to what the simple change to B.readFile and B.length does: ByteString is basically a pointer to some array of bytes in memory, together with its length. all B.readFile has to do is read in the bytes 1:1 without allocating anything other than a single fixed sized buffer to hold them all, the size of which is given by the file size. All B.length has to do is return the already-known size of the buffer. After the handle is opened and the contents read in, it's immediately closed. No garbage collection necessary other than eventually freeing up the buffer used to store its contents - a single operation. (In addition to the pointer itself, which is stored in the heap) As you can see, there is a complete world of difference between the two approaches on many levels - runtimes differing by one or even many orders of magnitude are no surprise. Some would go as far as to say that the String based file I/O APIs should be removed completely. Either way, you could make your operation even more efficient by not opening the files at all and just consulting their length through filesystem APIs.

Nicolas Trangez

4:38 p.m.

On Tue, 2014-06-24 at 21:38 +0530, C K Kashyap wrote:

...

Dear cafe,

I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro

Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.

I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect.

That's simply because your Haskell program is doing something completely different than your Perl code. Using `String` is the biggest offense. You're creating huge lists of `Char`s instead of working on arrays of bytes, like your Perl code does. I made some trivial changes to bring the Haskell code on par with your Perl code speed-wise at https://github.com/NicolasT/haskell-perf-repro/compare/ckkashyap:master...ma... Check it out! Nicolas

C K Kashyap

4:45 p.m.

Thanks Nicholas ... you've gotten rid of unsafeInterleaveIO ... It feels much better :) Regards, Kashyap On Tue, Jun 24, 2014 at 10:08 PM, Nicolas Trangez wrote:

...

On Tue, 2014-06-24 at 21:38 +0530, C K Kashyap wrote:

...
Dear cafe,

I have created a reproducible app here - https://github.com/ckkashyap/haskell-perf-repro

Essentially I am trying to open a number of files and printing out their sized by reading them in and computing it's length.

I have the equivalent perl program also there. I'd appreciate it very much if someone could take a look at my Haskell implementation and point out what I am doing wrong. Since it's over 50 times slower than the perl code, I assume I am doing something obviously incorrect.

That's simply because your Haskell program is doing something completely different than your Perl code.

Using `String` is the biggest offense. You're creating huge lists of `Char`s instead of working on arrays of bytes, like your Perl code does.

I made some trivial changes to bring the Haskell code on par with your Perl code speed-wise at

https://github.com/NicolasT/haskell-perf-repro/compare/ckkashyap:master...ma... Check it out!

Nicolas

4024

Age (days ago)

4031

Last active (days ago)

List overview

Download

23 comments

10 participants

participants (10)

Brandon Allbery
Bryan O'Sullivan
C K Kashyap
Carter Schonwald
Darren Grant
Nicolas Trangez
Niklas Haas
Stuart A. Kurtz
Tom Ellis
Yitzchak Gale