
On Tue, 24 Jun 2014 21:55:21 +0530, C K Kashyap
Again, I doubt about String since even if I change the number of files to 12000 - the perl program finishes in less than a second.
There are just so many things wrong with String and the String API, these are all colliding: 1. String is based on Char, which is a horribly inefficient representation of characters (a tagged machine word, possibly 64 bit, representing the UCS code point) 2. String is based on [], which are lazy linked lists - For each single character, you have a heap-allocated pointed to a heap-allocated Char and the heap-allocated rest of the list; traversing the String to compute its length involves going through the entire heap, dereferencing after every single step, and pattern matching on the (:) constructors (which also requires a branch), garbage collecting the unused intermediate structures along the way. 3. readFile is based on lazy IO, which means it opens the Handle and leaves it dangling, then returns some values that, when forced, will dynamically cause it to read some more bytes from the disk, allocate heap objects for them, write them in there, and leave the Handle dangling again. Only when you reach the end of the file does the Handle eventually automatically get garbage collected and the file therefore closed. (In theory; no practical guarantees on when it actually gets closed) 4. All the while, your program is churning through the files like mad, creating more and more dangling pointers that are just precariously hanging around all over the place. Contrast this to what the simple change to B.readFile and B.length does: ByteString is basically a pointer to some array of bytes in memory, together with its length. all B.readFile has to do is read in the bytes 1:1 without allocating anything other than a single fixed sized buffer to hold them all, the size of which is given by the file size. All B.length has to do is return the already-known size of the buffer. After the handle is opened and the contents read in, it's immediately closed. No garbage collection necessary other than eventually freeing up the buffer used to store its contents - a single operation. (In addition to the pointer itself, which is stored in the heap) As you can see, there is a complete world of difference between the two approaches on many levels - runtimes differing by one or even many orders of magnitude are no surprise. Some would go as far as to say that the String based file I/O APIs should be removed completely. Either way, you could make your operation even more efficient by not opening the files at all and just consulting their length through filesystem APIs.