
Greetings -- I'm looking at several FP languages for data mining, and was annoyed to learn that Erlang represents each character as 8 BYTES in a string which is just a list of characters. Now I'm reading a Haskell book which states the same. Is there a more efficient Haskell string-handling method? Which functional language is the most suitable for text processing?
Cheers,
Alexy

On Mon, Jan 22, 2007 at 05:18:19PM -0800, Alexy Khrabrov wrote:
> Greetings -- I'm looking at several FP languages for data mining, and was annoyed to learn that Erlang represents each character as 8 BYTES in a string which is just a list of characters. Now I'm reading a Haskell book which states the same.
The book is lying: the size of strings is unspecified and implementation-dependent. In GHC, a String takes 12 or 20 bytes per character, depending on construction details.
> Is there a more efficient Haskell string-handling method?
Yes! Data.ByteString.* implements packed strings of bytes. They are less lazy and don't support Unicode, but they are small (8 bits per character) and fast (I have 100 MB/s disks and my ByteString-based throwaway filters are I/O-bound).
> Which functional language is the most suitable for text processing?
If you expected any answer other than Haskell, you asked on the wrong list. :)
Stefan
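As a rough illustration of the kind of throwaway filter mentioned in this reply, here is a minimal sketch of a grep-like filter over strict ByteStrings. The helper name keepMatching and the search term "error" are made up for the example and are not from the thread.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper: keep only the lines that contain the needle.
-- Each Char is treated as a single byte, so this is not Unicode-aware.
keepMatching :: B.ByteString -> B.ByteString -> B.ByteString
keepMatching needle = B.unlines . filter (needle `B.isInfixOf`) . B.lines

-- One way to turn it into a stdin-to-stdout filter:
--   main :: IO ()
--   main = B.interact (keepMatching (B.pack "error"))
```

Note that the strict B.interact reads all of standard input into memory before any output is produced, which is one sense in which these strings are "less lazy".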

On Jan 22, 2007, at 7:18 PM, Alexy Khrabrov wrote:
> Greetings -- I'm looking at several FP languages for data mining, and was annoyed to learn that Erlang represents each character as 8 BYTES in a string which is just a list of characters. Now I'm reading a Haskell book which states the same.
The standard string type in Haskell is indeed a linked list of characters, with about 12 bytes of overhead per character.
> Is there a more efficient Haskell string-handling method?
Yes! There is a library called Data.ByteString [1]. It is included with the latest versions of GHC and Hugs, and is also available as a standalone package. Data.ByteString represents strings as packed arrays of bytes, so the cost is about 1 byte per character. This library exhibits fantastic performance, rivaling C's speed while maintaining the elegance of Haskell.
Cheers,
Spencer Janssen
[1] http://www.cse.unsw.edu.au/~dons/fps.html
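To give a feel for how the packed representation is used in practice, here is a small sketch that counts word frequencies with Data.ByteString.Char8, whose functions mirror the familiar list-based String functions. The function name wordFreq is an assumption for illustration, and Data.Map.Strict comes from the containers package rather than from the library discussed above.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M

-- Hypothetical helper: count how often each whitespace-separated
-- word occurs in a packed string.
wordFreq :: B.ByteString -> M.Map B.ByteString Int
wordFreq = M.fromListWith (+) . map (\w -> (w, 1)) . B.words

-- Possible use on a file (the file name is made up):
--   main :: IO ()
--   main = B.readFile "corpus.txt" >>= print . M.toList . wordFreq
```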

Hi Alexy,
> Now I'm reading a Haskell book which states the same. Is there a more efficient Haskell string-handling method? Which functional language is the most suitable for text processing?
There are the Data.ByteString things, which are great and have much less overhead. But remember that Haskell is lazy. If you are thinking "well, I have to process a 50 MB file", remember that Haskell can read and process the file lazily, which substantially reduces the memory requirements: only a small portion needs to be in memory at a time.
Thanks
Neil
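The point about lazy reading can be combined with packed strings. The sketch below uses Data.ByteString.Lazy.Char8, which is a substitution on my part (the reply itself may well have had ordinary String I/O in mind); a lazy ByteString is read in chunks, so a simple fold like this should not need the whole file in memory. The name countNewlines and the file name are hypothetical.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical helper: count newline characters.
-- The input is consumed chunk by chunk as the count proceeds.
countNewlines :: L.ByteString -> Int
countNewlines = fromIntegral . L.count '\n'

-- Possible use on a large file (the file name is made up):
--   main :: IO ()
--   main = L.readFile "big.log" >>= print . countNewlines
```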
participants (4)
- Alexy Khrabrov
- Neil Mitchell
- Spencer Janssen
- Stefan O'Rear