
A while back, following the most recent discussion about filepaths and IO generally, I decided to pick up the torch and try my hand at a solution that dealt with all the issues. My conception of the problem can be found here: http://www.haskell.org//pipermail/haskell-cafe/2005-January/008955.html

However, I quickly got sidetracked by the character encoding issue, and by sequential IO operations generally. I have come up with some code that I think is interesting, not least because it is somewhat performant. In the spirit of 'release early, release often', I am now making it available.

I ended up developing a sequential IO API and a couple of test cases to drive it; I currently have a simple word count and an md5 implementation. There is a darcs repo at: http://www.eecs.tufts.edu/~rdocki01/filepath/

The code relies on multi-parameter typeclasses, fundeps and unboxed tuples, so it's GHC only.

The good news:

1) The API is semi-manageable. It is based around Producer, Transformer and Consumer functors (at least I think they can be called functors; I'm not really knowledgeable here). I think that with a little more work (and the input of people more experienced at this stuff than I am) it could be sugared into a pretty usable API.

2) It is performant (mostly). At least it outperforms other Haskell IO methods I have tried. My 'wc' is about twice as fast as the current shootout version in informal tests (the shootout code is included in the repo). My md5 can sum somewhere between 2-4 MB/sec on my hardware.

3) Very importantly, the code appears to have reliable constant-space behavior.

4) We are not artificially forced into the IO or ST monads for performance reasons. Explicit state passing is used where necessary, which seems to have the added benefit of helping the compiler find good optimizations (pure speculation).

The bad news:

1) Mostly performant is still not great. My wc takes about 6 times as long as the C version on my machine, and md5 takes about 80 times as long. (Interestingly, the C md5sum takes about 1/5 of the time of the C wc!) However, it is within an order of magnitude of a Java implementation of md5 (using the standard digest classes).

2) The performance is pretty fragile. Small changes can cause large performance hits for no easily discernible reason. This probably relates to the way particular optimizations do or don't get applied. The situation is hugely complicated by the typeclass-heavy API, I am sure.

3) The stream-observer paradigm is a somewhat difficult programming environment.

My next step is to try some actual character encoding implementations (the original purpose, after all) and see how that goes. I'd also like to try gzip and gunzip transformer layers. Any ideas for improvements (including patches!) are welcome.

Robert Dockins

PS: the code currently includes a number of vestigial remnants of false starts, and is generally kind of ugly; you are warned.
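To make the shape of the API concrete, here is a minimal sketch of a Producer/Transformer/Consumer pipeline. The three role names come from the post above; the concrete types and helpers (Consumer and Producer as written here, charsOf, premap, wordCount, run, the "input.txt" path) are hypothetical illustrations of the idea, not the code in the darcs repo.

{-# LANGUAGE RankNTypes, BangPatterns #-}
module Main where

import Data.Char (isSpace)
import System.IO

-- A Consumer is a strict left fold: a step function plus a start state.
-- (The post mentions unboxed tuples for state; a plain pair is enough
-- for a sketch.)
data Consumer a s = Consumer (s -> a -> s) s

-- A Transformer adapts a Consumer of one element type into a
-- Consumer of another, leaving the state type alone.
type Transformer a b = forall s. Consumer b s -> Consumer a s

-- A Producer knows how to drive any Consumer over a source of elements.
newtype Producer a = Producer (forall s. Consumer a s -> IO s)

-- Produce the characters of a file, one at a time (naive; a serious
-- version would read blocks).
charsOf :: FilePath -> Producer Char
charsOf path = Producer $ \(Consumer step z) ->
  withFile path ReadMode $ \h ->
    let go !s = do eof <- hIsEOF h
                   if eof
                     then return s
                     else do c <- hGetChar h
                             go (step s c)
    in go z

-- An example Transformer: apply a function to each element on its
-- way to the downstream Consumer.
premap :: (a -> b) -> Transformer a b
premap f (Consumer step z) = Consumer (\s a -> step s (f a)) z

-- A word-counting Consumer: the state tracks the count and whether
-- we are currently inside a word.
wordCount :: Consumer Char (Int, Bool)
wordCount = Consumer step (0, False)
  where
    step (!n, inWord) c
      | isSpace c = (n, False)
      | inWord    = (n, True)
      | otherwise = (n + 1, True)

run :: Producer a -> Consumer a s -> IO s
run (Producer p) c = p c

main :: IO ()
main = do
  (n, _) <- run (charsOf "input.txt") wordCount  -- placeholder path
  print n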

On Thu, Mar 17, 2005 at 10:31:20AM -0500, robert dockins wrote:
2) It is performant (mostly). At least it outperforms other Haskell IO methods I have tried. My 'wc' is about twice as fast as the current shootout version in informal tests (the shootout code is included in the repo). My md5 can sum somewhere between 2-4 MB/sec on my hardware.
You know that http://www.bagley.org/~doug/shootout/ is frozen, don't you? For a current version, look at http://shootout.alioth.debian.org/

The current version is fast but ugly. There was some committee work on the Haskell mailing lists to make it prettier, but it hasn't made it to the shootout yet. For some reason the GHC implementation of count-words on the shootout doesn't compile, perhaps because someone thought it would be a good idea to give two -O options: -O3 -O2

BTW, do we care about such benchmarks? I am going to have some spare time and I could work on Haskell solutions a bit, but I'm not sure it's worth the hassle.

Best regards
Tomasz

2) It is performant (mostly). At least it outperforms other Haskell IO methods I have tried. My 'wc' is about twice as fast as the current shootout version in informal tests (the shootout code is included in the repo). My md5 can sum somewhere between 2-4 MB/sec on my hardware.
You know that http://www.bagley.org/~doug/shootout/ is frozen, don't you? For a current version, look at http://shootout.alioth.debian.org/ The current version is fast but ugly. There was some committee work on the Haskell mailing lists to make it prettier, but it hasn't made it to the shootout yet.
Thanks, I do have an old version; it wasn't on bagley.org, but I'm not sure exactly where I found it. I'll compare against the newest version when I get home. The reason I compared to the shootout program was to get a sense of how well the API I was developing stacked up against hand-optimized Haskell. So even getting pretty close is a win as far as I'm concerned.

(... quick google ...) I just found Ian's md5 implementation. I'll compare to that as well when I get a chance.
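For reference, the sort of idiomatic-but-naive Haskell wc that both the streaming API and a hand-optimized entry are trying to beat is the classic lazy-IO fold. This is a baseline sketch for comparison, not the shootout code:

-- Naive lazy-IO word count: short, but allocation-heavy compared to
-- a block-reading stream fold.
main :: IO ()
main = do
  s <- getContents
  print (length (lines s), length (words s), length s)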
BTW, do we care about such benchmarks? I am going to have some spare time and I could work on Haskell solutions a bit, but I'm not sure it's worth the hassle.
I think they are interesting as an indication of where Haskell, and GHC in particular, are weak. If the techniques developed for optimizing shootout scripts can drive better optimizations or new, better libraries, I think that's worthwhile. OTOH, nobody asks if Perl golf (for example) is worthwhile; they just do it for kicks (as far as I can tell).

I've been wondering lately about the shootout results. Is there some intrinsic, Haskell-language reason that OCaml performs so well in comparison to Haskell in the shootout (e.g. even optimized Haskell code will always be slower because of the overhead of language features, or something like this), or is it a maturity issue? Five years from now, will we be able to match or surpass OCaml's efficiency? I realize the benchmarks are arbitrary, but it's just something I've been wondering about.

--Bryce

On Thu, 17 Mar 2005 16:40:04 +0100, Tomasz Zielonka wrote:
BTW, do we care about such benchmarks?
I'd say "we" (as in Haskell programmers) don't care, but "they" (as in imperative programmers that are looking into maybe trying Haskell) probably do.

/S

--
Sebastian Sylvan
+46(0)736-818655
UIN: 44640862
participants (4)
- Bryce Bockman
- robert dockins
- Sebastian Sylvan
- Tomasz Zielonka