
It seems I have found a hole in Haskell... :(
Before I start to develop a library with functions such us those on
http://haskell.org/hawiki/BinaryIo (hGetWord8, hGetWord16le, hGetWord16be,
etc), is there some reliable library that can help me ?
I basically need a set of functions to read binary data out of a Handle (a
higher lever of hGetBuf and hPutBuf). What I am doing is implementing a DNS
server, so, my typical need is:
First two bytes are an integer.
The next bit is a boolean.
So are the following three bits.
Then I have 4 bits which is also an integer.
Etc.
Any help is greatly appreciated.
Thank you.
--
Pupeno

pupeno:
It seems I have found a hole in Haskell... :( Before I start to develop a library with functions such us those on http://haskell.org/hawiki/BinaryIo (hGetWord8, hGetWord16le, hGetWord16be, etc), is there some reliable library that can help me ? I basically need a set of functions to read binary data out of a Handle (a higher lever of hGetBuf and hPutBuf). What I am doing is implementing a DNS server, so, my typical need is: First two bytes are an integer. The next bit is a boolean. So are the following three bits. Then I have 4 bits which is also an integer. Etc. Any help is greatly appreciated. Thank you.
Use NewBinary, as suggested by that wiki page :) This is the standard binary lib, used in ghc, nhc and lots of other projects (even lambdabot!). With this, and Data.Bits, you should be able to do whatever you need, I think. -- Don

On Dec 26, 2005 08:55 PM, Donald Bruce Stewart
pupeno:
It seems I have found a hole in Haskell... :( Before I start to develop a library with functions such us those on http://haskell.org/hawiki/BinaryIo (hGetWord8, hGetWord16le, hGetWord16be, etc), is there some reliable library that can help me ?
Use NewBinary, as suggested by that wiki page :)
Some caveats. The NewBinary library contains two things: (1) A uniform interface for reading and writing binary files and binary memory, including bitwise reading/writing. (2) A Binary class for serialization/deserialization While it is tempting to use the functions in the Binary class (put/get), it is important to realize that they do not care about big endian vs little endian. Nor do they attempt to be compatible with equivalent C data types. For example, Char is stored as 4 bytes. However, while the Binary class in NewBinary may not be appropriate, the uniform interface to binary files/memory could be a good foundation for building hGetWord16le, hGetWord16be, etc. If you submitted a module that added these functions, I would be glad to update the archive. j.

On Tuesday 27 December 2005 02:10, Jeremy Shaw wrote:
they do not care about big endian vs little endian. Does it mean that it just reads the data in whatever endianess the computer is in, right ?
However, while the Binary class in NewBinary may not be appropriate, the uniform interface to binary files/memory could be a good foundation for building hGetWord16le, hGetWord16be, etc.
If you submitted a module that added these functions, I would be glad to update the archive. I've been reading NewBinary's code, it is a bit intimidating for me[1]. It seems easier to define my own binary functions[2]. Could you point me a bit what is the unifor interface so I might give it another chance (to work with NewBinary) ? Thanks. -- Pupeno
(http://pupeno.com)
PS: Is anything wrong with [2] ? [1] I've been with Haskell for a little more than a couple of weeks. [2] Doing some experimentation I wrote: -- | Read a Word8 from a Ptr of any type. ptrToWord8 p = do let np :: Ptr Word8 np = castPtr p r <- (peek np) return r -- | Read a Word16 Big Endian from a Ptr of any type. ptrToWord16BE p = do b1 <- ptrToWord8 p b2 <- ptrToWord8 (plusPtr p 1) let nb1 :: Word16 nb1 = fromIntegral b1 nb2 :: Word16 nb2 = fromIntegral b2 return ((shift nb1 8) .|. nb2) -- | Read a Word16 Little Endian from a Ptr of any type. ptrToWord16LE p = do b1 <- ptrToWord8 p b2 <- ptrToWord8 (plusPtr p 1) let nb1 :: Word16 nb1 = fromIntegral b1 nb2 :: Word16 nb2 = fromIntegral b2 return ((shift nb2 8) .|. nb1) which I used to read Word16s successfully.

Hello Pupeno, Tuesday, December 27, 2005, 6:12:37 PM, you wrote:
they do not care about big endian vs little endian. P> Does it mean that it just reads the data in whatever endianess the computer is P> in, right ?
NewBinary read/write data in so-called network format, which is the same as in your ptrToWord16BE function, although i think that it is called Little-Endian (last byte is little one). it is the order used on power/sparc/motorola cpus -- Best regards, Bulat mailto:bulatz@HotPOP.com

On Tue, Dec 27, 2005 at 01:10:24AM -0300, Pupeno wrote:
It seems I have found a hole in Haskell... :(
I don't think so. Even if there was no way to do this, you could fill the hole yourself.
Before I start to develop a library with functions such us those on http://haskell.org/hawiki/BinaryIo (hGetWord8, hGetWord16le, hGetWord16be, etc), is there some reliable library that can help me ? I basically need a set of functions to read binary data out of a Handle (a higher lever of hGetBuf and hPutBuf). What I am doing is implementing a DNS server, so, my typical need is:
I think in the case of DNS, where the packets are quite small, it's best to separate reading the Handle from interpreting the data. For example, you can read from socket to a FastPackedString, and have a purely functional deserialization code work on it. Some time ago I was playing with DNS too. I have a library that can construct and interpret DNS packets, but it's a bit incomplete right now. It reads packets as Strings, but it should be quite straightforward to make it read and interpret FastPackedStrings. http://www.uncurry.com/repos/TzDNS I've just glanced at the code, and yes, it is a bit of a mess. I'll see if I can tidy it up. Perhaps you could help me? Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

How is this different from the (un)pickle process that has been
discussed here recently? Recently I've seen the Binary discussions,
and the pickeling discussions, and I noticed they seemed to be talking
about the same process.
Brian McQueen
On 12/26/05, Tomasz Zielonka
On Tue, Dec 27, 2005 at 01:10:24AM -0300, Pupeno wrote:
It seems I have found a hole in Haskell... :(
I don't think so. Even if there was no way to do this, you could fill the hole yourself.
Before I start to develop a library with functions such us those on http://haskell.org/hawiki/BinaryIo (hGetWord8, hGetWord16le, hGetWord16be, etc), is there some reliable library that can help me ? I basically need a set of functions to read binary data out of a Handle (a higher lever of hGetBuf and hPutBuf). What I am doing is implementing a DNS server, so, my typical need is:
I think in the case of DNS, where the packets are quite small, it's best to separate reading the Handle from interpreting the data. For example, you can read from socket to a FastPackedString, and have a purely functional deserialization code work on it.
Some time ago I was playing with DNS too. I have a library that can construct and interpret DNS packets, but it's a bit incomplete right now. It reads packets as Strings, but it should be quite straightforward to make it read and interpret FastPackedStrings.
http://www.uncurry.com/repos/TzDNS
I've just glanced at the code, and yes, it is a bit of a mess. I'll see if I can tidy it up. Perhaps you could help me?
Best regards Tomasz
-- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mon, Dec 26, 2005 at 10:20:15PM -0800, Brian McQueen wrote:
How is this different from the (un)pickle process that has been discussed here recently? Recently I've seen the Binary discussions, and the pickeling discussions, and I noticed they seemed to be talking about the same process.
Not much different I guess, but I haven't followed the discussion closely. Anyway, the only value of my DNS code is that it already handles DNS packets in pure Haskell. BTW, if I abstract the deserialization monad using the BinaryParser type class, I can use different concrete backends, for example Parsec, low-level UArray based parser, FastPackedString based parser, NewBinary, etc. Can be handy for debugging - Parsec can give nice error messages, the only problem is that it insists in line/column positions. Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

On Dec 26, 2005 10:20 PM, Brian McQueen
How is this different from the (un)pickle process that has been discussed here recently?
There is one important difference in my mind. Most of the pickling code is concerned with turning haskell data into a binary stream that can later be turned back into haskell data. There is no interoperability required -- the thing that writes it is the thing that will read it. For parsing arbitrary binary data formats a few other important issues arise that you can sometimes forget about with pickling such as: ~ byte-alignment ~ big-endian vs little endian ~ size of data types (for example is Char one byte or four) ~ bitwise parsing Both pickling and binary format parsing can be built on the same underlying low-level binary access methods. I have also wondered if they might be able to share the same high level interface because they don't really seem that far removed from one another in many respects. But I have not had time to think about it much. SerTH can be used to derive pickling code for arbitrary haskell data types. And I am pretty sure I once saw code that could derive a binary parser from a C struct. So, there really does seem to be a lot of overlap. j.

This is what I spent the past 3 months on. Pickling code that interoperates with a C++ server that sends things to me little- endian. And sends other wierd data like unicode strings that are zero- terminated. SerTH does not handle things like that since it cannot divine the wire format for you. I used the pickler combinators approach which worked great in theory since I could specify the wire format and have a single spec for pickling and unpickling. In practice this turned out to be too slow and I still don't know why. I'm still waiting to see how this should really be done. Joel On Dec 27, 2005, at 7:35 AM, Jeremy Shaw wrote:
There is one important difference in my mind. Most of the pickling code is concerned with turning haskell data into a binary stream that can later be turned back into haskell data. There is no interoperability required -- the thing that writes it is the thing that will read it.
For parsing arbitrary binary data formats a few other important issues arise that you can sometimes forget about with pickling such as:
~ byte-alignment ~ big-endian vs little endian ~ size of data types (for example is Char one byte or four) ~ bitwise parsing
Both pickling and binary format parsing can be built on the same underlying low-level binary access methods.

On Tue, Dec 27, 2005 at 08:37:27AM +0000, Joel Reymont wrote:
This is what I spent the past 3 months on. Pickling code that interoperates with a C++ server that sends things to me little- endian. And sends other wierd data like unicode strings that are zero- terminated.
SerTH does not handle things like that since it cannot divine the wire format for you. I used the pickler combinators approach which worked great in theory since I could specify the wire format and have a single spec for pickling and unpickling. In practice this turned out to be too slow and I still don't know why.
What do you mean by too slow? What is the desired input/output throughput? Maybe I'll find some time to improve my BinaryParser/BinaryUnparser code. Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

Tomasz, Try http://wagerlabs.com/timeleak.tgz. See the "Killer pickler combinators" thread as well. My desired goal is to have 4k bots (threads?) running at the same time. At, say, 1k/s per bot I figure something like 4Mb/s round-trip. Each bot cannot spend more than a couple of seconds on pickling/ unpickling. I'm not even close to hitting that goal even reading from a file. I'm getting delays of 4s+ with just 100-200 bots reading from a file and even less than that in a networked environment. The more bots I run the higher the delays, to the point of frequent delays of 10s+. The kicker is that some packets come in at 4k compressed with Zlib but become something like 50k uncompressed and then expand to a list of 500+ records, etc. Each bot is given 5, 15 or 35 seconds to respond by the poker server and this is proving to be too little for my Haskell implementation. All the data comes in little-endian and there are a lot of zero- terminated unicode strings (each short needs to be swapped). My customer is running FreeBSD/Intel so swapping should not affect them but their performance is that much different than mine. It could well be that the scheduler is playing a role in this also. I spent 3 hard months (barely any weekends, no holidays) polishing my Haskell code. I started the Erlang rewrite at about 3pm this saturday and about 1.5 workdays later I have 80% of what I need. I expect to finish the rest today and will have a good yardstick to measure Haskell against. It took me ~2 hours of adapting the Pickler Combinators paper to Erlang. See http://wagerlabs.com/erlang/pickle.erl. Scroll to the bottom for examples. I spent the rest of the time converting the 150+ records to the new format. Once I deliver my rewritten project I would like to setup a Haskell vs. Erlang pickling contest to see if Haskell can match up and figure out what is preventing it from doing so if it can't. Then I lend a hand to add whatever is missing. My only requirement is that there be a _single_ spec for pickling and unpickling, i.e. no separate methods. The following is not acceptable to me ;-). puTableInfo :: PU TableInfo puTableInfo = PU fn1 fn2 fn3 where fn1 ptr ptr ti = do ptr <- appP endianW64 ptr $ ti_1 ti ptr <- appP endianW16 ptr $ ti_2 ti ptr <- appP endianW16 ptr $ ti_3 ti ptr <- appP w8 ptr $ ti_4 ti ptr <- appP wstring ptr $ ti_5 ti ptr <- appP endianW32 ptr $ ti_6 ti ptr <- appP (enum endianW16 :: PU GameType) ptr $ ti_7 ti ptr <- appP endianW16 ptr $ ti_8 ti ptr <- appP bool ptr $ ti_9 ti ptr <- appP endianW64 ptr $ ti_10 ti ptr <- appP endianW64 ptr $ ti_11 ti ptr <- appP endianW64 ptr $ ti_12 ti ptr <- appP endianW64 ptr $ ti_13 ti ptr <- appP endianW16 ptr $ ti_14 ti ptr <- appP (enum w8) ptr $ ti_15 ti ptr <- appP endianW32 ptr $ ti_16 ti ptr <- appP (enum w8) ptr $ ti_17 ti ptr <- appP endianW32 ptr $ ti_18 ti ptr <- appP (list endianW32 w8) ptr $ ti_19 ti ptr <- appP endianW32 ptr $ ti_20 ti return $! ptr fn2 ptr ptr = do (ti1, ptr) <- appU endianW64 ptr (ti2, ptr) <- appU endianW16 ptr (ti3, ptr) <- appU endianW16 ptr (ti4, ptr) <- appU w8 ptr (ti5, ptr) <- appU wstring ptr (ti6, ptr) <- appU endianW32 ptr (ti7, ptr) <- appU (enum endianW16 :: PU GameType) ptr (ti8, ptr) <- appU endianW16 ptr (ti9, ptr) <- appU bool ptr (ti10, ptr) <- appU endianW64 ptr (ti11, ptr) <- appU endianW64 ptr (ti12, ptr) <- appU endianW64 ptr (ti13, ptr) <- appU endianW64 ptr (ti14, ptr) <- appU endianW16 ptr (ti15, ptr) <- appU (enum w8) ptr (ti16, ptr) <- appU endianW32 ptr (ti17, ptr) <- appU (enum w8) ptr (ti18, ptr) <- appU endianW32 ptr (ti19, ptr) <- appU (list endianW32 w8) ptr (ti20, ptr) <- appU endianW32 ptr return $! (TableInfo ti1 ti2 ti3 ti4 ti5 ti6 ti7 ti8 ti9 ti10 ti11 ti12 ti13 ti14 ti15 ti16 ti17 ti18 ti19 ti20, ptr) Joel On Dec 27, 2005, at 8:44 AM, Tomasz Zielonka wrote:
What do you mean by too slow? What is the desired input/output throughput?
Maybe I'll find some time to improve my BinaryParser/BinaryUnparser code.

From: Joel Reymont
To: Tomasz Zielonka CC: Jeremy Shaw ,haskell-cafe@haskell.org Subject: Re: [Haskell-cafe] binary IO Date: Tue, 27 Dec 2005 09:18:54 +0000 Tomasz,
Try http://wagerlabs.com/timeleak.tgz. See the "Killer pickler combinators" thread as well.
My desired goal is to have 4k bots (threads?) running at the same time. At, say, 1k/s per bot I figure something like 4Mb/s round-trip. Each bot cannot spend more than a couple of seconds on pickling/ unpickling. I'm not even close to hitting that goal even reading from a file.
I'm getting delays of 4s+ with just 100-200 bots reading from a file and even less than that in a networked environment. The more bots I run the higher the delays, to the point of frequent delays of 10s+. The kicker is that some packets come in at 4k compressed with Zlib but become something like 50k uncompressed and then expand to a list of 500+ records, etc.
I have C++ concurrent server that performs 2600 reqs/sec on about 500 connections and dual Xeon 2.8Ghz, but no pickling /unpickling, just short text. Has sepparate IO threads that divide descriptor sets (num descs / IO thread) and worker threads as number of CPU's * 2, no locking of shared queue. So with 4k connections I guess that would be maximum 2k requests on *dual* box per second, without pickling / unpickling, just short textual protocol and simple services. I think that you will get hard time even with C to achieve your goal. Greetings, Bane. _________________________________________________________________ Don't just search. Find. Check out the new MSN Search! http://search.msn.com/

We'll see, Erlang is built for this type of stuff. I might have results from the "timeleak" test today and will probably have first networking results tomorrow. But I wish I could achieve even a fraction of that with Haskell. On Dec 27, 2005, at 9:51 AM, Branimir Maksimovic wrote:
I have C++ concurrent server that performs 2600 reqs/sec on about 500 connections and dual Xeon 2.8Ghz, but no pickling /unpickling, just short text. Has sepparate IO threads that divide descriptor sets (num descs / IO thread) and worker threads as number of CPU's * 2, no locking of shared queue. So with 4k connections I guess that would be maximum 2k requests on *dual* box per second, without pickling / unpickling, just short textual protocol and simple services. I think that you will get hard time even with C to achieve your goal.

Hello Joel, Tuesday, December 27, 2005, 12:18:54 PM, you wrote: JR> My desired goal is to have 4k bots (threads?) running at the same JR> time. At, say, 1k/s per bot I figure something like 4Mb/s round-trip. no problem. my library handle about 10-15mb/s, and i think that speed can be doubled by using unboxed ints JR> I'm getting delays of 4s+ with just 100-200 bots reading from a file divide and conquer! calc speed of networking, unzipping, unpickling separately. compare these speeds with overall program througput to calc multithreading "expenses" JR> and even less than that in a networked environment. The more bots I JR> run the higher the delays, to the point of frequent delays of 10s+. these delays says nothing about speed. you are mixing two things - your end goal is to make delays short, but your instrument - speeds of different processes and to decide what you need to optimize you must calc these speeds separately. without it, your "optimization" is just random game JR> Each bot is given 5, 15 or 35 seconds to respond by the poker server so you don't need to create complex timer thread machinery, just use 3 threads which proceed requests in FIFO order JR> I spent 3 hard months (barely any weekends, no holidays) polishing my JR> Haskell code. ... wasting your time to speed up code that is slow by design. don't forget that this article was published as FUNCTIONAL pearl, not damn-fast pearl JR> I started the Erlang rewrite at about 3pm this saturday JR> and about 1.5 workdays later I have 80% of what I need. I expect to JR> finish the rest today and will have a good yardstick to measure JR> Haskell against. i think that you will get just the same problems as with Haskell and forced to switch back because it's easier to low-level optimize in Haskell than in Erlang JR> My only requirement is that there be a _single_ spec for pickling and JR> unpickling, i.e. no separate methods. The following is not acceptable JR> to me ;-). say exactly WHY you need this single spec. i know at least 4 solutions, what is the best depends on your exact needs for example, one of these solutions looks like this: instance Binary TableInfo where put_ bh (TableInfo a b c d) = put_ bh (a,b,c,d) get bh = do (a,b,c,d) <- get bh; return (TableInfo a b c d) instance Binary GameType where put_ bh = putWord16 bh . fromEnum get = liftM toEnum . getWord16 .... -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat, On Dec 27, 2005, at 1:58 PM, Bulat Ziganshin wrote:
no problem. my library handle about 10-15mb/s, and i think that speed can be doubled by using unboxed ints
Would you like to present your version of the timeleak code plus statistics from a test run? This will demonstrate the technical superiority of your library. I'm sure it can talk the talk but can it walk the walk? Please make sure that you have bots reading from the file all at the _at the same time_, not sequentially.
these delays says nothing about speed. you are mixing two things - your end goal is to make delays short, but your instrument - speeds of different processes and to decide what you need to optimize you must calc these speeds separately. without it, your "optimization" is just random game
I need to reduce the overall delay. Individual delays do me no good. Show me a 1-2s delay in the "timeleak" unstuff code then I will be convinced.
JR> Each bot is given 5, 15 or 35 seconds to respond by the poker server
so you don't need to create complex timer thread machinery, just use 3 threads which proceed requests in FIFO order
Socket read is a blocking operation. An easy way to handle a blocking read is to launch a thread that reads from a socket and posts packets somewhere once they are retrieved. I cannot handle large numbers of bots if I block on a socket read.
JR> I spent 3 hard months (barely any weekends, no holidays) polishing my JR> Haskell code.
... wasting your time to speed up code that is slow by design. don't forget that this article was published as FUNCTIONAL pearl, not damn-fast pearl
Not quite. I ended up with the pickler code but I did not start with it.
i think that you will get just the same problems as with Haskell and forced to switch back because it's easier to low-level optimize in Haskell than in Erlang
Absolutely not. I wrote a poker server in Erlang (see the openpoker/ erlang sections of http://wagerlabs.com) so I know what I'm talking about. There's usually no need to optimize anything low-level with Erlang, it's fast enough as it is for network operations. It's also optimized for pickling as well. See http://erlang.se/doc/doc-5.4.12/ doc/programming_examples/bit_syntax.html#4, for example.
JR> My only requirement is that there be a _single_ spec for pickling and JR> unpickling, i.e. no separate methods. The following is not acceptable JR> to me ;-).
say exactly WHY you need this single spec. i know at least 4 solutions, what is the best depends on your exact needs
Because I have 150 records with a lot of fields and writing separate code for pickling and unpickling is a nightmare?
for example, one of these solutions looks like this:
instance Binary TableInfo where put_ bh (TableInfo a b c d) = put_ bh (a,b,c,d) get bh = do (a,b,c,d) <- get bh; return (TableInfo a b c d)
instance Binary GameType where put_ bh = putWord16 bh . fromEnum get = liftM toEnum . getWord16
This gets tedious very quickly. Thanks, Joel -- http://wagerlabs.com/

Hello Joel, Tuesday, December 27, 2005, 6:24:56 PM, you wrote:
no problem. my library handle about 10-15mb/s, and i think that speed can be doubled by using unboxed ints
JR> Would you like to present your version of the timeleak code plus JR> statistics from a test run? do it yourself. i can help you if you want to try JR> Please make sure that you have bots reading from the file all at the JR> _at the same time_, not sequentially. you must decide what is your problem - unpickling, multithreading, someone else. are you waiting what i can do it for you? :)
these delays says nothing about speed. you are mixing two things - your end goal is to make delays short, but your instrument - speeds of different processes and to decide what you need to optimize you must calc these speeds separately. without it, your "optimization" is just random game
JR> I need to reduce the overall delay. Individual delays do me no good. JR> Show me a 1-2s delay in the "timeleak" unstuff code then I will be JR> convinced. if, for example, the game work too slow on your computer, you must decide what is a problem - CPU, graphics card, lack of memory or something else and upgrade this detail first. you can, for example, buy the graphics card for $1000 but still have just memory swapping as main problem. believe you or not, but the same holds true in programming - the key to right optimization is to find "bottlenecks", not optimizing everything you can. spending several weeks to random optimization is like buying a gold computer case trying to speed up the game :) sorry, but your negative results say more about you than about GHC or Haskell. you are working 3 months but still don't know what is a bottleneck in your program!
JR> Each bot is given 5, 15 or 35 seconds to respond by the poker server
so you don't need to create complex timer thread machinery, just use 3 threads which proceed requests in FIFO order
JR> Socket read is a blocking operation. An easy way to handle a blocking JR> read is to launch a thread that reads from a socket and posts packets JR> somewhere once they are retrieved. I cannot handle large numbers of JR> bots if I block on a socket read. i say about timer thread. you created complex design using Map to find first event to complete, but instead you can use just 3 timer threads, each one serving a fixed timing interval. each thread will serve requests just in the order they arrive
i think that you will get just the same problems as with Haskell and forced to switch back because it's easier to low-level optimize in Haskell than in Erlang
JR> Absolutely not. I wrote a poker server in Erlang (see the openpoker/ it's strange what your pocker server don't required the same pickler library as client. this 50k structure is constructed on the server end, i'm right? JR> Because I have 150 records with a lot of fields and writing separate JR> code for pickling and unpickling is a nightmare? these concrete definitions are nightmare? :) let's compare them with that clear and obvious code: puTableInfo :: PU TableInfo puTableInfo = sequ ti_1 endian64 (\a -> sequ ti_2 endian16 (\b -> sequ ti_3 endian16 (\c -> sequ ti_4 byte (\d -> sequ ti_5 wstring (\e -> sequ ti_6 endian32 (\f -> sequ ti_7 (enum endian16 :: PU GameType) ........ (\w -> lift $! TableInfo a b c d e f g h i j k l m n o p q r v w ))))))))))))))))))))
for example, one of these solutions looks like this:
instance Binary TableInfo where put_ bh (TableInfo a b c d) = put_ bh (a,b,c,d) get bh = do (a,b,c,d) <- get bh; return (TableInfo a b c d)
instance Binary GameType where put_ bh = putWord16 bh . fromEnum get = liftM toEnum . getWord16
JR> This gets tedious very quickly. no, type classes allow code economy comparing to combinators you use. you need to define only one time how to process each type. just one definition for all lists, one for all Word16, one for all GameType and so on. all the records defined by the easy code like this one for TableInfo -- Best regards, Bulat mailto:bulatz@HotPOP.com

On Dec 27, 2005, at 4:52 PM, Bulat Ziganshin wrote:
spending several weeks to random optimization is like buying a gold computer case trying to speed up the game :)
I did not spend several weeks on optimization. I went through about 25 iterations with the timeleak code and the profiler. I watched the contribution of each function to the overall time and tried to improve that contribution.
sorry, but your negative results say more about you than about GHC or Haskell. you are working 3 months but still don't know what is a bottleneck in your program!
You are right in that I spent the first few weeks learning. By now I know that pickling is the bottleneck, though. The timeleak code is very simple. It forks X threads where each thread opens a file for reading. Each thread reads the file record by record and sounds an alarm if the time taken to unpickle a record exceeds, say, 3 seconds. There's also a clear relationship between the number of threads launched and the length of the delay so concurrency has an effect as well. I'm waiting for Simon Marlow to come back to take a look at it with him.
i say about timer thread. you created complex design using Map to find first event to complete, but instead you can use just 3 timer threads, each one serving a fixed timing interval. each thread will serve requests just in the order they arrive
I use just one timer thread right now, profiling does not tell me that timers are a problem. Your solution is also not demonstrably superior so I'm not pressed to try it.
it's strange what your pocker server don't required the same pickler library as client. this 50k structure is constructed on the server end, i'm right?
I'm working with a different poker server now. Mine actually uses lots of small packets that are easy to process. This particular server is way different.
no, type classes allow code economy comparing to combinators you use. you need to define only one time how to process each type. just one definition for all lists, one for all Word16, one for all GameType and so on. all the records defined by the easy code like this one for TableInfo
Quite possibly. I'm looking for hard proven results at this point and while you are making bold claims about the throughput of your pickling library you are not willing to demonstrate this throughput in a real-life situation. I cannot wager on your approach as it's not proven. I'm going with a proven solution for now (Erlang) but will come back to Haskell in a pinch as soon as I know "how to do the right thing". This discussion is a bit of a dead-end unless you are willing to take the next step and apply your solution to my timeleak repro case. If you or someone else is willing to do that than I can continue my set of profiler reports and eventually get some closure. It will happen once either 1) the last bit of performance is squeezed out of pickling and it's determined that threading or the scheduler is the bottleneck or 2) things work out nicely. I carefully kept track of what changes I made and how they affected speed, memory consumption, etc. I maxed out on the optimizations I have been able to make without dropping pickler combinators altogether. I no longer have the time to continue this but since a lot of people are concerned with Haskell speed I'm suggesting that someone takes over with their "best of the breed" solution to the problem and uses the same process of saving profiling reports and tracking changes until they optimize to the max or give up. Then we can lay out the series of profiling reports in a storyboard of sorts, with changes from report to report described. This would serve a great "how to write efficient Haskell" manual. Joel -- http://wagerlabs.com/

On Tue, 2005-12-27 at 18:35 +0000, Joel Reymont wrote: Hi Joel!
Then we can lay out the series of profiling reports in a storyboard of sorts, with changes from report to report described. This would serve a great "how to write efficient Haskell" manual.
We are with you watching your attempts to iron this code, and if it ends with the above "efficient" manual - great. Something like that is very welcome for all those making transition from workable to efficient code. Sincerely, Gour

That's great to hear! I will continue once I have a chance to discuss it with the gurus and optimize it further. At the same time, I would challenge everyone with a fast IO library to plug it into the timeleak code, run it under a profiler and post the results (report + any alarms). The timeleak code defines just a couple of records, it's not a lot of work to plug a different pickler in. Folks, please us help ourselves, put your money where your mouth is! Thanks, Joel On Dec 27, 2005, at 7:31 PM, Gour wrote:
We are with you watching your attempts to iron this code, and if it ends with the above "efficient" manual - great.
Something like that is very welcome for all those making transition from workable to efficient code.

Joel Reymont writes:
I would challenge everyone with a fast IO library to plug it into the timeleak code, run it under a profiler and post the results (report + any alarms).
My guess is that you would learn more if _you_ would plug the different IO libraries into your test code. I'm certain the respective library authors will be quite happy to answer questions and to investigate unexpected results. The enthusiasm you put into this subject is very much appreciated. Your collected results will be invaluable for the Haskell community. Thank you for not giving up on the language at the first sign of trouble. It's great that you're so curious about Haskell and put so much effort into finding a good, efficient solution for your needs. Peter

I will have to leave this for a while. I apologize but I'm more than a bit frustrated at the moment and it's not fair of me to take it out on everybody else. If someone is willing to take this further I will appreciate it, otherwise I'll get to it in the coming weeks. Besides knowing how to do it better I would like to know why it works poorly as it is. So it's gonna take me a while. Thanks, Joel On Dec 27, 2005, at 8:26 PM, Peter Simons wrote:
My guess is that you would learn more if _you_ would plug the different IO libraries into your test code. I'm certain the respective library authors will be quite happy to answer questions and to investigate unexpected results.

Joel Reymont writes:
I will have to leave this for a while. I apologize but I'm more than a bit frustrated at the moment and it's not fair of me to take it out on everybody else.
From what I can tell you have mastered a lot of sophisticated language theory in a very short time. Part of the reason why no-one can give you simple answers to your questions is that we don't know these answers either. Just by asking those questions you have already extended the boundaries of what the Haskell community at large knows and understands. Give things a little time to sink in, and then try it again. Even if you ultimately decide to write your application in another language, you'll find
Never mind. Haskell has a very high potential for frustrating newcomers. I went through the exact same experience when I wrote my first network code, and I still marvel at the patience the nice folks on these mailing lists had with all my complaints. that knowing and understanding Haskell will change the way you design software -- regardless of the language you use. Please don't give up now that you have come this far. Peter

Amen! Haskell has forever realigned my mind-gears and I'm observing positive results as we speak :-). On Dec 28, 2005, at 1:56 AM, Peter Simons wrote:
Even if you ultimately decide to write your application in another language, you'll find that knowing and understanding Haskell will change the way you design software -- regardless of the language you use.

I would compare Haskell to visiting the chiropractor. You will walk straighter, stand taller and your life will never be the same :D. On Dec 28, 2005, at 1:56 AM, Peter Simons wrote:
you'll find that knowing and understanding Haskell will change the way you design software -- regardless of the language you use.

Hello Peter, Tuesday, December 27, 2005, 11:26:29 PM, you wrote: PS> My guess is that you would learn more if _you_ would plug PS> the different IO libraries into your test code. I'm certain Peter, because you claimed that Haskell can be made as effective as C, please help us :) your BlockIO library is great, but it's usage is limited to very specific sutuations - when we can save pass state between processing of individual bytes what for (de)serialization tasks? my best attempts is still 10x slower than C version. can you roll up little prototype for such library or just sketch an ideas so i can try to implement it? it is also call to everyone - what is the key to efficient Binary lib? you can see my current attempt in http://freearc.narod.ru/Binary.tar.gz the key functions: instance (MemoryStream h) => ByteStream (BufferedMemoryStream h) where vPutByte mem@(Buf h buf' pos' end') byte = do pos <- readPtr pos' end <- readPtr end' if (pos==end) then do sendCurrentBuffer mem receiveNextBuffer mem vPutByte mem byte else do writePtr pos' $! (pos+:1) writeByteAt pos $! (fromEnum byte) vGetByte mem@(Buf h buf' pos' end') = do pos <- readPtr pos' end <- readPtr end' if (pos==end) then do sendCurrentBuffer mem receiveNextBuffer mem vGetByte mem else do writePtr pos' $! (pos+:1) byte <- readByteAt pos return $! (toEnum byte) and series of getWordXX/putWordXX in "class (ByteStream h) => BitStream h": putWord32 h w = do vPutByte h $! ( w `shiftR` 24) vPutByte h $! ((w `shiftR` 16) .&. 0xff) vPutByte h $! ((w `shiftR` 8) .&. 0xff) vPutByte h $! ( w .&. 0xff) getWord32 h = do w1 <- vGetByte h w2 <- vGetByte h w3 <- vGetByte h w4 <- vGetByte h return $! ((w1 `shiftL` 24) .|. (w2 `shiftL` 16) .|. (w3 `shiftL` 8) .|. (w4)) -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin writes:
your BlockIO library is great, but it's usage is limited to very specific sutuations - when we can save pass state between processing of individual bytes
In my experience, any other approach to I/O is slow. If you don't have an explicit state between processing of individual bytes, you have an implicit state. Making that state (the I/O buffer) explicit gives you control over how it is used and how it is evaluated. With an implicit state (lazy evaluation), you have no control. Fast I/O is a matter of application design. BlockIO is fast because its API forces you to design your applications as stateful, interruptible computations -- a finite state machine. If you don't want to design your I/O application as a finite state machine, then it will be slow regardless of the I/O library you use. It sucks, but that is my experience. This phenomenon isn't specific to Haskell, by the way. C++'s std::iostream is another fine example for an implicit state API that is beautiful, elegant, and quite useless for high-performance I/O.
what for (de)serialization tasks? my best attempts is still 10x slower than C version. can you roll up little prototype for such library or just sketch an ideas so i can try to implement it?
The "Fast I/O" article I posted a few days ago is my unfinished attempt at writing an efficient, general-purpose binary I/O library for Haskell. I don't know how soon I'll be able to complete that, nor do I know whether it would be useful to many Haskell programmers if I do complete it. The original BlockIO code has been stable (and quite fast) for over a year or so, but I wouldn't know of anyone actually using it. Apparently, designing explicit state data types is nothing the Haskell community is fond of. :-) Peter

Hello Peter, Thursday, December 29, 2005, 5:58:29 PM, you wrote: PS> The "Fast I/O" article I posted a few days ago is my PS> unfinished attempt at writing an efficient, general-purpose PS> binary I/O library for Haskell. where i can find it? -- Best regards, Bulat mailto:bulatz@HotPOP.com

Hi Bulat,
general-purpose binary I/O library for Haskell.
where i can find it?
the module is available here: http://cryp.to/blockio/fast-io.html http://cryp.to/blockio/fast-io.lhs The article is incomplete and a bit messy, but the code works fine. Feedback and ideas for improvement are very welcome. Peter

Hello Joel, Tuesday, December 27, 2005, 9:35:13 PM, you wrote: JR> Quite possibly. I'm looking for hard proven results at this point and JR> while you are making bold claims about the throughput of your JR> pickling library you are not willing to demonstrate this throughput JR> in a real-life situation. i say about speed of unpickling only, your program may have more bottlenecks. you can see my tests in Main.hs, for your task appropriate benchmark code will be: main = do buf <- mallocBytes (100*10^6) h <- createBufferedMemoryStream =<< openMemory buf size go (\_->get h) (25*10^6::Int) -- Best regards, Bulat mailto:bulatz@HotPOP.com

Joel Reymont wrote:
You are right in that I spent the first few weeks learning. By now I know that pickling is the bottleneck, though. The timeleak code is very simple. It forks X threads where each thread opens a file for reading. Why on earth do you want each tread to open the file and unpickle? Why not unpickle once and reuse it? Or, if this is just a test and in the future they will all read from different files (or sockets), then maybe you are hitting on a different bottleneck than you think.
This discussion is a bit of a dead-end unless you are willing to take the next step and apply your solution to my timeleak repro case. If you or someone else is willing to do that than I can continue my set of profiler reports and eventually get some closure. It will happen once either 1) the last bit of performance is squeezed out of pickling and it's determined that threading or the scheduler is the bottleneck or 2) things work out nicely.
Instead of starting optimizing a particular pickling library perhaps you should have tried different libraries and picked the best one. Since this is your project I don't think your project I don't think you can expect others to test things for you. Well, maybe if you post your code as a challange. ;) -- Lennart

On Dec 28, 2005, at 11:40 AM, Lennart Augustsson wrote:
Why on earth do you want each tread to open the file and unpickle? Why not unpickle once and reuse it? Or, if this is just a test and in the future they will all read from different files (or sockets), then maybe you are hitting on a different bottleneck than you think.
Right, they will be reading from different sockets.
Instead of starting optimizing a particular pickling library perhaps you should have tried different libraries and picked the best one.
Well, I thought pickler combinators were good but yes, you are right.
Since this is your project I don't think your project I don't think you can expect others to test things for you. Well, maybe if you post your code as a challange. ;)
I did. http://wagerlabs.com/timeleak.tgz Joel -- http://wagerlabs.com/

On 12/28/05, Joel Reymont
On Dec 28, 2005, at 11:40 AM, Lennart Augustsson wrote:
Why on earth do you want each tread to open the file and unpickle? Why not unpickle once and reuse it? Or, if this is just a test and in the future they will all read from different files (or sockets), then maybe you are hitting on a different bottleneck than you think.
Right, they will be reading from different sockets.
I think what he's trying to say is that reading from disk is different than reading from sockets so doing tests with disk reading is misrepresenting the problem and may indicate a bottle-neck that isn't there. How does this work if you remove the file-reading? I mean just putting the file on a small TCP/IP file server with some simulated latency and bandwidth limitation, and then connecting to that in each thread? /S -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

On Dec 28, 2005, at 1:05 PM, Sebastian Sylvan wrote:
How does this work if you remove the file-reading? I mean just putting the file on a small TCP/IP file server with some simulated latency and bandwidth limitation, and then connecting to that in each thread?
This is probably the way to go but I don't have a a corresponding Haskell repro case. This is what my application does, basically, and I'm getting timeouts there. Which is the reason that I created the file-based test in the first place. The results from the two seemed very similar, i.e. timeouts in both versions. I will know how the full-blown Erlang app behaves today. 3 months of Haskell vs. 3 days of Erlang. There's the right tool for every job and while Haskell code could be tweaked to be on par with Erlang doing this will be missing the point. Using Haskell for this networking app forced me to focus on all the issues _but_ the business logic. Type constraints, binary IO and serialization, minimizing memory use and fighting laziness, timers, tweaking concurrency and setting up message channels, you name it. There are no type constraints to take care of with Erlang since it's dynamically typed. It's also strict and has excellent support for binary IO, serialization, timers and so on. All this is a given. You can just assume that doing it the most natural way is "doing the right thing" and can focus on the business logic. The Erlang VM and distribution has been tweaked and polished to no end by Ericsson with the single-minded networking focus. It's a specialist language designed for high-concurrency, fault-tolerance, scalability and protocol processing. It's also a pure functional language and it doesn't even have the IO monad. Alternatively, you could say that the IO monad is hidden away in the built-in library functions. You still cannot mutate a variable, you are always copying. You can probably try to do all sorts of things with Erlang (look at Wings 3D) but I would probably just stick to high concurrency and networking apps. I would like to see if Yampa (and Frag) can be more transparently done in Erlang but that's because I find Yampa extremely complicated. Otherwise, as you try to do things that Erlang probably wasn't optimized for, you will face limitations just like I did with Haskell. There's not a lot of business logic to focus on in my app. I just had to add a thin serialization veneer (http://wagerlabs.com/erlang/ pickle.erl) that implements the pickler combinators and ... I was done. ZLib compression is built into the Erlang distribution and since the Erlang FFI makes me cringe I implemented the custom SSL encryption as a single-threaded UDP server in about 170 lines of C. This is how i was able to finish 30 times faster. Joel -- http://wagerlabs.com/

On Tue, Dec 27, 2005 at 09:18:54AM +0000, Joel Reymont wrote:
Try http://wagerlabs.com/timeleak.tgz. See the "Killer pickler combinators" thread as well.
Let's see if I understand correctly. There are 17605 messages in trace.dat. On my hardware the average message unpicking time is 0.0002s when you only have a single thread. So, it indeed seems that with 1000 threads it should be possible to process every message in under 4 seconds. Right now I can think of two reasons: - 1000 treads need much data in the help, which increases the cost of GC and with frequent context switches results in poor cache performance - the GHC's process scheduler is not as intelligent as Erlang's
I'm getting delays of 4s+ with just 100-200 bots reading from a file and even less than that in a networked environment. The more bots I run the higher the delays, to the point of frequent delays of 10s+. The kicker is that some packets come in at 4k compressed with Zlib but become something like 50k uncompressed and then expand to a list of 500+ records, etc.
One possible solution is to reduce the number of simultaneously running unpicklings/ picklings (I think someone already proposed it, was that Bulat?). It would reduce the amount of needed memory and should improve cache usage. But then, what will be the delay observed by the server? Anyway, I've made some tests, and I see the problem persists. I'll try to test with my unpickling code.
Each bot is given 5, 15 or 35 seconds to respond by the poker server and this is proving to be too little for my Haskell implementation.
This is per message, right?
My only requirement is that there be a _single_ spec for pickling and unpickling, i.e. no separate methods. The following is not acceptable to me ;-).
What if the spec was the data type itself? When I was dealing with a proprietary Intel-based binary format, I derived my picklers / unpicklers with TH from data type definitions. Of course, there were cases were the derived code would be wrong, and then I had to write the code myself, but it was for about 2-3 record types out of total 30. Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

On Dec 27, 2005, at 10:30 PM, Tomasz Zielonka wrote:
Let's see if I understand correctly. There are 17605 messages in trace.dat. On my hardware the average message unpicking time is 0.0002s when you only have a single thread. So, it indeed seems that with 1000 threads it should be possible to process every message in under 4 seconds.
I'm on a PowerBook G4 1.25Ghz with 1Gb of memory and lots of apps running. I have about 60-70Mb of physical memory free although other apps are barely consuming CPU time. This is what I get from the Erlang version. Columns are # of processes, average time per message, peak observed physical memory and peak observed virtual memory usage respectively. 1 - 2.34322e-5s, 10 - 3.91232e-5s, 18Mb, 50Mb 100 - 3.26753s, 70Mb, 100Mb, all 100 failed since alarms were set at 3 seconds. I just noticed that I'm unpickling all the packets whereas timeleak only looks at compressed commands and unpickles server info packets from those. I also made ~160Mb of physical memory available and decided to read some web articles while the tests are running (browser already running). Another run... 1 - 1.00657e-6s 10 - 1.10232e-6s 100 - 3.09583s, 55Mb, 90Mb, all 100 failed 1000 - 25s. All failed rather quickly. The issue could be that they are all stumbling at the same big packets at about the same time. So I inserted a random delay of between 100ms and 3.1s and got an average of 2.96161e-2s with 77 failures out of 100. On 1000 it's 957 failed with slightly more than 3s and 1.12748e-6 on the rest. The comparison is still a bit unfair as Haskell compiles to native code whereas I was running the above test using the regular bytecode VM. With native compilation enabled I get 1 - 1.00359e-6s 10 - 1.08691e-6s 100 - 6.19101e-3s with 87 out of 100 failed at about 3.5s. 100 - 1.12210e-6s and 0 failed on another run. The difference is in the random delays between bot starts, apparently. You are well off so long as bots don't hit compressed packets all at once. The big packets decompress into 50k and are a hassle to unpickle. Now here's the kicker... Using the read_ahead option when opening the file gives you a ~64K buffer. Another run... 10 - 1.06194e-6 100 - 1.05641e-6 1000 - 1.06799e-6 and 916 failed with time between 3s and 4s Increasing alarm time to 4s, using the native compiler with all optimizations (erlc +native +o3 +inline *erl) gives me 10 - 1.10848e-6s 100 - 1.24159e-6s, 0 failed 1000 - 1.02611e-6s, 923 failed
Right now I can think of two reasons: - 1000 treads need much data in the help, which increases the cost of GC and with frequent context switches results in poor cache performance - the GHC's process scheduler is not as intelligent as Erlang's
It's clear to me by now that the app is not language or compiler-bound.
One possible solution is to reduce the number of simultaneously running unpicklings/ picklings (I think someone already proposed it, was that Bulat?). It would reduce the amount of needed memory and should improve cache usage.
But then, what will be the delay observed by the server?
Right, you can't really do this as it will increase the overall delay.
Each bot is given 5, 15 or 35 seconds to respond by the poker server and this is proving to be too little for my Haskell implementation.
This is per message, right?
Per message, yes. I will code the networking part of my Erlang version and will report whether I got more/less timeouts than tha Haskell version.
What if the spec was the data type itself? When I was dealing with a proprietary Intel-based binary format, I derived my picklers / unpicklers with TH from data type definitions. Of course, there were cases were the derived code would be wrong, and then I had to write the code myself, but it was for about 2-3 record types out of total 30.
Wish I could do that. I don't know TH at all and any TH I got was due to the folks on #haskell and here providing it (thanks Cale!). Joel -- http://wagerlabs.com/

On Tue, Dec 27, 2005 at 07:00:12AM +0100, Tomasz Zielonka wrote:
I've just glanced at the code, and yes, it is a bit of a mess. I'll see if I can tidy it up. Perhaps you could help me?
I've just made the parser more generic, which helped to reduce code duplication. Maybe I will finally manage to do something with this code. Best regards Tomasz -- I am searching for a programmer who is good at least in some of [Haskell, ML, C++, Linux, FreeBSD, math] for work in Warsaw, Poland

On 27.12 07:00, Tomasz Zielonka wrote:
Some time ago I was playing with DNS too. I have a library that can construct and interpret DNS packets, but it's a bit incomplete right now. It reads packets as Strings, but it should be quite straightforward to make it read and interpret FastPackedStrings.
Nice, here is my shot at DNS - http://cs.helsinki.fi/u/ekarttun/haskell/hdnsd-20051227.tar.bz2 feel free to take bits if you are interested. The serialization/deserialization uses Ptrs. - Einar Karttunen

Hello Pupeno, Tuesday, December 27, 2005, 7:10:24 AM, you wrote: P> It seems I have found a hole in Haskell... :( lazy language can't contain holes, they are just yet unevaluated thunks :) P> I basically need a set of functions to read binary data out of a Handle (a P> higher lever of hGetBuf and hPutBuf). What I am doing is implementing a DNS P> server, so, my typical need is: P> First two bytes are an integer. P> The next bit is a boolean. P> So are the following three bits. P> Then I have 4 bits which is also an integer. it's my day :) main = do h <- openBinaryFile "test" WriteMode bh <- openBitAligned h putWord16 bh (101::Int) put_ bh True putBits bh 3 (2::Int) flushBits bh hClose h h <- openBinaryFile "test" ReadMode bh <- openBitAligned h a <- getWord16 bh :: IO Int b <- get bh :: IO Boolean c <- getBits bh 3 :: IO Int print (a,b,c) hClose h http://freearc.narod.ru/Binary.tar.gz but the better way is defining instance of Binary class: data DNS = DNS Int Bool Int deriving Show instance Binary DNS where put_ bh (DNS a b c) = do putWord16 bh a put_ bh b putBits bh 3 c get bh = do a <- getWord16 bh b <- get bh c <- getBits bh 3 return (DNS a b c) main = do h <- openBinaryFile "test" WriteMode bh <- openBitAligned h put_ bh (DNS 37 True 3) flushBits bh hClose h h <- openBinaryFile "test" ReadMode bh <- openBitAligned h dns <- get bh :: IO DNS print dns hClose h you must use `openBitAligned` to open bit-aligned stream over Handle, and use `flushBits` at the end of writing. there is much more features, ask me about what you need disclaimer: the library was written in last 3 weeks, and you will be (if you want) its first user :) -- Best regards, Bulat mailto:bulatz@HotPOP.com
participants (13)
-
Branimir Maksimovic
-
Brian McQueen
-
Bulat Ziganshin
-
dons@cse.unsw.edu.au
-
Einar Karttunen
-
Gour
-
Jeremy Shaw
-
Joel Reymont
-
Lennart Augustsson
-
Peter Simons
-
Pupeno
-
Sebastian Sylvan
-
Tomasz Zielonka