Recently I found http://hackage.haskell.org/package/store which I had missed during my previous investigation. It is based on the same principles and runs at the same speed as my prototype, and it is ready to use. It seems I just reinvented what FP Complete had already done :)

Wed, Mar 20, 2019 at 00:58, Stanislav Chernichkin <schernichkin@gmail.com>:

I was investigating Haskell's binary serialization libraries and found that many of them (binary, cereal and protocol-buffers) have a pretty similar design. They can consume lazy bytestrings and either produce a result, report an error, or produce a partial result indicating that more input data is required. Such a design leads to quite complex readers and often ends up producing closures at runtime. If you take a close look at binary's microbenchmarks (binary is in general a bit faster than cereal, so I focused on binary) you will discover that it triggers GC even during simple tasks, such as reading a bunch of Word64s and summing the results. And it is relatively slow: binary's performance is limited to hundreds of megabytes per second even in simple scenarios, while the actual memory bandwidth should allow us to process gigabytes per second. So, I decided to try a different approach.
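
For context, the incremental interface these libraries expose looks roughly like binary's Decoder from Data.Binary.Get (simplified here; the Partial constructor is exactly the closure-producing continuation I mean):

    import qualified Data.ByteString as B
    import Data.Int (Int64)

    -- Roughly the shape of binary's incremental decoder (simplified):
    -- it either fails, asks for more input via a continuation closure,
    -- or finishes with the unconsumed input and the offset reached.
    data Decoder a
      = Fail B.ByteString Int64 String
      | Partial (Maybe B.ByteString -> Decoder a)
      | Done B.ByteString Int64 a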

 

My approach is based on two assumptions, so if they do not suit your case it obviously will not work for you.

 

First – when dealing with binary data we always know exactly how many bytes we want to read. This is because we either read a fixed-size primitive type or length-prefixed variable-size data (possibly not literally "prefixed", but you always know the expected message length from somewhere).

 

Second – chunks of data are contiguous, i.e. we are mostly working with strict bytestrings. Even when you are processing a chunked stream of data with an unknown number of messages (such as the Mesos RecordIO format), you should always be able to allocate a contiguous memory buffer for a single entity you want to deserialize.

 

If that is OK for you, let's go on. The first thing I created was a so-called static reader. Static – because the reader knows at compile time how many bytes it wants to consume. Such readers are suitable for reading combinations of primitive values. Here is the implementation: https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/src/Lev/Reader/Static.hs The static reader is somewhat monadic (you can use do-syntax with the RebindableSyntax extension, like here: https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/bench/Bench/Lev/Reader/Static.hs ), but it is actually an indexed monad and is unfortunately incompatible with traditional monads. Offsets and sizes are calculated at the type level at compile time, and according to the microbenchmarks the static reader does not introduce any overhead compared to a hand-written reader using the low-level indexInt64OffAddr# function ( https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/bench/Bench/Handwritten.hs ).
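
To show the idea in isolation (this is a minimal sketch I made up for this email, not the actual Lev code; Static, bindS, pureS and word64 are illustrative names), a reader can carry its start and end byte offsets as type-level naturals, so composition is pure type-level arithmetic and the runtime action only peeks at constant offsets:

    {-# LANGUAGE DataKinds, KindSignatures, ScopedTypeVariables, TypeOperators #-}

    import Data.Proxy (Proxy (..))
    import Data.Word (Word64, Word8)
    import Foreign.Ptr (Ptr)
    import Foreign.Storable (peekByteOff)
    import GHC.TypeLits (KnownNat, Nat, natVal, type (+))

    -- A reader that starts at type-level offset i and ends at offset j.
    -- All offset arithmetic happens in the types; at runtime the action
    -- only peeks at constant offsets from the base pointer.
    newtype Static (i :: Nat) (j :: Nat) a =
      Static { runStatic :: Ptr Word8 -> IO a }

    -- Indexed bind: the first reader ends exactly where the second begins.
    bindS :: Static i j a -> (a -> Static j k b) -> Static i k b
    bindS (Static f) g = Static $ \p -> f p >>= \a -> runStatic (g a) p

    pureS :: a -> Static i i a
    pureS a = Static $ \_ -> pure a

    -- Read a Word64 at the current type-level offset, advancing it by 8.
    word64 :: forall i. KnownNat i => Static i (i + 8) Word64
    word64 = Static $ \p -> peekByteOff p (fromIntegral (natVal (Proxy :: Proxy i)))

    -- Reads exactly 16 bytes, and the type says so.
    twoWords :: Static 0 16 (Word64, Word64)
    twoWords = word64 `bindS` \a -> word64 `bindS` \b -> pureS (a, b)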

 

The next thing was a dynamic reader. Dynamic readers are composed from static readers; they are fully monadic and allow you to read length-prefixed data structures (you read the length with a wrapped static reader and then perform a normal monadic bind to supply a reader that reads the value). Here is the implementation: https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/src/Lev/Reader/Dynamic.hs
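
Sketched in the same spirit (again an illustrative minimal version, not the Lev implementation; Dynamic, takeBytes, word32be and lengthPrefixed are made-up names), a dynamic reader is an ordinary offset-passing monad over a strict bytestring, and length-prefixed reading is just a bind:

    import Data.Bits (shiftL, (.|.))
    import Data.Word (Word32)
    import qualified Data.ByteString as B

    -- A dynamic reader: a plain offset-passing monad over a strict ByteString.
    newtype Dynamic a =
      Dynamic { runDynamic :: B.ByteString -> Int -> Either String (a, Int) }

    instance Functor Dynamic where
      fmap f (Dynamic r) = Dynamic $ \bs o -> fmap (\(a, o') -> (f a, o')) (r bs o)

    instance Applicative Dynamic where
      pure a = Dynamic $ \_ o -> Right (a, o)
      Dynamic rf <*> Dynamic ra = Dynamic $ \bs o -> do
        (f, o')  <- rf bs o
        (a, o'') <- ra bs o'
        pure (f a, o'')

    instance Monad Dynamic where
      Dynamic r >>= k = Dynamic $ \bs o -> do
        (a, o') <- r bs o
        runDynamic (k a) bs o'

    -- Take exactly n bytes, with a single bounds check.
    takeBytes :: Int -> Dynamic B.ByteString
    takeBytes n = Dynamic $ \bs o ->
      if o + n <= B.length bs
        then Right (B.take n (B.drop o bs), o + n)
        else Left "not enough input"

    -- Read a 32-bit big-endian word.
    word32be :: Dynamic Word32
    word32be = do
      b <- takeBytes 4
      pure $ B.foldl' (\acc w -> (acc `shiftL` 8) .|. fromIntegral w) 0 b

    -- Length-prefixed payload: the length tells us exactly how much to take next.
    lengthPrefixed :: Dynamic B.ByteString
    lengthPrefixed = word32be >>= \n -> takeBytes (fromIntegral n)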

 

I've discovered that typeclasses are surprisingly fast. So it's completely OK to have something like "(Cursor c) => … Reader", where the cursor contains the functions for overflow checking and memory-buffer progression. I was not able to achieve the same performance when I passed buffer callbacks explicitly, possibly because of inlining or specialization magic. So I've created ByteStringCursor ( https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/src/Lev/Reader/ByteString.hs ), which allows reading strict bytestrings. I suppose that cursors add flexibility: you may read a strict bytestring, or you may define a custom cursor for reading custom memory buffers.
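
Roughly what I mean by a cursor (a hypothetical sketch; Cursor, BSCursor, remaining, indexByte and advance are invented for this example, not the library's actual API):

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU
    import Data.Word (Word8)

    -- A cursor knows how to bounds-check and step through some buffer type.
    class Cursor c where
      remaining :: c -> Int          -- bytes still available
      indexByte :: c -> Int -> Word8 -- read one byte at an offset (unchecked)
      advance   :: Int -> c -> c     -- move the cursor forward

    -- A cursor over a strict ByteString: the buffer plus a current offset.
    data BSCursor = BSCursor !B.ByteString !Int

    instance Cursor BSCursor where
      remaining (BSCursor bs o)   = B.length bs - o
      indexByte (BSCursor bs o) i = BU.unsafeIndex bs (o + i)
      advance n (BSCursor bs o)   = BSCursor bs (o + n)

    -- A reader polymorphic in the cursor; the bounds check happens here,
    -- the raw read is delegated to the cursor.
    readWord8 :: Cursor c => c -> Either String (Word8, c)
    readWord8 c
      | remaining c >= 1 = Right (indexByte c 0, advance 1 c)
      | otherwise        = Left "not enough input"
    {-# INLINE readWord8 #-}

With INLINE/SPECIALIZE the dictionary disappears and you get essentially the same code as a hand-written bounds check, while still being able to swap in a different buffer representation.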

 

Finally, I've written some microbenchmarks for my cases and also ported one microbenchmark from binary. My implementation ran 10x-15x faster than binary on strict bytestrings. Moreover, it does not trigger GC during the run at all! You may find the project and all the benchmarks here: https://github.com/PROTEINE-INSAIDERS/lev-tolstoy

 

So, what's the purpose of my message? It's quite unlikely that I'll release my code on Hackage, because I'm a hobbyist Haskell developer doing it all in my free time. But before Mr. Putin closes the Internet in Russia, I'd like to demonstrate that it's possible to write a very fast binary deserializer, taking into account a few assumptions that seem quite reasonable to me. So feel free to borrow the ideas, or share your own ideas if you know how it could be implemented better.


--
Sincerely, Stanislav Chernichkin.

