
Hello Duncan,

Saturday, July 15, 2006, 8:04:26 PM, you wrote:
can you test whether the implementation lines = split 0x0a is as fast as the existing (long) ones, for both Lazy and Strict ByteString?
It might actually be the other way around, that the split implementation could benefit from the work that went into the optimisation of the lines function. I spent quite some time trying to optimise the lines implementation, at least for the Lazy module. To get better performance it relies on the assumption that many lines fit into a chunk. That may not be true for uses of split in general. It's worth investigating.
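One detail worth keeping in mind for such a comparison: a bare split 0x0a is not quite a drop-in replacement for lines, because split keeps a trailing empty piece when the input ends with a newline, while lines does not. A minimal sketch of a split-based lines for the strict module follows; linesViaSplit and agreesWithLines are hypothetical names used only for illustration, not functions from the library.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as SC

-- Hypothetical helper: 'lines' expressed through 'split'.
-- 'S.split 0x0a' yields one extra empty piece when the input ends
-- with '\n', so that piece is dropped to match 'SC.lines'.
linesViaSplit :: S.ByteString -> [S.ByteString]
linesViaSplit s
  | S.null s  = []
  | otherwise =
      case S.split 0x0a s of
        ps | S.last s == 0x0a -> init ps
           | otherwise        -> ps

-- Hypothetical check: does the sketch agree with the library's 'lines'?
agreesWithLines :: S.ByteString -> Bool
agreesWithLines s = linesViaSplit s == SC.lines s
```

Whether this is as fast as the hand-written lines is exactly what a benchmark would have to show; the sketch only pins down the behaviour being compared.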
well, you know this problem much more deeply than i do, so i'm shutting up :) although i can say that strict ByteString should benefit from your implementation too (both for lines and split, for obvious reasons). imho, Lazy.split should just use (map P.split) and then join the pieces that were split between adjacent blocks
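The "map the strict split over the chunks, then join what straddles a boundary" idea could look roughly like the sketch below. It assumes the lazy ByteString is just a list of strict chunks (via toChunks); splitChunks and lazySplit are hypothetical names, and this is not the code in the fps repo.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.Word (Word8)

-- Hypothetical helper: split a list of strict chunks on a byte, gluing
-- together the pieces that straddle a chunk boundary.
splitChunks :: Word8 -> [S.ByteString] -> [S.ByteString]
splitChunks w = go []
  where
    -- 'pending' holds, in reverse order, the parts of the piece that is
    -- still open because no separator has been seen after it yet.
    go pending []
      | null pending = []
      | otherwise    = [S.concat (reverse pending)]
    go pending (c:cs) =
      case S.split w c of
        []     -> go pending cs          -- empty chunk, nothing to add
        [p]    -> go (p : pending) cs    -- no separator in this chunk
        (p:ps) -> S.concat (reverse (p : pending))  -- close the open piece
                    : init ps                       -- pieces inside the chunk
                    ++ go [last ps] cs              -- last piece stays open

-- Hypothetical lazy split built on top of the strict one.
lazySplit :: Word8 -> L.ByteString -> [L.ByteString]
lazySplit w = map L.fromStrict . splitChunks w . L.toChunks
```

Pieces that lie entirely inside one chunk are shared without copying; only the pieces that cross a boundary go through S.concat. A piece spanning many chunks is accumulated in memory before it is emitted, which is one reason this may not suit every use of split.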
Btw, you can run the benchmarks too, they are included in the fps repo.
also, isn't it faster to use the following implementation: isSpaceWord8 = (spacesFlagsArray!)?
Benchmark it and tell us which is faster.
can my laziness be enough justification? :)
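For anyone who does want to run that benchmark, the table-lookup variant might be sketched as follows. Only the names isSpaceWord8 and spacesFlagsArray come from the message above; the choice of an unboxed array and the set of bytes treated as whitespace (0x20, 0x09..0x0D, 0xA0) are assumptions made for illustration.

```haskell
import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Word (Word8)

-- Assumed definition: a 256-entry table with True at every byte that
-- counts as whitespace. The byte set below is an assumption.
spacesFlagsArray :: UArray Word8 Bool
spacesFlagsArray = listArray (0, 255) [isSpaceByte w | w <- [0 .. 255]]
  where
    isSpaceByte :: Word8 -> Bool
    isSpaceByte w = w == 0x20 || (w >= 0x09 && w <= 0x0D) || w == 0xA0

-- The lookup-based test proposed above: one bounds-checked array read.
isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 = (spacesFlagsArray !)
```

The lookup trades a handful of comparisons for a memory read plus a bounds check, so which version wins is not obvious without measuring.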
also, i propose to move getLine/getContents/putStr/interact/readFile-type functions into the .Char8 modules (both for strict and lazy bytestrings), because these functions are encoding-dependent and work with text (as opposed to hGet/hPut, which work with raw binary data blocks).
Yes, getLine and putStrLn are encoding-dependent (they know the encoding of '\n'). getContents, putStr, readFile, interact etc. are encoding-independent; they're just the same as hGet/hPut, working on binary data blocks. Indeed putStr = hPut stdout.
they all work with text files, so they are also encoding-dependent (translating CR+LF to LF on windows). putStr is the only exception, but it can be moved along for company :) this will make a clear distinction between functions using ByteString as a raw sequence of bytes (hGet/hPut) and functions using ByteString as a packed String representing text data
in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines' but it was impossible because the 'lines' function is defined only in the Lazy.Char8 module
Yes, that's the way it should be. And of course there is no need for hGetLines in the Lazy module, since it is just hGetContents >>= lines. In my opinion the hGetLines in the other module should be removed too, as it's just a special case of what the Lazy module does.
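Spelled out against the Lazy.Char8 module, that one-liner might look like the sketch below. hGetLines here is a local definition written for illustration, and countLines is a hypothetical usage helper; lines is a pure function, so the sketch uses fmap rather than a literal >>=.

```haskell
import qualified Data.ByteString.Lazy.Char8 as LC
import System.IO (Handle, IOMode (ReadMode), withFile)

-- Sketch: lazily read the whole handle and split it into lines.
hGetLines :: Handle -> IO [LC.ByteString]
hGetLines h = fmap LC.lines (LC.hGetContents h)

-- Hypothetical usage: count the lines of a file. The count is forced
-- with ($!) so the input is consumed before withFile closes the handle.
countLines :: FilePath -> IO Int
countLines path = withFile path ReadMode $ \h -> do
  ls <- hGetLines h
  return $! length ls
```

Because the contents are read lazily, the lines are produced as the chunks arrive, which is what makes a dedicated hGetLines redundant in the lazy case.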
it's also possible. but the situation where one ByteString implementation supports a particular function while another doesn't is imho not very good. the user should be able to switch between implementations w/o rewriting his entire program

btw, you may be interested to know that i implemented mmapBinaryFile in the Streams lib, based on the code from ByteString. it works on both Windows and Unix, using the universal mmap API i described in a letter to David Roundy

-- 
Best regards,
 Bulat mailto:Bulat.Ziganshin@gmail.com