List x ByteString x Lazy Bytestring

Hello Café, I've used Haskell and GHC to solve particular real life application. 4 tools were developed and their function is almost the same - they modify textual input according to patterns found in the text. Thus, it is something like a compiler, the result is also a text and it is not parsed to tokens as patterns appear on a different level. The tools differ in tasks and number of modifications performed, otherwise, in principal, they are very much similar. I used lists (Prelude, Data.List) to develop the tools. After successfully completing the development, I've started to optimize the code to make the tools faster. After modification of some algorithms (which dropped the processing time notably), I started to change data structures. I swapped lists with lazy bytestrings. Nevertheless, what an unpleasant surprise, the processing speed dropped down, significantly / more then 30% time needed). So my questions follow: - What kind of application is lazy bytestring suitable for? - Would it be worth using strict bytestring even if input files may be large? (They would fit in memory, but may consume whole) - If bytestring is not suitable for text manipulation, is there something faster than lists? - It would be nice to have native sort for lazy bytestring - would it be slower than pack $ Data.List.sort $ unpack ? - If bytestring is suitable for text manipulation could we have some hGetTextualContents which translates Windows EOL (CR+LF) to LF? I'm sorry if the answers are obvious. Could anyone point me to some article/pages I could read about? Thanks and regards, John -- http://www.fastmail.fm - A fast, anti-spam email service.

Hi,
- If bytestring is not suitable for text manipulation, is there something faster than lists? Have a look at the text package[1].
Cheers, Simon [1] http://hackage.haskell.org/package/text

However the performance issues seem odd: text is based on bytestring.
2011/12/5 Simon Hengel
Hi,
- If bytestring is not suitable for text manipulation, is there something faster than lists? Have a look at the text package[1].
Cheers, Simon
[1] http://hackage.haskell.org/package/text
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès
However the performance issues seem odd: text is based on bytestring.
This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString. -- Johan

Oh, sorry, my bad.
I misunderstood the dependency.
2011/12/5 Johan Tibell
On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès
wrote: However the performance issues seem odd: text is based on bytestring.
This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString.
-- Johan

I wonder what profiling tells you, you should identify where your
performance problems actually are before trying to optimise.
Some things that might help are using something like Blaze-Builder[1] to
construct your bytestrings for output. I'm hoping that they're sufficiently
lazy that you can lazily read in the input, and write output as you've made
it available. if you use the blaze-builder-enumerator package, you should
be able to get exactly what you want (but probably requires some minor
knowledge of iteratees).
Anyway, without seeing your code, we can't easily tell you what's wrong.
Cheers,
Alex
[1] http://hackage.haskell.org/package/blaze-builder
[2] http://hackage.haskell.org/package/blaze-builder-enumerator
On 6 December 2011 02:11, Yves Parès
Oh, sorry, my bad. I misunderstood the dependency.
2011/12/5 Johan Tibell
On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès
wrote: However the performance issues seem odd: text is based on bytestring.
This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString.
-- Johan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- -- Alex Mason

Thank you for your replies and provided links. Unfortunately, I cannot provide the code as my boss thinks it could provide some extra information - definitely not Haskellish, but other. Nevetheless, your contribution was helpful and I was able to tune the code to get even better perofmance. Moreover, the performance is not big issue - I can process almost one million of files from 40 to 120 seconds, where sizes of files vary from several hundreds of bytes to several hundred of megabytes. I just thought I could improve overall tool-chain performance (several such tools are used in a chain). Thank you very much, I'l try the blaze-builder on the smallest tool and if any interesting outcome apperas I'll let you know. Best regards, John On Tue, Dec 6, 2011, at 03:27 PM, Axman wrote: I wonder what profiling tells you, you should identify where your performance problems actually are before trying to optimise. Some things that might help are using something like Blaze-Builder[1] to construct your bytestrings for output. I'm hoping that they're sufficiently lazy that you can lazily read in the input, and write output as you've made it available. if you use the blaze-builder-enumerator package, you should be able to get exactly what you want (but probably requires some minor knowledge of iteratees). Anyway, without seeing your code, we can't easily tell you what's wrong. Cheers, Alex [1] [1]http://hackage.haskell.org/package/blaze-builder [2] [2]http://hackage.haskell.org/package/blaze-builder-enumerato r On 6 December 2011 02:11, Yves Parès <[3]limestrael@gmail.com> wrote: Oh, sorry, my bad. I misunderstood the dependency. 2011/12/5 Johan Tibell <[4]johan.tibell@gmail.com> On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès <[5]limestrael@gmail.com> wrote: However the performance issues seem odd: text is based on bytestring. This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString. -- Johan _______________________________________________ Haskell-Cafe mailing list [6]Haskell-Cafe@haskell.org [7]http://www.haskell.org/mailman/listinfo/haskell-cafe -- -- Alex Mason _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe References 1. http://hackage.haskell.org/package/blaze-builder 2. http://hackage.haskell.org/package/blaze-builder-enumerator 3. mailto:limestrael@gmail.com 4. mailto:johan.tibell@gmail.com 5. mailto:limestrael@gmail.com 6. mailto:Haskell-Cafe@haskell.org 7. http://www.haskell.org/mailman/listinfo/haskell-cafe -- http://www.fastmail.fm - The professional email service

On Monday 05 December 2011, 14:14:56, John Sneer wrote:
I've used Haskell and GHC to solve particular real life application. 4 tools were developed and their function is almost the same - they modify textual input according to patterns found in the text. Thus, it
Hmm, modification can be a problem for ByteStrings, since it entails copying. That could be worse for strict BytStrings than lazy, if in the lazy ByteString you can reuse many chunks.
is something like a compiler, the result is also a text and it is not parsed to tokens as patterns appear on a different level.
The tools differ in tasks and number of modifications performed, otherwise, in principal, they are very much similar.
I used lists (Prelude, Data.List) to develop the tools. After successfully completing the development, I've started to optimize the code to make the tools faster. After modification of some algorithms (which dropped the processing time notably), I started to change data structures. I swapped lists with lazy bytestrings. Nevertheless, what an unpleasant surprise, the processing speed dropped down, significantly / more then 30% time needed).
Two main possibilities: 1. your algorithm isn't suited for ByteStrings 2. you're doing it wrong The above indicates 1., but without a more detailed description and/or code, it's impossible to tell.
So my questions follow: - What kind of application is lazy bytestring suitable for?
Anything that involves examining large sequences of bytes (or ASCII [latin1/other single-byte encoding] text) basically sequentially (it's not good if you have to jump forwards and backwards a lot and far). Also some types of modification of such data.
- Would it be worth using strict bytestring even if input files may be large? (They would fit in memory, but may consume whole)
Probably not, see above. But see above.
- If bytestring is not suitable for text manipulation, is there something faster than lists?
text has already been mentioned, but again, there are types of manipulation it's not well-suited for and where a linked list may be superior.
- It would be nice to have native sort for lazy bytestring - would it be slower than pack $ Data.List.sort $ unpack ?
The natural sort for ByteStrings would be a counting sort, O(alphabet size + length), so for long ByteStrings, it should be significantly faster than pack . sort . unpack, but for short ones, it would be significantly slower.
- If bytestring is suitable for text manipulation could we have some hGetTextualContents which translates Windows EOL (CR+LF) to LF?
Doing such a transformation would be kind of against the purpose of ByteStrings, I think. Isn't the point of ByteStrings to get the raw bytes as efficiently as possible?

Hello!
I've used Haskell and GHC to solve particular real life application. 4 tools were developed and their function is almost the same - they modify textual input according to patterns found in the text. Thus, it
Hmm, modification can be a problem for ByteStrings, since it entails copying. That could be worse for strict BytStrings than lazy, if in the lazy ByteString you can reuse many chunks.
I understand now, that is probably the point.
Two main possibilities: 1. your algorithm isn't suited for ByteStrings 2. you're doing it wrong
The above indicates 1., but without a more detailed description and/or code, it's impossible to tell.
Yes, it seems that the (1) is the point, because I split and re-build the bytestream many times during processing.
So my questions follow: - What kind of application is lazy bytestring suitable for?
Anything that involves examining large sequences of bytes (or ASCII [latin1/other single-byte encoding] text) basically sequentially (it's not good if you have to jump forwards and backwards a lot and far). Also some types of modification of such data.
- Would it be worth using strict bytestring even if input files may be large? (They would fit in memory, but may consume whole)
Probably not, see above. But see above.
- If bytestring is not suitable for text manipulation, is there something faster than lists?
text has already been mentioned, but again, there are types of manipulation it's not well-suited for and where a linked list may be superior.
- It would be nice to have native sort for lazy bytestring - would it be slower than pack $ Data.List.sort $ unpack ?
The natural sort for ByteStrings would be a counting sort, O(alphabet size + length), so for long ByteStrings, it should be significantly faster than pack . sort . unpack, but for short ones, it would be significantly slower.
- If bytestring is suitable for text manipulation could we have some hGetTextualContents which translates Windows EOL (CR+LF) to LF?
Doing such a transformation would be kind of against the purpose of ByteStrings, I think. Isn't the point of ByteStrings to get the raw bytes as efficiently as possible?
OK, thank you very much for explanation. Best regards, John -- http://www.fastmail.fm - Email service worth paying for. Try it for free

Hi John,
I've used Haskell and GHC to solve particular real life application. 4 tools were developed and their function is almost the same - they modify textual input according to patterns found in the text. Thus, it
Hmm, modification can be a problem for ByteStrings, since it entails copying. That could be worse for strict BytStrings than lazy, if in the lazy ByteString you can reuse many chunks.
I understand now, that is probably the point.
Two main possibilities: 1. your algorithm isn't suited for ByteStrings 2. you're doing it wrong
The above indicates 1., but without a more detailed description and/or code, it's impossible to tell.
Yes, it seems that the (1) is the point, because I split and re-build the bytestream many times during processing.
For splitting it might be interesting to have a look at 'attoparsec' (http://hackage.haskell.org/package/attoparsec), which are parser combinators specialized to bytestring's. If the splitting still leaves large enough chunks of the original input intact (large enough being around > 128 bytes), then you might be able to achieve a benefit. To use strict and lazy bytestrings efficiently, it helps a lot to know their internal representation (a slice of a char[] array and lists of slices of char[] arrays) and to think about what the different operations cost on this datastructure. (The source code of the library is actually not that hard to understand.) An obviously very expensive operation is concatenation (guaranteed copying for strict bytestrings at least a list traversal for lazy bytestrings). The blaze-builder library provides a type that supports O(1) concatenation of sequences of bytes and an efficient conversion to a lazy bytestring that has a large average chunk size. Large chunks are important to amortize the work spent on the boundary with the speedup gained due to the compact and cache efficient representation. If you keep using lists, but need a lot of concatenation using difference lists (http://hackage.haskell.org/package/dlist) also allows a O(1) concatenation. Note that the "cost" of having this O(1) concatenation is that the resulting list cannot be inspected, as long as it is in a form that supports O(1) concatenation. The same holds for the lazy bytestring builder provided by blaze-builder. Note that I'd be very curious, if you could achieve any further performance improvement using blaze-builder...which is no surprise, as I'm its author :-) Note also that the API has been cleaned up and will be published with the next release of the bytestring library. So just ignore the whole 'write' stuff. I should have separated it. good luck and best regards, Simon
participants (7)
-
Axman
-
Daniel Fischer
-
Johan Tibell
-
John Sneer
-
Simon Hengel
-
Simon Meier
-
Yves Parès