
On Thu 2008-05-29 18:45, Chad Scherrer wrote:
Jed Brown writes:
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)?
One problem you might run into doing it this way is if a wide character is split between two different arrays. In that case you have to do some post-processing to put the pieces back together. It would be more efficient, I think, if you could force a given alignment when reading in the lazy bytestring. But there's not a way to do that, is there?
Unless you are reading UTF-32, you won't know what alignment you want until you get there. If I remember correctly, the default block size is nicely aligned so that in practice you shouldn't have to worry about a chunk ending with weird alignment. However, such alignment issues shouldn't affect you unless you are using the internal interface. If you want fast indexing, you have to parse one character at a time anyway so you won't gain anything by unsafe casting (or memcpy) into your data structure.

Jed
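
To illustrate the point about the public interface hiding chunk boundaries, here is a minimal sketch, assuming the input is UTF-8. The names decodeChar, decodeAll and splitExample are made up for this example and are not part of any library, and the decoder skips the validation a real one would need (overlong forms, surrogates). It only uses L.uncons, so a character whose bytes straddle two chunks is decoded without any special handling.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)

-- Hypothetical helper: decode one UTF-8 character from the front of a
-- lazy ByteString. Returns Nothing on an empty, truncated, or malformed
-- sequence. Overlong encodings and surrogates are NOT rejected.
decodeChar :: L.ByteString -> Maybe (Char, L.ByteString)
decodeChar bs = do
  (b0, rest) <- L.uncons bs
  step (fromIntegral b0) rest
  where
    -- Dispatch on the lead byte to find how many continuation bytes follow.
    step :: Int -> L.ByteString -> Maybe (Char, L.ByteString)
    step lead rest
      | lead < 0x80           = Just (chr lead, rest)
      | lead .&. 0xE0 == 0xC0 = continue 1 (lead .&. 0x1F) rest
      | lead .&. 0xF0 == 0xE0 = continue 2 (lead .&. 0x0F) rest
      | lead .&. 0xF8 == 0xF0 = continue 3 (lead .&. 0x07) rest
      | otherwise             = Nothing

    -- Consume n continuation bytes, accumulating six payload bits from each.
    -- L.uncons crosses chunk boundaries transparently.
    continue :: Int -> Int -> L.ByteString -> Maybe (Char, L.ByteString)
    continue 0 acc r
      | acc <= 0x10FFFF = Just (chr acc, r)
      | otherwise       = Nothing
    continue n acc r = do
      (b, r') <- L.uncons r
      let c = fromIntegral b :: Int
      if c .&. 0xC0 == 0x80
        then continue (n - 1) ((acc `shiftL` 6) .|. (c .&. 0x3F)) r'
        else Nothing

-- Hypothetical helper: decode the whole input one character at a time.
decodeAll :: L.ByteString -> Maybe String
decodeAll bs
  | L.null bs = Just []
  | otherwise = do
      (c, rest) <- decodeChar bs
      fmap (c :) (decodeAll rest)

-- "h", e-acute (U+00E9, bytes C3 A9) and "!", with the two bytes of the
-- accented character deliberately placed in different chunks.
splitExample :: L.ByteString
splitExample = L.fromChunks [S.pack [0x68, 0xC3], S.pack [0xA9, 0x21]]
```

Loaded in GHCi, decodeAll splitExample should give the same three characters as decoding the unsplit bytes, even though L.toChunks splitExample shows the accented character cut in half. The split only becomes visible if you work on the strict chunks directly, which is the internal-interface case mentioned above.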