Re[2]: [GHC] #710: library reorganisation

Hello GHC, {- answering message in ghc-bugs -} Thursday, April 27, 2006, 2:06:07 PM, you wrote:
#710: library reorganisation
Some libraries we want to add:
* [http://www.cse.unsw.edu.au/~dons/fps.html FastPackedStrings] (replace {{{Data.PackedString}}})
Sorry for the repetition, but the ByteString library in its current state still doesn't replace PackedString in functionality, because it doesn't support the full Unicode range of chars.
--
Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
Sorry for the repetition, but the ByteString library in its current state still doesn't replace PackedString in functionality, because it doesn't support the full Unicode range of chars.
What would be required for it to replace PackedString? (If that is a goal?) If I understand correctly, PS is an array of Word32, and ByteString is (obviously) an array of Word8. Would it be sufficient if there was a 'Char' interface supporting all of Unicode (the obvious candidate encoding being UTF-8), or must it support UCS-2 and UCS-4 directly?

-k
--
If I haven't seen further, it is by standing in the footprints of giants
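To make the question concrete, here is a minimal sketch of what a 'Char' interface over Word8 storage could look like if UTF-8 were the encoding. The names `encodeChar`, `decode`, `packUtf8` and `unpackUtf8` are hypothetical and not part of FPS or any existing module, and the decoder assumes well-formed input rather than validating it.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode one Char as 1 to 4 UTF-8 bytes (hypothetical helper).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- leading payload bits, shifted down to fit the lead byte
    hi s   = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx carrying six payload bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Decode a UTF-8 byte sequence. Assumes well-formed input;
-- malformed bytes are not detected in this sketch.
decode :: [Word8] -> String
decode [] = []
decode (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decode bs
  | b < 0xE0  = multi 1 (b .&. 0x1F)
  | b < 0xF0  = multi 2 (b .&. 0x0F)
  | otherwise = multi 3 (b .&. 0x07)
  where
    multi k lead =
      let (cs, rest) = splitAt k bs
          step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
      in chr (foldl step (fromIntegral lead) cs) : decode rest

-- | Char-level interface on top of a plain Word8 ByteString.
packUtf8 :: String -> B.ByteString
packUtf8 = B.pack . concatMap encodeChar

unpackUtf8 :: B.ByteString -> String
unpackUtf8 = decode . B.unpack
```

With this layout, Latin-1 text beyond ASCII costs two bytes per character and indexing by character is no longer constant-time, which is the kind of trade-off the choice between UTF-8 and UCS-2/UCS-4 involves.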

Hello Ketil, Friday, April 28, 2006, 10:52:17 AM, you wrote:
Bulat Ziganshin
writes:
Sorry for the repetition, but the ByteString library in its current state still doesn't replace PackedString in functionality, because it doesn't support the full Unicode range of chars.
What would be required for it to replace PackedString? (If that is a goal?) If I understand correctly, PS is an array of Word32, and ByteString is (obviously) an array of Word8. Would it be sufficient if there was a 'Char' interface supporting all of Unicode (the obvious candidate encoding being UTF-8), or must it support UCS-2 and UCS-4 directly?
IMHO, because PackedString is abstract anyway and doesn't provide any way to see its internal representation, any implementation that supports the full Unicode range would be enough. It may be UCS-4, UTF-8, or even ByteString+String (selecting the representation depending on the presence of non-Latin-1 characters in the string).

Support for specific encodings, such as UTF-16 or UCS-4, would be great for special purposes (as I said, UCS-4 allows the fastest processing with support for the full Unicode range, while UTF-16 is great for working directly with Windows filenames), but that is another question. I just want to point out that omitting PackedString may create problems for the people who use it.

One more suggestion about standardizing modules:

* ByteString gives access to Word8 strings and provides the ability to work with memory regions directly.
* PackedString should just allow working with some compact String representation without exposing any information about its internal representation, so it can be implemented using any of the approaches above.
* PackedString.UTF8 should work with memory areas that contain UTF-8-packed strings and give direct access to such memory areas.
* PackedString.UTF16 should work with memory areas that contain UTF-16-packed strings and, again, give direct access to such memory areas.
* The same for PackedString.UCS4...

So the PackedString implementation should just import one of these modules and then re-export only the functions that are representation-independent. If someone just wants fast and compact strings, he should import this module; if he wants some specific representation, he should import the representation-specific module.

--
Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
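As a rough illustration of the "ByteString+String" option mentioned in this message, the sketch below picks the representation at pack time depending on whether every character fits in Latin-1. The type and function names are hypothetical; in a real module the constructors would be left out of the export list, so that only representation-independent functions such as `pack`, `unpack` and `lengthPS` are visible to callers.

```haskell
import qualified Data.ByteString.Char8 as BC

-- | Hypothetical abstract string type with two internal representations.
data PackedString
  = Latin1 BC.ByteString  -- one byte per character, all chars <= '\255'
  | Wide String           -- fallback when any char is outside Latin-1
  deriving (Eq, Show)

-- | Choose the compact representation whenever it is lossless.
pack :: String -> PackedString
pack s
  | all (<= '\255') s = Latin1 (BC.pack s)
  | otherwise         = Wide s

-- | Recover the original String from either representation.
unpack :: PackedString -> String
unpack (Latin1 b) = BC.unpack b
unpack (Wide s)   = s

-- | A representation-independent operation: length in characters.
lengthPS :: PackedString -> Int
lengthPS (Latin1 b) = BC.length b
lengthPS (Wide s)   = length s
```

The fallback here is a plain String, which is not compact; a UCS-4 or UTF-8 buffer could be substituted behind the same interface without changing client code, which is the point of keeping the type abstract.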

Bulat Ziganshin
IMHO, because PackedString is abstract anyway and doesn't provide any way to see its internal representation, any implementation that supports the full Unicode range would be enough.
Perhaps I'm misrepresenting FPS here, but from my POV, the representation is very much the issue. I see the typical use for FPS to be a case where you have some data (files, network buffers, whatever) which is, essentially, a string of bytes. The Char interface(s) to FPS are, in theory, encoding-agnostic, but in practice they will be limited by the underlying encoding. IMO, that is okay, as long as the interesting encodings are supported.

Note that ByteString.UTF8 is *not* going to be a replacement for the other encoding-specific modules, since that would mean you would have to do an (expensive) conversion of non-UTF-8 data. The current scheme allows you to work with a universal Unicode interface (based on Char) while keeping the data in its 'native' representation.

The question is how to extend this to multi-byte fixed-width encodings (UCS-2 and UCS-4/UTF-32) and variable-width encodings (UTF-8, UTF-16, Shift-JIS, and why not Quoted-Printable?). I feel confident it can be done, but it is likely to involve some policy decisions and trade-offs.

-k

PS: I implemented another single-byte encoding, Windows-1252. This builds everything from a charset table and, while currently not too efficient, should make it very easy to add other single-byte encodings. As usual: darcs get http://www.ii.uib.no/~ketil/src/fps-i18n
--
If I haven't seen further, it is by standing in the footprints of giants
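The table-driven approach to single-byte encodings described in the PS could look roughly like the following. This is not the fps-i18n code; the names are hypothetical, and the table is only a three-entry excerpt of Windows-1252, with every byte not listed falling back to its Latin-1 meaning.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- | Excerpt of Windows-1252 where it differs from Latin-1.
-- A real table would list every byte in the 0x80-0x9F range.
cp1252Excerpt :: [(Word8, Char)]
cp1252Excerpt =
  [ (0x80, '\x20AC')  -- EURO SIGN
  , (0x85, '\x2026')  -- HORIZONTAL ELLIPSIS
  , (0x99, '\x2122')  -- TRADE MARK SIGN
  ]

-- | Decode one byte: use the table entry if present, else Latin-1.
decodeWith :: [(Word8, Char)] -> Word8 -> Char
decodeWith table b = fromMaybe (chr (fromIntegral b)) (lookup b table)

-- | Encode one Char, or Nothing if the charset cannot represent it.
encodeWith :: [(Word8, Char)] -> Char -> Maybe Word8
encodeWith table c =
  case lookup c [ (u, b) | (b, u) <- table ] of
    Just b -> Just b
    Nothing
      -- Latin-1 fallback, unless the table reassigned that byte
      | n < 0x100 && fromIntegral n `notElem` map fst table -> Just (fromIntegral n)
      | otherwise -> Nothing
  where
    n = ord c

-- | Char-level view of bytes stored in their native single-byte encoding.
unpackWith :: [(Word8, Char)] -> B.ByteString -> String
unpackWith table = map (decodeWith table) . B.unpack

packWith :: [(Word8, Char)] -> String -> Maybe B.ByteString
packWith table = fmap B.pack . mapM (encodeWith table)
```

Adding another single-byte charset then means supplying a different table. The linear `lookup` is the "not too efficient" part; a 256-entry array for decoding would be the obvious replacement.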

Bulat Ziganshin wrote:
Sorry for the repetition, but the ByteString library in its current state still doesn't replace PackedString in functionality, because it doesn't support the full Unicode range of chars.
That's true, but frankly the current Data.PackedString is subsumed by [Char], so there's no good reason to keep it. I'll stop saying that Data.ByteString is replacing Data.PackedString to avoid confusion. Would you like to write a Data.ByteString.UTF8 to fill the gap? Cheers, Simon

Hello Simon, Friday, April 28, 2006, 1:34:05 PM, you wrote:
Bulat Ziganshin wrote:
Sorry for the repetition, but the ByteString library in its current state still doesn't replace PackedString in functionality, because it doesn't support the full Unicode range of chars.
That's true, but frankly the current Data.PackedString is subsumed by [Char], so there's no good reason to keep it. I'll stop saying that Data.ByteString is replacing Data.PackedString to avoid confusion.
This module may be used by someone, though... As I already said, you could just as well omit any other module that is also not replaced by ByteString. What is the difference?
Would you like to write a Data.ByteString.UTF8 to fill the gap?
No, at least not now. BTW, I have my own simple private module called UTF8Z, so named because it doesn't store the length but puts a 0 byte at the end of the string. That module omits all functionality except pack/unpack. On the other hand, its representation is the most compact among all the modules I've seen.
--
Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
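The UTF8Z module itself is private and not shown in the thread, so the following is only a guess at the layout it describes: UTF-8 bytes followed by a single 0 byte, with no stored length. The sketch leans on `GHC.Foreign` from base for the encoding work. The names `UTF8Z`, `packZ`, `unpackZ`, `byteLengthZ` and `freeZ` are hypothetical, and the input is assumed to contain no '\0' character, since that would end the string early.

```haskell
import Foreign.C.String (CString)
import Foreign.Marshal.Alloc (free)
import Foreign.Marshal.Array (lengthArray0)
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (utf8)

-- | A pointer to UTF-8 bytes terminated by a 0 byte; no length field.
newtype UTF8Z = UTF8Z CString

-- | Encode as UTF-8 into freshly malloc'd, 0-terminated memory.
packZ :: String -> IO UTF8Z
packZ s = UTF8Z <$> GF.newCString utf8 s

-- | Read up to the terminating 0 byte and decode as UTF-8.
unpackZ :: UTF8Z -> IO String
unpackZ (UTF8Z p) = GF.peekCString utf8 p

-- | The length is not stored, so it has to be found by scanning.
byteLengthZ :: UTF8Z -> IO Int
byteLengthZ (UTF8Z p) = lengthArray0 0 p

-- | Release the buffer allocated by packZ.
freeZ :: UTF8Z -> IO ()
freeZ (UTF8Z p) = free p
```

The per-string overhead is one terminator byte plus the pointer, which is where the compactness comes from; the cost is that any operation needing the length must walk the whole string.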
participants (3): Bulat Ziganshin, Ketil Malde, Simon Marlow