RE: utf8 strings: memory optimization and case-ignoring comparison

On 14 December 2005 20:35, Bulat Ziganshin wrote:
> i use utf8-packed strings in my program and have 2 questions about them:
> 1. i need a function to do case-ignoring comparison of such strings. stricmp is not appropriate because it doesn't know about utf8. can the existing Unicode support in Data.Char be used for this, or can the appropriate support be added?
you should be able to use toUpper/toLower from Data.Char in GHC 6.4.1.
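A minimal sketch of that suggestion, assuming the string is available as a list of UTF-8 bytes. The names decodeUtf8 and equalIgnoringCase are hypothetical and not part of any library discussed here; the decoder is simplified (it does not reject overlong encodings), and mapping each character through toLower is simple per-character case mapping rather than full Unicode case folding.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, toLower)
import Data.Word (Word8)

-- Decode UTF-8 bytes into Chars. Malformed input becomes U+FFFD.
-- Simplified: overlong encodings are not rejected.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b:bs)
  | b < 0x80           = chr (fromIntegral b) : decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise          = '\xFFFD' : decodeUtf8 bs
  where
    -- Consume n continuation bytes and combine them with the lead bits.
    multi :: Int -> Int -> String
    multi n lead =
      let (cont, rest) = splitAt n bs
      in if length cont == n && all isCont cont
           then safeChr (foldl addBits lead cont) : decodeUtf8 rest
           else '\xFFFD' : decodeUtf8 bs

    addBits :: Int -> Word8 -> Int
    addBits acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)

    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80

    -- Guard against surrogates and out-of-range code points.
    safeChr :: Int -> Char
    safeChr cp
      | cp > 0x10FFFF                = '\xFFFD'
      | cp >= 0xD800 && cp <= 0xDFFF = '\xFFFD'
      | otherwise                    = chr cp

-- Compare two UTF-8 byte strings, ignoring case, via Data.Char.toLower.
equalIgnoringCase :: [Word8] -> [Word8] -> Bool
equalIgnoringCase a b =
  map toLower (decodeUtf8 a) == map toLower (decodeUtf8 b)
```

Decoding on every comparison is slow for large inputs; this only illustrates that Data.Char can do the case mapping once the bytes are decoded.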
> 2. what is the most memory-efficient representation for such strings? now i use John Meacham's library (http://repetae.net/john/repos/jhc/PackedString.hs) which declares:
>     newtype PackedString = PS (UArray Int Word8)
> but this uses two Ints just to hold index bounds:
>     data UArray i e = UArray !i !i ByteArray#
I don't know why an extra 8/16 bytes per string is that worrying - if you have so many small strings perhaps you should be sharing them via a hash table?
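One way to picture the sharing idea is an intern table: each distinct string is stored once, and later occurrences reuse the stored copy. The sketch below uses Data.Map from the containers package as a stand-in for a real hash table, and the names InternTable, intern and internAll are hypothetical.

```haskell
import qualified Data.Map.Strict as Map
import Data.Word (Word8)

-- Maps each distinct byte string to the single copy that should be shared.
type InternTable = Map.Map [Word8] [Word8]

-- Return the shared copy of a string, adding it to the table if it is new.
intern :: [Word8] -> InternTable -> ([Word8], InternTable)
intern s table =
  case Map.lookup s table of
    Just shared -> (shared, table)              -- reuse the existing copy
    Nothing     -> (s, Map.insert s s table)    -- first occurrence: store it

-- Intern a whole list, threading the table through from left to right.
internAll :: [[Word8]] -> ([[Word8]], InternTable)
internAll = go Map.empty
  where
    go table []     = ([], table)
    go table (x:xs) =
      let (x', table')   = intern x table
          (rest, table'') = go table' xs
      in (x' : rest, table'')
```

Sharing only pays off when duplicates are common; for a list where most strings are unique, the table itself adds overhead.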
> i want to use just a memory ptr and put NUL at the end of the array (my strings never contain NUL chars). but what type must i use for this ptr? ByteArray/ByteArray#, ForeignPtr, StablePtr, Ptr?? and which function must i use to quickly allocate the memory i need? my packed strings will only be unpacked and passed to "unsafe" C functions: stricmp, strcpy, strcat; i plan to not use any other operations
ForeignPtr and mallocForeignPtr are the way to go these days. In GHC 6.6 these will be much faster than before.

Cheers,
Simon
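A rough sketch of the ForeignPtr approach, assuming a NUL-terminated buffer allocated with mallocForeignPtrBytes and handed to an "unsafe" C call. The type PackedUtf8 and the functions packBytes and equalPacked are hypothetical names. The standard strcmp is used here in place of stricmp, which is not part of ISO C, so this comparison is case-sensitive.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.C.Types (CInt(..))
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Marshal.Array (pokeArray)
import Foreign.Ptr (castPtr)
import Foreign.Storable (pokeByteOff)

-- A packed string is just a pointer; the length is implied by the NUL byte.
newtype PackedUtf8 = PackedUtf8 (ForeignPtr Word8)

-- Copy the bytes into freshly allocated memory and append a terminating NUL.
-- Assumes the input itself contains no NUL bytes.
packBytes :: [Word8] -> IO PackedUtf8
packBytes bytes = do
  let n = length bytes
  fp <- mallocForeignPtrBytes (n + 1)
  withForeignPtr fp $ \p -> do
    pokeArray p bytes
    pokeByteOff p n (0 :: Word8)
  return (PackedUtf8 fp)

-- Standard C strcmp, imported as an unsafe (non-reentrant) call.
foreign import ccall unsafe "string.h strcmp"
  c_strcmp :: CString -> CString -> IO CInt

-- Byte-wise equality of two packed strings via the C function.
equalPacked :: PackedUtf8 -> PackedUtf8 -> IO Bool
equalPacked (PackedUtf8 fa) (PackedUtf8 fb) =
  withForeignPtr fa $ \pa ->
    withForeignPtr fb $ \pb -> do
      r <- c_strcmp (castPtr pa) (castPtr pb)
      return (r == 0)
```

withForeignPtr keeps the buffer alive for the duration of the C call, so the pointer must not be used outside the callback.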

Hello Simon, Thursday, December 15, 2005, 12:44:35 PM, you wrote:
>> 2. what is the most memory-efficient representation for such strings?
>> data UArray i e = UArray !i !i ByteArray#
SM> I don't know why an extra 8/16 bytes per string is that worrying - if
SM> you have so many small strings perhaps you should be sharing them via a
SM> hash table?

i also use a hash table in another part of the program, but in this list (the basenames of files on disk) 70% of the strings are unique

SM> ForeignPtr and mallocForeignPtr are the way to go these days. In GHC
SM> 6.6 these will be much faster than before.

can you please say what the representation of ForeignPtr is in 6.6? GHC 6.4.1 uses PinnedByteArray. as far as i understand, there are only two alternatives: either use pinned arrays and have access to C functions that need the address of a memory area, or use an unpinned ByteArray and do all processing in Haskell?

how do pinned byte arrays work with the garbage collector? are they allocated in special memory blocks so they don't alternate with movable data? can the memory used by these arrays be deallocated and then used again?

-- 
Best regards,
 Bulat mailto:bulatz@HotPOP.com
participants (2)
- Bulat Ziganshin
- Simon Marlow