
Hi all, I'm working on a program right now that will involve embedding some static files inside my Haskell program as bytestrings. I've done this in the past with file-embed[1]. In this case, I have a strange requirement: I need to be able to modify the embedded data after the compiler has run. In general, I've gotten the code to work (and will soon be following up with a blog post explaining). But in this process, I ran into something I found interesting. If I run this program: {-# LANGUAGE TemplateHaskell #-} import Data.ByteString.Unsafe (unsafePackAddressLen) import Language.Haskell.TH.Syntax import qualified Data.ByteString as S main = do fromAddr <- unsafePackAddressLen 7 $(return $ LitE $ StringPrimL "123\0\&456") print fromAddr let fromStr = S.pack $ map (toEnum . fromEnum) $(return $ LitE $ StringL "123\0\&456") print fromStr I get the result: "123\192\128\&45" "123\NUL456" My understanding is that GHC is automatically escaping the non-ASCII characters in the strings. When I use StringL, GHC also automatically unescapes the bytes, so everything works as expected. In the case of StringPrimL, there is no point at which the unescaping occurs. I haven't been able to find the rules on how this escaping functions. Can anyone point it out to me? Also, for my case, it would be very useful to be able to bypass the escaping. I can easily do so with a post-process on the executable to embed raw bytes, but I'm wondering if there's some way to convince StringPrimL not to do any escaping. Thanks, Michael [1] http://hackage.haskell.org/package/file-embed

Michael Snoyman wrote:
main = do fromAddr <- unsafePackAddressLen 7 $(return $ LitE $ StringPrimL "123\0\&456") print fromAddr let fromStr = S.pack $ map (toEnum . fromEnum) $(return $ LitE $ StringL "123\0\&456") print fromStr
I get the result:
"123\192\128\&45" "123\NUL456"
Well, the haddocks for StringPrimL say: A primitive C-style string, type Addr# You obviously can't have a null byte in the middle of a C-style string. So GHC is replacing it with an invalid UTF-8 representation of a null byte, the best it can do under the circumstances. Then you just get those bytes back when you read them as a byte string. Yitz

On Sun, May 29, 2011 at 4:06 AM, Yitzchak Gale
Michael Snoyman wrote:
main = do fromAddr <- unsafePackAddressLen 7 $(return $ LitE $ StringPrimL "123\0\&456") print fromAddr let fromStr = S.pack $ map (toEnum . fromEnum) $(return $ LitE $ StringL "123\0\&456") print fromStr
I get the result:
"123\192\128\&45" "123\NUL456"
Well, the haddocks for StringPrimL say: A primitive C-style string, type Addr#
You obviously can't have a null byte in the middle of a C-style string. So GHC is replacing it with an invalid UTF-8 representation of a null byte, the best it can do under the circumstances. Then you just get those bytes back when you read them as a byte string.
That makes sense, thanks. I hadn't noticed that it was following UTF-8 escaping rules. Michael
participants (2)
-
Michael Snoyman
-
Yitzchak Gale