primitive (byte) string literal with length?

Is there any GHC syntax for constructing a primitive string literal with a known (not hand-coded) byte count? With `"some bytes"#` I get just the `Addr#` pointer, but not the size.

If there's nothing available, would it be reasonable to introduce a new syntax? Perhaps:

    "some bytes"## :: (# Addr#, Int# #)

-- Viktor.

Hi,
You can use `cstringLength#`, which has constant-folding rules for literals. That's what we use in GHC to build FastString literals.

On Tue, Aug 24, 2021 at 08:48:53AM +0200, Sylvain Henry wrote:

> You can use cstringLength#, which has constant-folding rules for literals. That's what we use in GHC to build FastString literals.

Sadly, that does not work when the primitive octet string contains internal NUL bytes.

    λ> :set -package ghc-prim
    λ> :set -XMagicHash
    λ> import GHC.CString
    λ> import GHC.Int
    λ>
    λ> I# (cstringLength# "foobar\xa0"#)
    7
    λ> I# (cstringLength# "foo\0bar\xa0"#)
    3

-- Viktor.

On Tue, Aug 24, 2021 at 09:03:30AM -0400, Viktor Dukhovni wrote:

I originally wrote:

> Is there any GHC syntax for constructing a primitive string literal with a known (not hand coded) byte count? With `"some bytes"#` I get just the `Addr#` pointer, but not the size.
>
> If there's nothing available, would it be reasonable to introduce a new syntax? Perhaps:
>
>     "some bytes"## :: (# Addr#, Int# #)

But neglected to mention that I knew about `cstringLength#`, but found it wanting, because it does not support octet-strings with embedded NUL characters:

> Sadly, that does not work when the primitive octet string contains internal NUL bytes.
>
>     λ> :set -package ghc-prim
>     λ> :set -XMagicHash
>     λ> import GHC.CString
>     λ> import GHC.Int
>     λ>
>     λ> I# (cstringLength# "foobar\xa0"#)
>     7
>     λ> I# (cstringLength# "foo\0bar\xa0"#)
>     3

If there isn't some other extant work-around, any feedback on my proposal of a new syntax for a primitive unboxed (address, length) pair:

    "some bytes"## :: (# Addr#, Int# #)

-- Viktor.
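The hand-coded length that the question wants to avoid can be illustrated with the bytestring package, which the thread itself does not mention. This is a minimal sketch, assuming that `unsafePackAddress` measures a literal only up to its first NUL (as `cstringLength#` does), while `unsafePackAddressLen` trusts a byte count written by the programmer. The names `nulTerminated` and `handCounted` are made up for this example.

```haskell
{-# LANGUAGE MagicHash #-}

import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackAddress, unsafePackAddressLen)

-- The length is found by scanning for the terminating NUL, so the
-- embedded "\0" cuts the value short after "foo".
nulTerminated :: IO B.ByteString
nulTerminated = unsafePackAddress "foo\0bar\xa0"#

-- The length is written by hand (8 bytes: f o o NUL b a r 0xA0).
-- Nothing checks that the number matches the literal.
handCounted :: IO B.ByteString
handCounted = unsafePackAddressLen 8 "foo\0bar\xa0"#
```

Here `B.length <$> nulTerminated` yields 3 and `B.length <$> handCounted` yields 8. A literal form that carried its own length would make the hand-written 8 unnecessary.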

There are bytearray literal proposals [1,2]. My older proposal [2] had the idea that primitive string literals could also generate ByteArray# and (# Int#, Addr# #), but because it also proposed changing how literal Haskell Strings are compiled, the proposal stalled.

The newer proposal [1] is tagged as "needs revision". It doesn't include (# Int#, Addr# #), but those are easy to get from ByteArray#, which has negligible overhead. I haven't followed the discussion, so I'm not sure what syntax it actually proposes (the description and the proposal text differ) or what revisions are needed.

I'm cc-ing Andrew, he knows better :)

- Oleg

[1] https://github.com/ghc-proposals/ghc-proposals/pull/292
[2] https://github.com/ghc-proposals/ghc-proposals/pull/135

On 25.8.2021 18.31, Viktor Dukhovni wrote:

> [...]
> If there isn't some other extant work-around, any feedback on my proposal of a new syntax for a primitive unboxed (address, length) pair:
>
>     "some bytes"## :: (# Addr#, Int# #)

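The remark that a size and address pair is easy to get from a ByteArray# can be sketched with the existing `sizeofByteArray#` and `byteArrayContents#` primops. This is only an illustration, not code from either proposal: it assumes the array is pinned (otherwise the address may be invalidated when the garbage collector moves the array) and that the caller keeps the array alive while the address is in use. The `Bytes` wrapper and the helpers `pinnedFromList`, `fill`, `sizeAndAddr` and `byteCount` are hypothetical names.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.IO (IO (..))
import GHC.Word (Word8 (..))

-- Boxed wrapper so that a ByteArray# can be returned from IO.
data Bytes = Bytes ByteArray#

-- Write the bytes of a list into consecutive slots of a mutable array.
fill :: MutableByteArray# s -> Int# -> [Word8] -> State# s -> State# s
fill _ _ [] s = s
fill mba i (W8# w : rest) s =
  fill mba (i +# 1#) rest (writeWord8Array# mba i w s)

-- Stand-in for a ByteArray# literal: copy a list into a pinned array.
pinnedFromList :: [Word8] -> IO Bytes
pinnedFromList ws = IO $ \s0 ->
  case length ws of
    I# n -> case newPinnedByteArray# n s0 of
      (# s1, mba #) -> case unsafeFreezeByteArray# mba (fill mba 0# ws s1) of
        (# s2, ba #) -> (# s2, Bytes ba #)

-- The (# Int#, Addr# #) pair discussed in the thread.
-- Assumption: the array is pinned and outlives every use of the address.
sizeAndAddr :: ByteArray# -> (# Int#, Addr# #)
sizeAndAddr ba = (# sizeofByteArray# ba, byteArrayContents# ba #)

-- Boxed view of the size component, convenient for checking results.
byteCount :: Bytes -> Int
byteCount (Bytes ba) = case sizeAndAddr ba of
  (# n, _ #) -> I# n
```

Because the size is stored with the array, embedded NUL bytes do not affect `byteCount`.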
On Wed, Aug 25, 2021 at 07:05:58PM +0300, Oleg Grenrus wrote:

> The newer proposal [1] is tagged as "needs revision". It doesn't include (# Int#, Addr# #), but those are easy to get from ByteArray# which has negligible overhead.
> [...]
> [1] https://github.com/ghc-proposals/ghc-proposals/pull/292

Yes, ByteArray# literals would work just as well for my needs. The one thing that's missing, from the proposed variants:

    Rather than adding new syntax, this proposal leverages an existing GHC extension: QuasiQuotes. Rather than using TemplateHaskell, these quasiquoters would be built in to the compiler. Here are some examples of ByteArray# literals under this scheme:

        [octets|fe01bce8|] -- ByteArray# (four bytes)
        [utf8|Araña|]      -- ByteArray# (UTF-8)
        [utf16|Araña|]     -- ByteArray# (UTF-16, native endian)
        [utf16le|Araña|]   -- ByteArray# (UTF-16, little endian)
        [utf16be|Araña|]   -- ByteArray# (UTF-16, big endian)

is a syntax for octet-strings that does not force hex encoding of every byte, thus something along the lines of:

    [octetstr|foo%A0bar|] -- ByteArray# (seven bytes)

The "%hh" hex octet could be "\hh" or "\xhh", ... whatever is deemed sufficiently natural/readable (perhaps "foo\xA0\&bar" for consistency with Haskell strings?). The "\xhh" form would be familiar to Python users:

    >>> x = b'foo\xA0bar'
    >>> len(x)
    7
    >>> x[3]
    160

So I support the proposal: even though quasi-quoters are bulkier than "somebytes"##, they have the advantage of supporting multiple variant formats. I might be tempted to use "octets" for the non-hex form with "%" or other escapes, and "hexstr" (or similar) for the hex form.

-- Viktor.
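To make the "%hh" idea concrete, here is a rough sketch of how such an octet-string body might be decoded into bytes. It is not the proposal's quasiquoter: the name `decodeOctets`, the choice of "%" as the escape character, and the rule that only ASCII characters may appear unescaped are all assumptions made for illustration.

```haskell
import Data.Char (digitToInt, isHexDigit, ord)
import Data.Word (Word8)

-- Decode a body such as "foo%A0bar": "%hh" is one byte given as two hex
-- digits, any other ASCII character stands for itself. Returns Nothing
-- on a malformed escape or a non-ASCII character.
decodeOctets :: String -> Maybe [Word8]
decodeOctets [] = Just []
decodeOctets ('%' : h : l : rest)
  | isHexDigit h && isHexDigit l =
      (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> decodeOctets rest
decodeOctets ('%' : _) = Nothing -- truncated or non-hex escape
decodeOctets (c : rest)
  | ord c < 0x80 = (fromIntegral (ord c) :) <$> decodeOctets rest
  | otherwise = Nothing -- assumption: non-ASCII bytes must be escaped
```

Under these assumptions `decodeOctets "foo%A0bar"` produces seven bytes, with 160 at index 3, matching the Python session above. A literal "%" would have to be written as "%25".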
participants (3)
- Oleg Grenrus
- Sylvain Henry
- Viktor Dukhovni