[GHC] #11312: GHC inlining primitive string literals can affect program output

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Keywords: | Operating System: Unknown/Multiple Architecture: | Type of failure: Incorrect result Unknown/Multiple | at runtime Test Case: | Blocked By: Blocking: | Related Tickets: #11292 Differential Rev(s): | Wiki Page: -------------------------------------+------------------------------------- First noted in #11292, this program, when compiled with `-O1` or higher: {{{#!hs {-# LANGUAGE MagicHash #-} module Main (main) where import GHC.Exts (Addr#, isTrue#) import GHC.Prim (eqAddr#) data A = A { runA :: Addr# } a :: A a = A "a"# main :: IO () main = print (isTrue# (eqAddr# (runA a) (runA a))) }}} will result in the following after inlining: {{{ Main.main2 = case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.eqAddr# "a"# "a"#) of _ [Occ=Dead] { GHC.Types.False -> GHC.Show.shows26; GHC.Types.True -> GHC.Show.shows24 } }}} As a result, there are two of the same string constant with different addresses, which causes `eqAddr#` to return `False`. If compiled without optimizations, `"a"#` is not inlined, and as a result, `eqAddr#` returns `True`. Two questions: 1. Is this okay semantics-wise? Or is this a necessary risk when working with primitive string literals, and should programmers judiciously use `{-# NOINLINE #-}` with them? 2. Is this okay from a code duplication standpoint? As Reid Barton noted in #11292, `"a"#` is duplicated due to inlining. In this example, not much is duplicated, but if it were a longer string constant, that could result in a noticeable increase in the object file size. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Anyone who uses `eqAddr#` to compare string literals is asking for trouble. Really it should be harder to do. `eqAddr#` is a respectable function, but it is NOT respectable that `"foo"# :: Addr#`. We should have `String#`, and operations to compare the strings. Meanwhile, see #8472 for a better approach to sharing string literals. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by RyanGlScott): To collect some of the comments you've made in #8472 and #11292, is this what you want to be done? 1. Introduce a new `String#` type, and make primitive string literals their values, e.g., `"foo"# :: String#`. 2. Allow top-level `String#`s, e.g., {{{#!hs a::Addr# = "foo"# }}} but not top-level `String#` computations, which would entail: * Modify the test `isUnLiftedType ty` in `SetLevels.lvlMFE`, which stops unlifted things getting floated to top level. * Similarly `Simplify.bindingOk`. * Make `CmmLint` check the new invariant. * The STG->Cmm code generator would need to generate some suitable `CmmData` stuff. 3. Introduce primitive operations that manipulate `String#`s (e.g., a Haskell equivalent of `strcmp`) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by simonpj): * cc: ekmett, core-libraries-committee (added) Comment: Let's separate two things: * Top-level unboxed string literals: #8472 * Not using `Addr#` for string literals: this ticket. Here's my summary for this ticket, after talking to Simon M. * It's plain wrong to use `Addr#` as the type of a string literal. If we do so, there is no reliable way to compute equality for {{{ data T = MkT Addr# deriving( Eq ) }}} Since the `Addr#` might come from `malloc` or something, it must compare using equality on `Addr#`. But then there is no guarantee that `MkT "foo"# == MkT "foo"#`. * So we need a new type for unlifted string literals, say `String#`. It could be primitive, and that's what I'll assume for now. * Of course the underlying representation will be the same as `Addr#`. But there should be no operation `get :: String# -> Addr#` (except maybe in the IO monad), else it'd possible that `get "foo"#` might be not-equal to `get "foo"#`. * What operations do we need on `String#`? Presumably at least {{{ eqString# :: String# -> String# -> Int# -- Like eqChar# cmpString# :: String# -> String# -> Int# -- 3-way compare lenString# :: String# -> Int# -- Number of chars indexString# :: String# -> Int# -> Char# -- Get the ith char }}} * NB: I'm deliberately not saying that the string is null-terminated. That's be up to the implementation of `String#`, provided it offered the above operations. A better representation might be a record of a length and a blob of bytes. * Could `String#` simply be a `ByteArray#`? {{{ type String# = ByteArray# }}} After all, `ByteArray#` already has primops `sizeOfByteArray#` and `indexCharArray#`. We'd just need a way to have a statically-allocated `ByteArray#`, but that would be an excellent thing anyway. e.g. I believe that Happy mis-uses literal strings to allow it to build a statically- allocated array. Avoiding yet another primitive type would be a relief. See also #5218 and #9577 * I'm not sure about how Unicode plays with all of this. * This would be a potentially breaking change for any code using unboxed string literals. I'm copying the Core Libraries Committee I'd love someone to take this up. there -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by ekmett): From a CLC perspective, I'd consider anything involving unboxed string literals to fall under the purview of the GHC developers. It is a very compiler-specific detail that doesn't leak into much user visible code, so whatever you think is cleanest should be good enough for us. As for the lack of an operation for `String# -> Addr#` that gets a bit messier. We already have {{{#!hs byteArrayContents# :: ByteArray# -> Addr# }}} replete with the caveat that it should only be used with pinned byte arrays and with the suggestion that `String# = ByteArray#` these are effectively pre-pinned byte array literals. The existence of that primitive wasn't an issue until now, but becomes one in the presence of unboxed string literals. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by xnyhps): Fyi, the patch that I've been working on for #9577 already fixes this on Linux: the patch marks string literals as "mergeable" asm sections, so the assembler merges them and makes the addresses identical. I don't mean to say that introducing `String#` isn't an improvement. I'll try to have a look. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by xnyhps): I hacked around with this a bit, these are my findings: * `ByteArray#` is represented as `StgArrBytes`, which includes a 1 word header and 1 word to indicate the size and is word aligned. This may be very convenient for things that want to use static `ByteArray#`s, but for usual string literals it sounds quite wasteful: on x86-64 it would require 24 bytes of padding to store a one-character string, instead of 2 for a zero-terminated string. (The alignment changes of #9577 would conflict with this, but it should still be possible to mark the sections as mergeable.) Considering that, and the existence of `byteArrayContents#`, I think it would be preferable to introduce a new primitive type `String#` that is used ''only'' for string literals. * Whichever different type `""#` gets, bootstrapping the compiler becomes tricky: the template for both Alex and Happy defines `""#` literals. What makes it worse is that this means stage 1 and stage 2 need a different template. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): See #12585 for another report of this bug. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by hvr): See #12757 for yet another instance/report, this time affecting pre- release GHC 8.0.2 snapshots -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): And #12757 shows that having literal strings be of type `Addr#` can lead to seg-faults. Yikes. We really should fix this. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by duncan): From the point of view of the bytestring package, what we would like is an O(1) way to convert from a constant/literal string to a pinned `ByteArray#` since we can build a `ByteString` from one of those. You may say why not just convert to a raw `Addr#` given that `ByteString` uses `ForeignPtr` which can use pointers. Well, long term we'd like to eliminate the use of `ForeignPtr` and use only pinned or unpinned `ByteArray#`s, and similarly in the long term `Text` will switch to UTF8 and it already uses `ByteArray` not `ForeignPtr`. So long term, to have O(1) `Text` or `ByteString` literals we'd need to be able to convert to `ByteArray#`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by simonpj): Fair enough, but there are then three separate matters here 1. Top-level unboxed string literals: #8472 2. Disentangling `Addr#` and `String#` along the lines of comment:3 (this ticket) 3. Some kind of support for `ByteArray#` literals, as you ask. Perhaps (2) and (3) are related, beause perhaps a `String#` can ''be'' a static `ByteArray#`? Simon -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292, #5218 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * related: #11292 => #11292, #5218 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: strings Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292, #5218 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * keywords: => strings -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#11312: GHC inlining primitive string literals can affect program output -------------------------------------+------------------------------------- Reporter: RyanGlScott | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: Component: Compiler | Version: 7.10.3 Resolution: | Keywords: strings Operating System: Unknown/Multiple | Architecture: Type of failure: Incorrect result | Unknown/Multiple at runtime | Test Case: Blocked By: | Blocking: Related Tickets: #11292, #5218 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by sjakobi): * cc: sjakobi (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/11312#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC