
#5218: Add unpackCStringLen# to create Strings from string literals
-------------------------------------+-------------------------------------
        Reporter:  tibbe             |                Owner:  thoughtpolice
            Type:  feature request   |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.0.3
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:  Unknown/Multiple
 Type of failure:  Runtime           |            Test Case:
                   performance bug   |
      Blocked By:                    |             Blocking:
 Related Tickets:  #5877 #10064      |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by jscholl):

 I tried implementing a {{{String#}}} type which would carry the length as
 an {{{Int#}}} at its beginning, together with two functions to extract the
 length and the address of the string literal. However, it quickly got a
 little out of hand:

 - {{{unpackCString#}}} etc. had to be adapted, breaking backwards
   compatibility. To avoid this, I tried to create wrapper functions such as
   {{{unpackCStringLit#}}}, which would extract the address and call the
   original {{{unpackCString#}}} function.
 - I could not work out how to adapt the rewrite rules dealing with strings
   without duplicating them for the {{{Addr#}}} and {{{String#}}} versions.
   I also could not figure out when {{{unpackCStringLit#}}} should inline to
   avoid the overhead of the new address computation.
 - It took a while to find all (most?) of the places (library, some
   hardcoded types in {{{Id}}}s, a place in the type checker, generating
   record selector errors) where types were wired in, especially for
   exceptions like {{{absentError}}} and {{{recSelError}}}.
 - Implementing a new {{{String#}}} also raised the question of whether
   {{{"foo"##}}} should be the corresponding literal for it. However,
   adding it all the way from the parser to the backend seemed quite
   complex, so I tried a different approach.

 Instead of creating a new type {{{String#}}}, I rewrote
 {{{unpackCStringLit#}}} to have the type {{{Addr# -> Int# -> [Char]}}}.
 It would then just throw its second argument away and inline in some
 phase. However, this still meant duplicating rewrite rules, which did not
 seem like an ideal solution.

 My next idea was to push the length information into an ignored argument
 of a function giving us the address:
 {{{cStringLitAddr# :: Addr# -> Int# -> Addr#}}}. Its result could just be
 passed as an argument to {{{unpackCString#}}}, so I was quite confident
 that it would remain backwards compatible and that no extra rewrite rules
 were needed to maintain the current behavior. Extra rules are needed to
 actually use the length information, e.g. to construct bytestrings, but
 this seems like an acceptable cost.

 However, I did not anticipate the let/app invariant, so my original design
 of {{{unpackCString# (cStringLitAddr# "foo"# 3#)}}} caused lint to warn
 me. After reading up on the invariant, I decided that
 {{{cStringLitAddr#}}}, applied to two literals, should be okay for
 speculation, as it has no side effects and cannot fail. However, while the
 generated core was now accepted, it was useless, as it would not match the
 rewrite rules written by a user. Their rules would be translated to
 something like
 {{{case cStringLitAddr# addr len of { tmp -> unpackCString# tmp } }}}.
 Thus, I decided to generate matching core and removed my fix that made
 {{{cStringLitAddr#}}} okay for speculation. In the current version, it is
 possible to create a bytestring in O(1) with rewrite rules.
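
 To make the shape of this design concrete, here is a minimal sketch. It is
 not the actual patch: {{{cStringLitAddr#}}} is written as an ordinary
 Haskell function rather than something wired into the compiler, and
 {{{litLength}}} and the rule name are hypothetical stand-ins for a consumer
 such as a bytestring constructor. Whether the rule fires depends on the GHC
 version and optimisation level; on compilers that enforce the let/app
 invariant, GHC may instead warn that the rule's left-hand side is too
 complicated to desugar. The program prints the same result either way.

```haskell
{-# LANGUAGE MagicHash #-}
module Main (main) where

import GHC.CString (unpackCString#)
import GHC.Exts (Addr#, Int (I#), Int#)

-- Assumption: stand-in for the proposed primitive. It returns the address
-- unchanged and ignores the length, so existing consumers of Addr# keep
-- working while the length stays visible to rewrite rules.
cStringLitAddr# :: Addr# -> Int# -> Addr#
cStringLitAddr# addr _len = addr
{-# NOINLINE cStringLitAddr# #-}

-- Hypothetical consumer that would like to know the length up front.
litLength :: String -> Int
litLength = length
{-# NOINLINE [1] litLength #-}

-- Extra rule that uses the recorded length instead of walking the list.
{-# RULES
"litLength/cStringLitAddr#" forall a n.
    litLength (unpackCString# (cStringLitAddr# a n)) = I# n
  #-}

main :: IO ()
main = do
  -- Backwards-compatible path: the length argument is simply dropped.
  putStrLn (unpackCString# (cStringLitAddr# "foo"# 3#))
  -- Length-aware path: O(1) if the rule fires, O(n) via length otherwise.
  print (litLength (unpackCString# (cStringLitAddr# "foo"# 3#)))
```

 Compiled with or without {{{-O}}}, this prints {{{foo}}} followed by
 {{{3}}}; only the cost of computing the length differs.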

 However, I have broken the general list fusion (or at least the built-in
 rules {{{match_eq_string}}} and {{{match_append_lit}}}), as the case
 expression gets in the way between {{{foldr}}} and {{{build}}}, causing
 them not to be optimized out. Maybe this is a missed opportunity in
 general: if I have {{{foo (case something of { tmp -> bar tmp }) }}},
 maybe it should be possible to apply the rule {{{foo (bar x) = baz x}}}
 anyway, leading to {{{case something of { tmp -> baz tmp } }}}, iff
 {{{something}}} is safe to evaluate with regard to time, space and
 exceptions (this is okay-for-speculation, right?).

 So right now I am stuck. Maybe it is okay to break backwards compatibility
 and just change the types of {{{unpackCString#}}} etc. to include an
 additional (ignored) {{{Int#}}} argument, pushing some #ifs onto everyone
 using {{{unpackCString#}}} (I think this is basically text, bytestring and
 ghc itself) for the next few years. However, {{{unpackCString#}}} is
 called in some additional places, namely when constructing modules for
 {{{Typeable}}}. Right now the types only carry the {{{Addr#}}} needed to
 call it, but they would then also need the length information (otherwise
 there would be the risk that something rewrites the call and gets a bogus
 length, if one just passes {{{0#}}} as the length).

 On the other hand, maybe it would be a good thing to actually pass the
 length along to {{{unpackCString#}}}, making it mandatory, as this would
 avoid the need to null-terminate the strings, allowing {{{'\NUL'}}}
 characters to be encoded with one byte instead of two (which may be of
 interest for bytestring). Then again, I could imagine this breaking things
 in subtle ways if strings are no longer null-terminated...

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/5218#comment:41
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler