
On Sat, Nov 20, 2021 at 12:49:08PM +0100, Andreas Klebinger wrote:
For the assembly I opened a ticket: https://gitlab.haskell.org/ghc/ghc/-/issues/20714
Thanks, much appreciated. Understood re redundant join points, though in the non-toy context the redundnat point code is noticeably larger. join { exit4 :: Addr# -> Word# -> State# RealWorld -> Maybe (Int64, ByteString) exit4 (ww4 :: Addr#) (ww5 :: Word#) (ipv :: State# RealWorld) = case touch# dt1 ipv of { __DEFAULT -> let { dt3 :: Int# dt3 = minusAddr# ww4 dt } in case ==# dt3 dt2 of { __DEFAULT -> jump exit1 ww2 wild dt dt1 dt2 cs dt3 ww5; 1# -> jump $wconsume cs (orI# ww2 dt3) ww5 } } } in join { exit5 :: Addr# -> Word# -> State# RealWorld -> Maybe (Int64, ByteString) exit5 (ww4 :: Addr#) (ww5 :: Word#) (w1 :: State# RealWorld) = case touch# dt1 w1 of { __DEFAULT -> let { dt3 :: Int# dt3 = minusAddr# ww4 dt } in case ==# dt3 dt2 of { __DEFAULT -> jump exit1 ww2 wild dt dt1 dt2 cs dt3 ww5; 1# -> jump $wconsume cs (orI# ww2 dt3) ww5 } } } in FWIW, these don't appear to be deduplicated, both result from the same conditional: `acc < q || acc == q && d < 5`. I need some way to make this compute a single boolean value without forking the continuation. There's a another source of code bloat that I'd like to run by you... In the WIP code for Lazy ByteString 'readInt', I started with: readInt !q !r = \ !s -> consume s False 0 where -- All done consume s@Empty !valid !acc = if valid then convert acc s else Nothing -- skip empty chunk consume (Chunk (BI.BS _ 0) cs) !valid !acc -- Recurse = consume cs valid acc -- process non-empty chunk consume s@(Chunk c@(BI.BS _ !len) cs) !valid !acc = case _digits q r c acc of Result used acc' | used <= 0 -- No more digits present -> if valid then convert acc' s else Nothing | used < len -- valid input not entirely digits -> let !c' = BU.unsafeDrop used c in convert acc' $ Chunk c' cs | otherwise -- try to read more digits -- Recurse -> consume cs True acc' Overflow -> Nothing Now _digits is the I/O loop I shared before, and the calling code gets inlined into that recursive loop with various join points. But the loop gets forked into multiple copies which are compiled separately, because there are two different recursive calls into "consume" that got compiled into separate "joinrec { ... }". So I tried instead: readInt !q !r = \ !s -> consume s False 0 where -- All done consume s@Empty !valid !acc = if valid then convert acc s else Nothing consume s@(Chunk c@(BI.BS _ !len) cs) !valid !acc = case _digits q r c acc of Result used acc' | used == len -- try to read more digits -- Recurse -> consume cs (valid || used > 0) acc' | used > 0 -- valid input not entirely digits -> let !c' = BU.unsafeDrop used c in convert acc' $ Chunk c' cs | otherwise -- No more digits present -> if valid then convert acc' s else Nothing Overflow -> Nothing But was slightly surprised to find even more duplication (3 copies instead of tw) of the I/O loop, because in the call: consume cs (valid || used > 0) acc' the boolean argument got floated out, giving: case valid of True -> consume cs True acc' _ -> case used > 0 of True -> consume cs True acc' _ -> consume cs False acc' and each of these then generates essentially the same code. To get the code to be emitted just once, I had to switch from a Bool "valid" to a bitwise "valid": readInt !q !r = \ !s -> consume s 0 0 where -- All done consume s@Empty !valid !acc = if valid /= 0 then convert acc s else Nothing consume s@(Chunk c@(BI.BS _ !len) cs) !valid !acc = case _digits q r c acc of Result used acc' | used == len -- try to read more digits -- Recurse -> consume cs (valid .|. used) acc' | used > 0 -- valid input not entirely digits -> let !c' = BU.unsafeDrop used c in convert acc' $ Chunk c' cs | otherwise -- No more digits present -> if valid /= 0 then convert acc' s else Nothing Overflow -> Nothing Is there some way for GHC to figure out to not float out such cheap computations? The 'Result' constructor is strict, so there's no cost to evaluating `used > 0`, and cloning the entire computation is I think the more unfortunate choice... Adding redundant BangPatterns on `Result !used !acc'` didn't make a difference. Switching to bitwise logical "or" finally produces just one copy of the loop. -- Viktor.