
#16052: Core optimizations for memset on a small range
-------------------------------------+-------------------------------------
           Reporter:  andrewthad     |               Owner:  (none)
               Type:  task           |              Status:  new
           Priority:  normal         |           Milestone:
          Component:  Compiler       |             Version:  8.6.3
           Keywords:                 |    Operating System:  Unknown/Multiple
       Architecture:                 |     Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |          Blocked By:
           Blocking:                 |     Related Tickets:
 Differential Rev(s):                |           Wiki Page:
-------------------------------------+-------------------------------------
I've been doing some API bindings lately that require zeroing out memory
before poking values into the appropriate places. Sometimes these are small
data structures. For instance, on Linux, the internet socket struct
`sockaddr_in` is 16 bytes. Here's an example (not involving `sockaddr_in`)
of the kind of situation that arises:

{{{
{-# language MagicHash #-}
{-# language UnboxedTuples #-}

module FillArray
  ( fill
  ) where

import GHC.Exts
import GHC.IO

data ByteArray = ByteArray ByteArray#

fill :: IO ByteArray
fill = IO $ \s0 -> case newByteArray# 24# s0 of
  (# s1, m #) -> case setByteArray# m 0# 24# 0# s1 of
    s2 -> case writeWord8Array# m 4# 14## s2 of
      s3 -> case writeWord8Array# m 5# 15## s3 of
        s4 -> case unsafeFreezeByteArray# m s4 of
          (# s5, r #) -> (# s5, ByteArray r #)
}}}

This `fill` function allocates a 24-byte array, sets everything to zero, and
then writes the numbers 14 and 15 to elements 4 and 5 respectively. With
`-O2`, here's the relevant part of the Core we get:

{{{
fill1
fill1 = \ s0_a140 ->
  case newByteArray# 24# s0_a140 of { (# ipv_s16i, ipv1_s16j #) ->
  case setByteArray# ipv1_s16j 0# 24# 0# ipv_s16i of s2_a143 { __DEFAULT ->
  case writeWord8Array# ipv1_s16j 4# 14## s2_a143 of s3_a144 { __DEFAULT ->
  case writeWord8Array# ipv1_s16j 5# 15## s3_a144 of s4_a145 { __DEFAULT ->
  case unsafeFreezeByteArray# ipv1_s16j s4_a145 of { (# ipv2_s16p, ipv3_s16q #) ->
  (# ipv2_s16p, ByteArray ipv3_s16q #)
  } } } } }
}}}

And here's the relevant assembly:

{{{
fill1_info:
_c1kL:
        addq $56,%r12
        cmpq 856(%r13),%r12
        ja _c1kP
_c1kO:
        movq $stg_ARR_WORDS_info,-48(%r12)
        movq $24,-40(%r12)
        leaq -48(%r12),%rax
        subq $8,%rsp
        leaq 16(%rax),%rdi
        xorl %esi,%esi
        movl $24,%edx
        movq %rax,%rbx
        xorl %eax,%eax
        call memset
        addq $8,%rsp
        movb $14,20(%rbx)
        movb $15,21(%rbx)
        movq $ByteArray_con_info,-8(%r12)
        movq %rbx,(%r12)
        leaq -7(%r12),%rbx
        jmp *(%rbp)
_c1kP:
        movq $56,904(%r13)
        movl $fill1_closure,%ebx
        jmp *-8(%r13)
        .size fill1_info, .-fill1_info
}}}

What a bummer that using `memset` for something as small as setting three
machine words (on a 64-bit platform) results in a `call` instruction getting
generated. Why not simply generate three `movq` instructions for the zero
initialization instead?

Currently, users can work around this by translating their `setByteArray#`
call into several `writeWordArray#` calls (see the first sketch after the
list below). This workaround obscures the meaning of the code and is not
portable across architectures (you have to use `CPP` to make it work on both
32-bit and 64-bit platforms). I'd like to add a Cmm-to-assembly optimization
to GHC that does the unrolling instead, so that the user can write more
natural code. Specifically, here's what I'm thinking:

* This only happens when the offset into the `ByteArray#` and the length of
  the range are constants that are multiples of the machine word size. So
  `setByteArray# arr 8# 16# x` is eligible on both 32-bit and 64-bit
  platforms, `setByteArray# arr 4# 8# x` is eligible only on a 32-bit
  platform, and `setByteArray# arr 16# y x` (non-constant length) is not
  eligible on any platform.
* This only happens when the `memset` call covers a range of 32 bytes or
  less (a sketch of this eligibility check follows the workaround sketch
  below).
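
For concreteness, here is a minimal sketch of the workaround described
above: zeroing the 24 bytes with explicit `writeWordArray#` calls instead of
`setByteArray#`, with `CPP` selecting the unrolling for the platform word
size. The module name and the `zero24` helper are made up for illustration;
they are not part of the ticket.

{{{
{-# language CPP #-}
{-# language MagicHash #-}
{-# language UnboxedTuples #-}

#include "MachDeps.h"

module FillArrayUnrolled
  ( fill
  ) where

import GHC.Exts
import GHC.IO

data ByteArray = ByteArray ByteArray#

fill :: IO ByteArray
fill = IO $ \s0 -> case newByteArray# 24# s0 of
  (# s1, m #) -> case zero24 m s1 of
    s2 -> case writeWord8Array# m 4# 14## s2 of
      s3 -> case writeWord8Array# m 5# 15## s3 of
        s4 -> case unsafeFreezeByteArray# m s4 of
          (# s5, r #) -> (# s5, ByteArray r #)

-- Zero the first 24 bytes with word-sized writes: three writes on a
-- 64-bit platform, six on a 32-bit platform. The indices given to
-- writeWordArray# are in machine words, not bytes.
zero24 :: MutableByteArray# s -> State# s -> State# s
#if SIZEOF_HSWORD == 8
zero24 m s0 =
  case writeWordArray# m 0# 0## s0 of
    s1 -> case writeWordArray# m 1# 0## s1 of
      s2 -> writeWordArray# m 2# 0## s2
#else
zero24 m s0 =
  case writeWordArray# m 0# 0## s0 of
    s1 -> case writeWordArray# m 1# 0## s1 of
      s2 -> case writeWordArray# m 2# 0## s2 of
        s3 -> case writeWordArray# m 3# 0## s3 of
          s4 -> case writeWordArray# m 4# 0## s4 of
            s5 -> writeWordArray# m 5# 0## s5
#endif
}}}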
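
And here is a tiny, self-contained sketch of the eligibility test the two
bullets describe. This is not GHC's actual Cmm pipeline code; the
`shouldUnrollMemset` name and its types are hypothetical, used only to pin
the conditions down.

{{{
-- Hypothetical predicate: unroll the memset only when both the offset and
-- the length are compile-time constants, both are multiples of the machine
-- word size, and the range is at most 32 bytes.
shouldUnrollMemset
  :: Int        -- machine word size in bytes (4 or 8)
  -> Maybe Int  -- offset, if it is a known constant
  -> Maybe Int  -- length, if it is a known constant
  -> Bool
shouldUnrollMemset wordSize (Just off) (Just len) =
  off `mod` wordSize == 0 && len `mod` wordSize == 0 && len <= 32
shouldUnrollMemset _ _ _ = False
}}}

With this predicate, `shouldUnrollMemset 8 (Just 8) (Just 16)` and
`shouldUnrollMemset 4 (Just 4) (Just 8)` are `True`, while
`shouldUnrollMemset 8 (Just 4) (Just 8)` and
`shouldUnrollMemset 8 (Just 16) Nothing` are `False`, matching the examples
in the first bullet.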