vector, alignment and SIMD through FFI

Hello Cafe, Recently I've been playing with the implementation of an algorithm, for which we already have highly-optimized implementations available (in plain C/C++ as well as OCaml with calls to C through FFI). The algorithm works on buffers/arrays/vectors/whatever you want to call it, which needs to be combined in certain ways. This can be highly optimized by using SIMD instructions (like the ones provides by several SSE versions). I'd like to get a to Haskell version which is comparable in efficiency as the existing versions, whilst remaining as 'functional' as possible. I don't mind jumping into some low-level C glue and FFI (using ccall or custom primops), but this should be limited. Currently I have something working (still highly unoptimized) using (unboxed) vectors from the vector package, using mutable versions within a well-contained ST environment in some places. One hot zone of the current version is combining several vectors, and the performance of this operation could be greatly improved by using SIMD instructions. There's one catch though: when using these, memory should be aligned on certain boundaries (16 byte in this case). First and foremost, to be able to pass my vectors to some C functions, I should change my code into using Storable vectors (which should be fine, I guess I can expect similar performance characteristics?). I couldn't find any information on alignment guarantees of these vectors though... Which is how I get to my question: are there any such guarantees? If not, are there any pointers to how to proceed with this? I guess tracking alignment at the type level should be the goal, you can find some code trying to explain my reasoning up to now at the end of this email. I have some issues with this: - I'd need to re-implement almost all vector operations, which seems stupid. - It doesn't actually work right now ;-) - It'd be nice to be able to encode 'compatible' alignment: as an example, a 16 byte aligned buffer is also 8 byte aligned... I hope the above explains somewhat my goal. Any thoughts & help on this would be very welcome! Thanks, Nicolas module Data.Vector.SIMD ( -- ... ) where import qualified Data.Vector.Storable as SV import Foreign.Storable (Storable, sizeOf) import Foreign.Ptr (Ptr, FunPtr) import Foreign.ForeignPtr (ForeignPtr, newForeignPtr) import System.IO.Unsafe (unsafePerformIO) class Alignment a where alignment :: a -> Int data A8Byte instance Alignment A8Byte where alignment _ = 8 data A16Byte instance Alignment A16Byte where alignment _ = 16 newtype Alignment a => SIMDVector a b = V (SV.Vector b) replicate :: (Alignment a, Storable b) => a -> Int -> b -> SIMDVector a b replicate a n b = V v where ptr = unsafePerformIO $ do v <- _mm_malloc n (alignment a) -- memset etc return v v = SV.unsafeFromForeignPtr0 ptr n -- These are 2 _stub versions of the procedures since xmminstr.h (or mm_malloc.h -- when using GCC) contains them as inline procedures which are not available -- as-is in a library. There should be some C module which exports -- _mm_malloc_stub and _mm_free_stub, which simply includes xmminstr.h and calls -- the underlying procedures. foreign import ccall "_mm_malloc_stub" _mm_malloc_stub :: Int -> Int -> IO (Ptr a) foreign import ccall "_mm_free_stub" _mm_free_stub :: FunPtr (Ptr a -> IO ()) _mm_malloc :: Storable a => Int -> Int -> IO (ForeignPtr a) _mm_malloc s l = do -- This fails: -- Ambiguous type variable `a0' in the constraint: -- (Storable a0) arising from a use of `sizeOf' -- v <- c_mm_malloc (s * sizeOf (undefined :: a)) l newForeignPtr _mm_free_stub undefined -- This allocates a 16 byte aligned output buffer, takes 2 existing ones and -- calls some FFI function to perform some magic. -- The implementation could run inside ST, if the FFI import (which e.g. works -- on a mutable buffer and returns IO ()) is lifted into ST using unsafeIOToST mySIMDFun :: SIMDVector A16Byte a -> SIMDVector A16Byte a -> SIMDVector A16Byte a mySIMDFun a b = undefined

On Fri, Jul 6, 2012 at 1:06 PM, Nicolas Trangez
-- This fails: -- Ambiguous type variable `a0' in the constraint: -- (Storable a0) arising from a use of `sizeOf'
Here you can either tie a type knot using proxy types or you can use the scoped type variable language extension. Perhaps I'm missing something specific to your use, but for the alignment issue you should be OK just calling allocBytes or one of its variants. I made some noise about this a bit ago and it resulted in some extra words in the report under mallocBytes: """ The block of memory is sufficiently aligned for any of the basic foreign types that fits into a memory block of the allocated size. """ Which I'm pretty sure GHC did, and still does, follow. Cheers, Thomas

On Fri, Jul 6, 2012 at 1:43 PM, Thomas DuBuisson wrote: The block of memory is sufficiently aligned for any of the basic
foreign types that fits into a memory block of the allocated size. That's not the same thing as a guarantee of 16-byte alignment, note, as
none of the standard foreign types have that requirement.

On Fri, 2012-07-06 at 13:43 -0700, Thomas DuBuisson wrote:
On Fri, Jul 6, 2012 at 1:06 PM, Nicolas Trangez
wrote: -- This fails: -- Ambiguous type variable `a0' in the constraint: -- (Storable a0) arising from a use of `sizeOf'
Here you can either tie a type knot using proxy types or you can use the scoped type variable language extension.
Guess I'll have to do some reading ;-) Thanks.
Perhaps I'm missing something specific to your use, but for the alignment issue you should be OK just calling allocBytes or one of its variants. I made some noise about this a bit ago and it resulted in some extra words in the report under mallocBytes:
""" The block of memory is sufficiently aligned for any of the basic foreign types that fits into a memory block of the allocated size. """
Which I'm pretty sure GHC did, and still does, follow.
Hmh... as far as I could find, mallocBytes basically does what malloc(3) does, which is 8-byte alignment if I'm not mistaken on my x86-64 Linux system. I could use those and the over-allocate-and-offset tricks, but... that's ugly unless strictly necessary ;-) Normally posix_memalign or memalign or valloc or _mm_malloc should provide what I need as-is. Except, when using those and vector's unsafeFromForeignPtr0, all I get is a "Vector a", which no longer has any alignment information in the type, so I can't write a function which only accepts N-aligned vectors. As a result, I'd need to be very careful only to pass aligned vectors to it (checking manually), add code to handle pre/post-alignment bytes in my SIMD functions (slow and stupid), or live with it and let my application crash at random. I found some work by Oleg Kiselyov and Chung-chieh Shan at [1] which might be related, yet as of now I feel like that's too general for my purpose (e.g. I don't see how to integrate it with vector). Thanks, Nicolas [1] http://okmij.org/ftp/Haskell/types.html#ls-resources
participants (3)
-
Bryan O'Sullivan
-
Nicolas Trangez
-
Thomas DuBuisson