vector-simd: some code available, and some questions

All, After my message of yesterday [1] I got down to it and implemented something along those lines. I created a playground repository containing the code at [2]. Initial benchmark results at [3]. More about the benchmark at the end of this email. First some questions and requests for help: - I'm stuck with a typing issue related to 'sizeOf' calculation at [4]. I tried a couple of things, but wasn't able to figure out how to fix it. - I'm using unsafePerformIO at [5], yet I'm not certain it's OK to do so. Are there better (safer/performant/...) ways to get this working? - Currently Alignment phantom types (e.g. A8 and A16) are not related to each other: a function (like Data.Vector.SIMD.Algorithms.unsafeXorSSE42) can have this signature: unsafeXorSSE42 :: Storable a => SV.Vector SV.A16 a -> SV.Vector SV.A16 a -> SV.Vector SV.A16 a Yet, imaging I'd have an "SV.Vector SV.A32 Word8" vector at hand, the function should accept it as well (a 32-byte aligned vector is also 16-byte aligned). Is there any way to encode this at the type level? That's about it :-) As of now, I only implemented a couple of the vector API functions (the ones required to execute my benchmark). Adding the others should be trivial. The benchmark works with Data.Vector.{Unboxed|Storable}.Vector (UV and SV) vectors of Word8 values, as well as my custom Data.Vector.SIMD.Vector type (MV) using 16-byte alignment (MV.Vector MV.A16 Word8). benchUV, benchSV and benchMV all take 2 pre-calculated Word8 vectors of given size (1024 and 4096) and xor them pairwise into the result using "zipWith xor". benchMVA takes 2 suitable MV vectors and xor's them into a third using a rather simple and unoptimized C implementation using SSE4.2 intrinsics [6]. This could be enhanced quite a bit (I guess using the prim calling convention, FFI overhead can be reduced as well). Currently, only vectors of a multiple of 32 bytes are supported (mostly because of laziness on my part). As you can see, the zipWith Data.Vector.SIMD implementation is slightly slower than the Data.Vector.Storable based one. I didn't perform much profiling yet, but I suspect allocation and ForeignPtr creation is to blame, this seems to be highly optimized in GHC.ForeignPtr.mallocPlainForeignPtrBytes as used by Data.Vector.Storable. Thanks for any input, Nicolas [1] http://www.haskell.org/pipermail/haskell-cafe/2012-July/102167.html [2] https://github.com/NicolasT/vector-simd/ [3] http://linode2.nicolast.be/files/vector-simd-xor1.html [4] https://github.com/NicolasT/vector-simd/blob/master/src/Data/Vector/SIMD/Alg... [5] https://github.com/NicolasT/vector-simd/blob/master/src/Data/Vector/SIMD/Alg... [6] https://github.com/NicolasT/vector-simd/blob/master/cbits/vector-simd.c#L47

On Sat, Jul 7, 2012 at 9:13 PM, Nicolas Trangez
- Currently Alignment phantom types (e.g. A8 and A16) are not related to each other: a function (like Data.Vector.SIMD.Algorithms.unsafeXorSSE42) can have this signature:
unsafeXorSSE42 :: Storable a => SV.Vector SV.A16 a -> SV.Vector SV.A16 a -> SV.Vector SV.A16 a
Yet, imaging I'd have an "SV.Vector SV.A32 Word8" vector at hand, the function should accept it as well (a 32-byte aligned vector is also 16-byte aligned). Is there any way to encode this at the type level?
If the set of alignments you care about is finite, you could do: class AlignedToAtLeast n a instance AlignedToAtLeast A1 A1 instance AlignedToAtLeast A4 A1 instance AlignedToAtLeast A4 A4 instance AlignedToAtLeast A8 A1 instance AlignedToAtLeast A8 A4 instance AlignedToAtLeast A8 A8 instance AlignedToAtLeast A16 A1 instance AlignedToAtLeast A16 A4 instance AlignedToAtLeast A16 A8 instance AlignedToAtLeast A16 A16 instance AlignedToAtLeast A32 A1 instance AlignedToAtLeast A32 A4 instance AlignedToAtLeast A32 A8 instance AlignedToAtLeast A32 A16 instance AlignedToAtLeast A32 A32 in which as you can see the numbers of instances grows super-linearly with the number of alignments, but if there's only a handful it's not that bad. Then: unsafeXorSSE42 :: (Storable a, AlignedToAtLeast A16 align) => SV.Vector align a -> SV.Vector align a -> SV.Vector align a A problem with the above is that third parties can add more instances, breaking the safety. If this is a concern you could do: class AlignedToAtLeastImpl n a -- do not export this class! ...same instances as above... class AlignedToAtLeastImpl n a => AlignedToAtLeast n a -- export this class instance AlignedToAtLeastImpl n a => AlignedToAtLeast n a The drawback is that it requires UndecidableInstances, and (what bothers me more) the list of instances won't be present in the haddocks. So instead of the single instance above you could write all of them again manually for the exported class, which has the drawback that you have to write all of them again manually, but not the other two. An alternative solution is to encode all of the alignments in unary, which is more general; if they're all going to be a power of two you can "store" just the logarithm: data One data Twice n -- not practical to call it Double :) class AlignedToAtLeast n a instance AlignedToAtLeast One One instance AlignedToAtLeast One (Twice a) instance AlignedToAtLeast n a => AlignedToAtLeast (Twice n) (Twice a) type A1 = One type A4 = Twice (Twice A1) type A8 = Twice A4 type A16 = Twice A8 type A32 = Twice A16 and you can apply the same private class thing from above if you want. -- Your ship was caught in a monadic eruption.

On Sat, Jul 7, 2012 at 9:59 PM, Gábor Lehel
class AlignedToAtLeast n a instance AlignedToAtLeast A1 A1 instance AlignedToAtLeast A4 A1 instance AlignedToAtLeast A4 A4 instance AlignedToAtLeast A8 A1 instance AlignedToAtLeast A8 A4 instance AlignedToAtLeast A8 A8 instance AlignedToAtLeast A16 A1 instance AlignedToAtLeast A16 A4 instance AlignedToAtLeast A16 A8 instance AlignedToAtLeast A16 A16 instance AlignedToAtLeast A32 A1 instance AlignedToAtLeast A32 A4 instance AlignedToAtLeast A32 A8 instance AlignedToAtLeast A32 A16 instance AlignedToAtLeast A32 A32
Oh dang it. Sorry. All of these should be the other way around. Proofreading doesn't help if it's your brain that's fried. -- Your ship was caught in a monadic eruption.

On Sat, 2012-07-07 at 21:59 +0200, Gábor Lehel wrote:
An alternative solution is to encode all of the alignments in unary, which is more general; if they're all going to be a power of two you can "store" just the logarithm:
data One data Twice n -- not practical to call it Double :)
class AlignedToAtLeast n a instance AlignedToAtLeast One One instance AlignedToAtLeast One (Twice a) instance AlignedToAtLeast n a => AlignedToAtLeast (Twice n) (Twice a)
type A1 = One type A4 = Twice (Twice A1) type A8 = Twice A4 type A16 = Twice A8 type A32 = Twice A16
and you can apply the same private class thing from above if you want.
Very ingenious, thanks! I pushed this into [1], although export lists of all modules most likely will need some love once things get into shape. This also allows functions to become more general: unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast SV.A16 o1, SV.Alignment o1, SV.AlignedToAtLeast SV.A16 o2, SV.Alignment o2, SV.AlignedToAtLeast SV.A16 o3, SV.Alignment o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a I wonder whether GHC's upcoming type-level numerals could be useful in this situation as well. Nicolas [1] https://github.com/NicolasT/vector-simd/commit/a4f13745eb24d87a3628af13109f3...

On Sun, Jul 8, 2012 at 12:21 AM, Nicolas Trangez
On Sat, 2012-07-07 at 21:59 +0200, Gábor Lehel wrote:
An alternative solution is to encode all of the alignments in unary, which is more general; if they're all going to be a power of two you can "store" just the logarithm:
data One data Twice n -- not practical to call it Double :)
class AlignedToAtLeast n a instance AlignedToAtLeast One One instance AlignedToAtLeast One (Twice a) instance AlignedToAtLeast n a => AlignedToAtLeast (Twice n) (Twice a)
type A1 = One type A4 = Twice (Twice A1) type A8 = Twice A4 type A16 = Twice A8 type A32 = Twice A16
and you can apply the same private class thing from above if you want.
Very ingenious, thanks! I pushed this into [1], although export lists of all modules most likely will need some love once things get into shape.
This also allows functions to become more general:
unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast SV.A16 o1, SV.Alignment o1, SV.AlignedToAtLeast SV.A16 o2, SV.Alignment o2, SV.AlignedToAtLeast SV.A16 o3, SV.Alignment o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a
I wonder if you could get that a bit shorter... I suppose you could write: instance (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast n (a, b) instance (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast n (a, b, c) ...and so on... though it feels a little strange semantically (relating a tuple to a scalar), but I don't see what harm can come of it. And then you can just write SV.AlignedToAtLeast SV.A16 (o1, o2, o3) in signatures. You can also make (Alignment n, Alignment a) a superclass constraint of AlignedToAtLeast, and write instances for Alignment inductively on One and Twice, and then you don't have to write Alignment o1 etc. separately either. So the signature would be just: unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast SV.A16 (o1, o2, o3)) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a which is friendlier.
I wonder whether GHC's upcoming type-level numerals could be useful in this situation as well.
I'd guess that this is what they're made for, but I haven't tried them.

On Sun, 2012-07-08 at 01:40 +0200, Gábor Lehel wrote:
unsafeXorSSE42 :: (Storable a,
SV.AlignedToAtLeast SV.A16 o1, SV.Alignment o1, SV.AlignedToAtLeast SV.A16 o2, SV.Alignment o2, SV.AlignedToAtLeast SV.A16 o3, SV.Alignment o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a
I wonder if you could get that a bit shorter... I suppose you could write:
instance (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast n (a, b) instance (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast n (a, b, c) ...and so on...
Once again, nifty! And implemented in [1].
though it feels a little strange semantically (relating a tuple to a scalar), but I don't see what harm can come of it. And then you can just write SV.AlignedToAtLeast SV.A16 (o1, o2, o3) in signatures.
You can also make (Alignment n, Alignment a) a superclass constraint of AlignedToAtLeast, and write instances for Alignment inductively on One and Twice, and then you don't have to write Alignment o1 etc. separately either. So the signature would be just:
unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast SV.A16 (o1, o2, o3)) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a
which is friendlier.
I implemented the inductive alignment calculation over One and Twice (good idea, and easy to do), but I don't get the thing about superclasses. I've been trying several approaches (including definitions based on forall and other trickery I never used before), but didn't get things to work, at least: the compiler always said I'd need UndecidableInstances, and that sounds scary... Care to elaborate? Thanks! Nicolas [1] https://github.com/NicolasT/vector-simd/commit/aedf25460b410e04a3d103befea59...

On Sun, Jul 8, 2012 at 3:05 AM, Nicolas Trangez
On Sun, 2012-07-08 at 01:40 +0200, Gábor Lehel wrote:
unsafeXorSSE42 :: (Storable a,
SV.AlignedToAtLeast SV.A16 o1, SV.Alignment o1, SV.AlignedToAtLeast SV.A16 o2, SV.Alignment o2, SV.AlignedToAtLeast SV.A16 o3, SV.Alignment o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a
I wonder if you could get that a bit shorter... I suppose you could write:
instance (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast n (a, b) instance (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast n (a, b, c) ...and so on...
Once again, nifty! And implemented in [1].
though it feels a little strange semantically (relating a tuple to a scalar), but I don't see what harm can come of it. And then you can just write SV.AlignedToAtLeast SV.A16 (o1, o2, o3) in signatures.
You can also make (Alignment n, Alignment a) a superclass constraint of AlignedToAtLeast, and write instances for Alignment inductively on One and Twice, and then you don't have to write Alignment o1 etc. separately either. So the signature would be just:
unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast SV.A16 (o1, o2, o3)) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a
which is friendlier.
I implemented the inductive alignment calculation over One and Twice (good idea, and easy to do), but I don't get the thing about superclasses. I've been trying several approaches (including definitions based on forall and other trickery I never used before), but didn't get things to work, at least: the compiler always said I'd need UndecidableInstances, and that sounds scary... Care to elaborate?
All I meant was class (Alignment n, Alignment a) => AlignedToAtLeast n a but I got a bit ahead of myself, because that rules out the instance on tuples. (I suppose you *could* write some kind of Alignment instance for them, taking their minimum or something, but that's getting a bit too subversive for me). The alternative, if you want both Alignment as a superclass and the ability to constrain multiple types at once, is to have the above, remove the instance on tuples, and instead something like: class (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast2 n a b instance (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast2 n a b class (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast3 n a b c instance (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast3 n a b c (feel free to think of better names!) unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast3 SV.A16 o1 o2 o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a That will require UndecidableInstances, but all that means is that GHC can't prove to itself that instance checking will terminate. So you could end up getting the compiler into an infinite loop (or in practice, to exceed its recursion limit). But it doesn't allow anything unsafe to happen at runtime, and there's plenty of perfectly good instances which terminate even if GHC can't prove it.

On Sun, 2012-07-08 at 10:27 +0200, Gábor Lehel wrote:
On Sun, Jul 8, 2012 at 3:05 AM, Nicolas Trangez
wrote: I implemented the inductive alignment calculation over One and Twice (good idea, and easy to do), but I don't get the thing about superclasses. I've been trying several approaches (including definitions based on forall and other trickery I never used before), but didn't get things to work, at least: the compiler always said I'd need UndecidableInstances, and that sounds scary... Care to elaborate?
All I meant was
class (Alignment n, Alignment a) => AlignedToAtLeast n a
but I got a bit ahead of myself, because that rules out the instance on tuples.
Maybe that's one of the things I ran into, can't remember OTOH.
(I suppose you *could* write some kind of Alignment instance for them, taking their minimum or something, but that's getting a bit too subversive for me).
Heh, it already feels evil-but-in-a-good-and-powerful-way to me right now ;-) Thanks for all your input!
The alternative, if you want both Alignment as a superclass and the ability to constrain multiple types at once, is to have the above, remove the instance on tuples, and instead something like:
class (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast2 n a b instance (AlignedToAtLeast n a, AlignedToAtLeast n b) => AlignedToAtLeast2 n a b class (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast3 n a b c instance (AlignedToAtLeast n a, AlignedToAtLeast n b, AlignedToAtLeast n c) => AlignedToAtLeast3 n a b c (feel free to think of better names!)
unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast3 SV.A16 o1 o2 o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a
Implemented in [1]. Code says unsafeXorSSE42 :: (Storable a, SV.AlignedToAtLeast3 SV.A16 o1 o2 o3) => SV.Vector o1 a -> SV.Vector o2 a -> SV.Vector o3 a ghci says unsafeXorSSE42 :: (AlignedToAtLeast (Data.Vector.SIMD.Mutable.Twice A8) o2, AlignedToAtLeast (Data.Vector.SIMD.Mutable.Twice A8) o1, AlignedToAtLeast (Data.Vector.SIMD.Mutable.Twice A8) o3, Foreign.Storable.Storable a) => Vector o1 a -> Vector o2 a -> Vector o3 a which is the same thing, so should do.
That will require UndecidableInstances, but all that means is that GHC can't prove to itself that instance checking will terminate. So you could end up getting the compiler into an infinite loop (or in practice, to exceed its recursion limit). But it doesn't allow anything unsafe to happen at runtime, and there's plenty of perfectly good instances which terminate even if GHC can't prove it.
I see. I had some rather unsafe/undecidable-by-a-developer things in mind, but I guess I should read up on some of the language extensions GHC provides. Thanks, Nicolas [1] https://github.com/NicolasT/vector-simd/commit/8f934891c9630a96ce009fafa7f6b...

On Sat, 2012-07-07 at 21:13 +0200, Nicolas Trangez wrote:
As you can see, the zipWith Data.Vector.SIMD implementation is slightly slower than the Data.Vector.Storable based one. I didn't perform much profiling yet, but I suspect allocation and ForeignPtr creation is to blame, this seems to be highly optimized in GHC.ForeignPtr.mallocPlainForeignPtrBytes as used by Data.Vector.Storable.
I got the MV benchmark on-par with SV by reworking the allocation mechanism: no more FFI involved, but based on GHC.Exts.newAlignedPinnedByteArray# and some other trickery, see [1]. This could still be improved a little by using PlainPtr, but this is not exported by GHC.ForeignPtr. This did have a pretty big performance-impact on the SIMD-based benchmark, compare [2] to the old one [3]. I have no clue why the 4096 case now only uses twice the time of the 1024 one, unlike the expected 4x (+- as before). Nicolas [1] https://github.com/NicolasT/vector-simd/commit/5ec539167254435ef4e7d308706dc... [2] http://linode2.nicolast.be/files/vector-simd-xor2.html [3] http://linode2.nicolast.be/files/vector-simd-xor1.html

I've not been following this thread very closely, but it seems like what
you're trying to do may be related to Geoffrey Mainland's work on SIMD
support in GHC. See [1] for his "SIMD-enabled version of the vector
library". He's also written some blog posts about this [2].
Reiner
[1] https://github.com/mainland/vector
[2] http://ghc-simd.blogspot.com.au/
On 8 July 2012 05:13, Nicolas Trangez
All,
After my message of yesterday [1] I got down to it and implemented something along those lines. I created a playground repository containing the code at [2]. Initial benchmark results at [3]. More about the benchmark at the end of this email.
First some questions and requests for help:
- I'm stuck with a typing issue related to 'sizeOf' calculation at [4]. I tried a couple of things, but wasn't able to figure out how to fix it. - I'm using unsafePerformIO at [5], yet I'm not certain it's OK to do so. Are there better (safer/performant/...) ways to get this working? - Currently Alignment phantom types (e.g. A8 and A16) are not related to each other: a function (like Data.Vector.SIMD.Algorithms.unsafeXorSSE42) can have this signature:
unsafeXorSSE42 :: Storable a => SV.Vector SV.A16 a -> SV.Vector SV.A16 a -> SV.Vector SV.A16 a
Yet, imaging I'd have an "SV.Vector SV.A32 Word8" vector at hand, the function should accept it as well (a 32-byte aligned vector is also 16-byte aligned). Is there any way to encode this at the type level?
That's about it :-)
As of now, I only implemented a couple of the vector API functions (the ones required to execute my benchmark). Adding the others should be trivial.
The benchmark works with Data.Vector.{Unboxed|Storable}.Vector (UV and SV) vectors of Word8 values, as well as my custom Data.Vector.SIMD.Vector type (MV) using 16-byte alignment (MV.Vector MV.A16 Word8).
benchUV, benchSV and benchMV all take 2 pre-calculated Word8 vectors of given size (1024 and 4096) and xor them pairwise into the result using "zipWith xor". benchMVA takes 2 suitable MV vectors and xor's them into a third using a rather simple and unoptimized C implementation using SSE4.2 intrinsics [6]. This could be enhanced quite a bit (I guess using the prim calling convention, FFI overhead can be reduced as well). Currently, only vectors of a multiple of 32 bytes are supported (mostly because of laziness on my part).
As you can see, the zipWith Data.Vector.SIMD implementation is slightly slower than the Data.Vector.Storable based one. I didn't perform much profiling yet, but I suspect allocation and ForeignPtr creation is to blame, this seems to be highly optimized in GHC.ForeignPtr.mallocPlainForeignPtrBytes as used by Data.Vector.Storable.
Thanks for any input,
Nicolas
[1] http://www.haskell.org/pipermail/haskell-cafe/2012-July/102167.html [2] https://github.com/NicolasT/vector-simd/ [3] http://linode2.nicolast.be/files/vector-simd-xor1.html [4]
https://github.com/NicolasT/vector-simd/blob/master/src/Data/Vector/SIMD/Alg... [5]
https://github.com/NicolasT/vector-simd/blob/master/src/Data/Vector/SIMD/Alg... [6] https://github.com/NicolasT/vector-simd/blob/master/cbits/vector-simd.c#L47
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Sun, 2012-07-08 at 20:49 +1000, Reiner Pope wrote:
I've not been following this thread very closely, but it seems like what you're trying to do may be related to Geoffrey Mainland's work on SIMD support in GHC. See [1] for his "SIMD-enabled version of the vector library". He's also written some blog posts about this [2].
Thanks. I knew about this before (did at least some research ;-)), yet I think the scope is somewhat different (assuming my playground work has any scope at all...). I think Geoffreys work goes much further than what I'm trying to accomplish: it's not my intention to write SIMD-based code at a high level (which is, obviously, a great thing, but beyond my capabilities as of now I'm afraid). What I'd like to achieve is to be able to write SIMD code at the C level (or have SIMD code generated somehow through LLVM/Clang(/Poly?), ISPC or others), and call into this from Haskell, providing some guarantees w.r.t. alignment (and maybe length-in-bytes as well?) of the passed vector pointers (which you don't get when using plain Storable vectors as far as I could find/see). Consider it a very short-term stop-gap solution until the tools are in place to write high-level code which yields the same performance as a more low-level (assembly-level) implementation. Nicolas
participants (3)
-
Gábor Lehel
-
Nicolas Trangez
-
Reiner Pope