the last mile in haskell performance

Looking at this: https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h... It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types. Since it is impossible to have unboxed arrays of user-defined types. Am I right? This is definitively very bad for tasks that are inherently single threaded and in general for the image of Haskell as a practical language. I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way, I know that the language ATS has unboxing a la carte.... -- Alberto.

This might be of interest to you: https://hackage.haskell.org/package/structs On 11/12/2015 6:49 PM, Alberto G. Corona wrote:
Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single threaded and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

There are no examples. It is hard to guess the functionality and the
maturity of he approach.
2015-11-12 18:56 GMT+01:00 David Kraeutmann
This might be of interest to you: https://hackage.haskell.org/package/structs
On 11/12/2015 6:49 PM, Alberto G. Corona wrote:
Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single
threaded
and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.

How about vector-th-unbox[1]?
[1] https://www.stackage.org/package/vector-th-unbox
On Thu, Nov 12, 2015 at 2:53 PM, Alberto G. Corona
There are no examples. It is hard to guess the functionality and the maturity of he approach.
2015-11-12 18:56 GMT+01:00 David Kraeutmann
: This might be of interest to you: https://hackage.haskell.org/package/structs
On 11/12/2015 6:49 PM, Alberto G. Corona wrote:
Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single
threaded
and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

That is more practical.
It is a pity that the Unboxed class does not interact with the UNPACK
pragma. Or it seems so.
The problem that round my head after being exposed to the Google phylosophy
of "count the bytes" is the trade off when choosing containers: Either
boxed, pure, linked multithreaded or unboxed mutable single threaded.
Haskell philosophy choose the first, while almost all the mainstream
languages choose the second. That is another problem for the adoption of
haskell. Google people say: we don´t need multithreaded programs because
we run many single threaded programs in one machine, we use all the cores.
Only when there is a single application to run the justification for the
extra effort of multithreading is justified. And this happens rarely in the
real world: In scientific, engineering, financial it is usual. It also
happens in distributed settings but in that last case, performance per
thread and core is also critical, so Haskell is ruled out.
But that hasn´t to be that way, or at least that is what I think.
DiffUArrays are internally mutable but with a pure interface. They use a
kind of versioning. in single threaded environments it theoretically
perform at mutable speeds. the versioning of diffArray is the blend of
packed and linked structure that can mix the two worlds.
If the unboxing is extended to any kind of user defined data, the
versioning idea can be used to have containers that perform at C speeds
when single threaded, but preserve the purity when multithreaded. So it is
possible to have the cake and eat it too.
it is even possible to codify balanced binary trees in a compact diffarray,
so very fast maps can be used in single threaded applications that also are
pure and work multithreaded.
The goal is to remove the objections about haskell coming from that side of
the computer industry by having such containers available without forcing
the user to know lots of things about the internals of haskell and GHC.
I do not know if there are thing going on in some of this direction. Maybe
I´m being simplistic and there is something that I miss .
By experience I know that what sell more from a language is not the real
performance numbers, but the approaches that the language takes and how
much that promises for the future:
For example I can develop a kind of container following this idea that
perform badly both in single and multithreaded. for sure the early version
should be so. But, if people understand that the design has potential for
being optimized in the future, people will buy the idea and will accept
happily the Haskell language for performance semi-critical apps because
they will have arguments against the objections of this kind .
2015-11-12 23:56 GMT+01:00 Michael Snoyman
How about vector-th-unbox[1]?
[1] https://www.stackage.org/package/vector-th-unbox
On Thu, Nov 12, 2015 at 2:53 PM, Alberto G. Corona
wrote: There are no examples. It is hard to guess the functionality and the maturity of he approach.
2015-11-12 18:56 GMT+01:00 David Kraeutmann
: This might be of interest to you: https://hackage.haskell.org/package/structs
On 11/12/2015 6:49 PM, Alberto G. Corona wrote:
Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single
threaded
and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.

Michael,
Thanks to your insight I took a look at Data.Vector.Unboxed
I was very happy, but there is a caveat; Looking at the documentation:
"In particular, unboxed vectors of pairs are represented as pairs of
unboxed vectors". https://tldrify.com/cfj
it seems that derivingUnbox and all the Unboxed Vectors code do as much as
is possible with what GHC offers to make things unboxed, which are reduced
to the primitive types.
So the only way to pack an user defined data is to create one Vector of
unboxed primitive type for each field of the data . So if I have a data
with five fields the result is five Vectors.
This is nice in some cases, but does most of the time does not. this does
not solve the problem of CPU cache since the fields in the data are at
least lenght (Vector) away. I mean that if the vector is moderately long,
if the first field is in the cache, the second or third etc may not be.
Usually the fields of any data are handled together.
This solution will not match C- C++ speed consistently for most single
threaded programs. Moreover there is an additional cost in the
packing/unpacking necessary.
However this is the best that can be done with what GHC offers now.
The solution is the possibility of unboxed user-defined types:
data MyUnboxedData= MyUnboxedData# Int# Float# ....
where MyUnboxedData# is an unboxed constructor
Then there are some questions:
1) Am I wrong?
2) It is feashible?
3) It is worth the pain? (My answer: yes This is not satisfactory, it is
very important for Haskell success and should be addressed if there is no
insurmountable barrier)
4) Are there alternatives before making a formal proposal?
Thanks
2015-11-13 12:42 GMT+01:00 Alberto G. Corona
That is more practical.
It is a pity that the Unboxed class does not interact with the UNPACK pragma. Or it seems so.
The problem that round my head after being exposed to the Google phylosophy of "count the bytes" is the trade off when choosing containers: Either boxed, pure, linked multithreaded or unboxed mutable single threaded.
Haskell philosophy choose the first, while almost all the mainstream languages choose the second. That is another problem for the adoption of haskell. Google people say: we don´t need multithreaded programs because we run many single threaded programs in one machine, we use all the cores. Only when there is a single application to run the justification for the extra effort of multithreading is justified. And this happens rarely in the real world: In scientific, engineering, financial it is usual. It also happens in distributed settings but in that last case, performance per thread and core is also critical, so Haskell is ruled out.
But that hasn´t to be that way, or at least that is what I think. DiffUArrays are internally mutable but with a pure interface. They use a kind of versioning. in single threaded environments it theoretically perform at mutable speeds. the versioning of diffArray is the blend of packed and linked structure that can mix the two worlds.
If the unboxing is extended to any kind of user defined data, the versioning idea can be used to have containers that perform at C speeds when single threaded, but preserve the purity when multithreaded. So it is possible to have the cake and eat it too.
it is even possible to codify balanced binary trees in a compact diffarray, so very fast maps can be used in single threaded applications that also are pure and work multithreaded.
The goal is to remove the objections about haskell coming from that side of the computer industry by having such containers available without forcing the user to know lots of things about the internals of haskell and GHC.
I do not know if there are thing going on in some of this direction. Maybe I´m being simplistic and there is something that I miss .
By experience I know that what sell more from a language is not the real performance numbers, but the approaches that the language takes and how much that promises for the future:
For example I can develop a kind of container following this idea that perform badly both in single and multithreaded. for sure the early version should be so. But, if people understand that the design has potential for being optimized in the future, people will buy the idea and will accept happily the Haskell language for performance semi-critical apps because they will have arguments against the objections of this kind .
2015-11-12 23:56 GMT+01:00 Michael Snoyman
: How about vector-th-unbox[1]?
[1] https://www.stackage.org/package/vector-th-unbox
On Thu, Nov 12, 2015 at 2:53 PM, Alberto G. Corona
wrote: There are no examples. It is hard to guess the functionality and the maturity of he approach.
2015-11-12 18:56 GMT+01:00 David Kraeutmann
: This might be of interest to you: https://hackage.haskell.org/package/structs
On 11/12/2015 6:49 PM, Alberto G. Corona wrote:
Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single
threaded
and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.
-- Alberto.

This is why CPUs have many independent cache lines. Unpacking a vector into multiple vectors is usually fine for performance. I have seen it actually increase performance, because it simplifies addressing. -Will
On Nov 14, 2015, at 05:30, Alberto G. Corona
wrote: This is nice in some cases, but does most of the time does not. this does not solve the problem of CPU cache since the fields in the data are at least lenght (Vector) away. I mean that if the vector is moderately long, if the first field is in the cache, the second or third etc may not be. Usually the fields of any data are handled together.

Indeed! There's no one true best format.
There was some work to add native support for c structs to the Haskell ffi,
as storable format for non recursive records, but there's some really
gnarly subtleties with respect to alignment and packing that come up that
likely are fundamentally impossible to characterize in a simple algorithmic
fashion that will serve everyone well.
On Saturday, November 14, 2015, Will Yager
This is why CPUs have many independent cache lines. Unpacking a vector into multiple vectors is usually fine for performance. I have seen it actually increase performance, because it simplifies addressing.
-Will
On Nov 14, 2015, at 05:30, Alberto G. Corona
javascript:_e(%7B%7D,'cvml','agocorona@gmail.com');> wrote: This is nice in some cases, but does most of the time does not. this does not solve the problem of CPU cache since the fields in the data are at least lenght (Vector) away. I mean that if the vector is moderately long, if the first field is in the cache, the second or third etc may not be. Usually the fields of any data are handled together.

Hi Carter
Now that the Strict pragma is being implemented, perhaps it is time to
retake that work.
I have leksah IDE installed. it is a haskell application. What I can verify
with my performance meter on Windows is that merely scrolling text, the
lekssah produces much more page faults than any non Haskell application.
2015-11-15 1:45 GMT+01:00 Carter Schonwald
Indeed! There's no one true best format. There was some work to add native support for c structs to the Haskell ffi, as storable format for non recursive records, but there's some really gnarly subtleties with respect to alignment and packing that come up that likely are fundamentally impossible to characterize in a simple algorithmic fashion that will serve everyone well.
On Saturday, November 14, 2015, Will Yager
wrote: This is why CPUs have many independent cache lines. Unpacking a vector into multiple vectors is usually fine for performance. I have seen it actually increase performance, because it simplifies addressing.
-Will
On Nov 14, 2015, at 05:30, Alberto G. Corona
wrote: This is nice in some cases, but does most of the time does not. this does not solve the problem of CPU cache since the fields in the data are at least lenght (Vector) away. I mean that if the vector is moderately long, if the first field is in the cache, the second or third etc may not be. Usually the fields of any data are handled together.
-- Alberto.

Hi Will,
Right, I'm not an expert on low level things, but yes, each memory page can
cache a different vector and even can work faster. Specially if the
algoritm uses a few fields of a large structure. I was wrong on that.
But anyway, Unboxed need more native support to give Haskell more
credibility in performance critical problems. Now it has some conversion
overhead for user defined data. That may be optimized away but the whole
thing is second class, via an instance instead of a language feature.
Maybe automatic deriving Unboxed instances can be the right compromise
2015-11-14 18:53 GMT+01:00 Will Yager
This is why CPUs have many independent cache lines. Unpacking a vector into multiple vectors is usually fine for performance. I have seen it actually increase performance, because it simplifies addressing.
-Will
On Nov 14, 2015, at 05:30, Alberto G. Corona
wrote: This is nice in some cases, but does most of the time does not. this does not solve the problem of CPU cache since the fields in the data are at least lenght (Vector) away. I mean that if the vector is moderately long, if the first field is in the cache, the second or third etc may not be. Usually the fields of any data are handled together.
-- Alberto.

"Alberto G. Corona "
Hi Will, Right, I'm not an expert on low level things, but yes, each memory page can cache a different vector and even can work faster. Specially if the algoritm uses a few fields of a large structure. I was wrong on that.
But anyway, Unboxed need more native support to give Haskell more credibility in performance critical problems. Now it has some conversion overhead for user defined data. That may be optimized away but the whole thing is second class, via an instance instead of a language feature.
Maybe automatic deriving Unboxed instances can be the right compromise
Given how the imagination immediately suggests that: 1. performance nuances dictated by the context might suggest different preferences for Unboxed encodings.. 2. and, on the other hand, given the undesirability of always engaging oneself into full-blown instance definition -- for various reasons.. ..it suggests that a language feature would be helpful, that would allow to gradually inform the derivation GHC makes, without fully engaging into specifying complete details. Can we imagine that a derivation could proceed, for example, from: - a /subset/ of minimal typeclass definition, or alternatively - direct parameters ..? Also, a set of deeper questions could be asked, if you can forgive me this daydreaming.. If we follow the spirit of self-defined languages (won't name them): 1. does it make sense to "open up" the process of derivation? 2. what prevents us from describing derivations as a functional transformation? 3. what is the possible language for these -- TH? something else? -- с уважениeм / respectfully, Косырев Сергей

Also, a set of deeper questions could be asked, if you can forgive me this
daydreaming..
If we follow the spirit of self-defined languages (won’t name them):
1.
does it make sense to “open up” the process of derivation?
2.
what prevents us from describing derivations as a functional
transformation?
3.
what is the possible language for these — TH? something else?
https://wiki.haskell.org/GHC.Generics?
2015-11-17 10:50 GMT+01:00 Kosyrev Serge
"Alberto G. Corona "
writes: Hi Will, Right, I'm not an expert on low level things, but yes, each memory page can cache a different vector and even can work faster. Specially if the algoritm uses a few fields of a large structure. I was wrong on that.
But anyway, Unboxed need more native support to give Haskell more credibility in performance critical problems. Now it has some conversion overhead for user defined data. That may be optimized away but the whole thing is second class, via an instance instead of a language feature.
Maybe automatic deriving Unboxed instances can be the right compromise
Given how the imagination immediately suggests that:
1. performance nuances dictated by the context might suggest different preferences for Unboxed encodings..
2. and, on the other hand, given the undesirability of always engaging oneself into full-blown instance definition -- for various reasons..
..it suggests that a language feature would be helpful, that would allow to gradually inform the derivation GHC makes, without fully engaging into specifying complete details.
Can we imagine that a derivation could proceed, for example, from:
- a /subset/ of minimal typeclass definition, or alternatively - direct parameters
..?
Also, a set of deeper questions could be asked, if you can forgive me this daydreaming..
If we follow the spirit of self-defined languages (won't name them):
1. does it make sense to "open up" the process of derivation?
2. what prevents us from describing derivations as a functional transformation?
3. what is the possible language for these -- TH? something else?
-- с уважениeм / respectfully, Косырев Сергей _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Also see
https://hackage.haskell.org/package/vector-0.11.0.0/docs/Data-Vector-Storabl...
On Thu, Nov 12, 2015 at 11:49 AM, Alberto G. Corona
Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single threaded and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
-- Alberto.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

A buildt-in derivation of the Storable instance would be the solution for
the problem. Perhaps a meaningful summer of code project?
2015-11-12 20:38 GMT+01:00 William Yager
Also see https://hackage.haskell.org/package/vector-0.11.0.0/docs/Data-Vector-Storabl...
On Thu, Nov 12, 2015 at 11:49 AM, Alberto G. Corona
wrote: Looking at this:
https://downloads.haskell.org/~ghc/6.12.3/docs/html/users_guide/primitives.h...
It seems that it is impossible to manage data in Haskell within a core without L1 cache faults. Except for unboxed arrays of primitive types.
Since it is impossible to have unboxed arrays of user-defined types.
Am I right?
This is definitively very bad for tasks that are inherently single threaded and in general for the image of Haskell as a practical language.
I have more to say about that, but I would like to know first if I´m right and second If there is some idea to going on to permit user defined boxed datatypes. Or if there is some low level trick for having it using foreign call and unsafeCoerce in some way,
I know that the language ATS has unboxing a la carte....
-- Alberto.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
-- Alberto.
participants (9)
-
Alberto G. Corona
-
Carter Schonwald
-
David Kraeutmann
-
Janis Voigtländer
-
Kosyrev Serge
-
Michael Snoyman
-
Tom Ellis
-
Will Yager
-
William Yager