
Hi all,

I was just teaching a class on minimization methods, with a focus on conjugate gradient minimization in particular, and one of my main points was that with conjugate gradient minimization we only need three or four arrays of size N (depending on whether we use the Fletcher-Reeves or Polak-Ribiere variant), and that this is a huge win. I was explaining that my code uses Fletcher-Reeves, even though it generally has less stable convergence behavior, because avoiding one extra copy is well worth it when each copy is on the order of a gigabyte (which is not unusual in my line of work).

This got me thinking about one of the largest problems with writing serious numerical code in Haskell, which is that of memory consumption and avoiding temporary variables. The count of three arrays assumes that we do in-place updates, which presupposes an imperative model. But the main reason I'd like to switch to Haskell is so that I can write pure functional code, which is easier to read, easier to reason about, and easier to get right. So I started wondering whether there's a solution that would allow us to write pretty high-level pure functional code, while the RTS can realize at run-time that we have the only reference to the input argument and that it is therefore safe to consume it destructively.

My thought is to move some of the amazing rewriting streams stuff that's going on with Data.ByteString and Manuel's array fusion work onto the memory management level. If we can rewrite high-level array manipulations into a form in which the compiler can fuse the loops and write good low-level code, why not also allow automatic in-place updates (when possible)? My thought is to create a function like the following:

  copy :: ForeignPtr a -> IO (ForeignPtr a)

This function ordinarily copies its input into a new chunk of memory, which we then update in place. So we could call it with something like

  a `plus` b = do a' <- copy a
                  a' `plusEqual` b
                  return a'

Of course, in practice I imagine this whole business being encapsulated in a module where it's called with unsafePerformIO (but it's really perfectly safe, a la Data.ByteString). The key is that copy will ask the GC to check whether there is more than one reference to its input, and if there isn't, then it'll return its input without making a copy.

Obviously, we'd want to be careful in making calls to copy (since the GC overhead could overwhelm the potential (transient) memory savings). And I've got no idea how hard it would be to find out if there are no other references, or how effective this would be... that is, whether the compiler might generate "false" references to the input that would prevent the savings from ever being realized.

There are also other questions. My copy above is simple, but probably we'd be better off with a more general

  copy :: a -> (a -> IO a) -> IO a

which could be used in the base libraries for other sorts of objects, such as Array, which might not have a ForeignPtr at their back end. We might also want a more general copy that would handle the case where we would like to copy either one variable or another, and we don't care which; i.e. in my `plus` example above, if "a" can't be discarded, but "b" can, we'll be out of luck. Perhaps we want something like

  reuseOrNot :: a -> (a -> IO b) -> (a -> IO b) -> IO b

which would call one function or the other, depending on whether the input is deemed expendable by the GC, so we could write

  a `plus` b = reuseOrNot a
                 (\a' -> do { a' `plusEqual` b; return a' })
                 (\a' -> reuseOrNot b
                           (\b' -> do { b' `plusEqual` a'; return b' })
                           (\b' -> a' `boringAdd` b'))

where boringAdd is an addition that doesn't reuse either array.

Anyhow, all this would be predicated on the possibility of asking the GC whether anyone else has a reference to a given object, and I've no idea how tough (or slow) this would be. It'd also be effectively predicated on someone like dons having the time and interest to write good fusion code which would allow one to make use of this in referentially transparent high-level pure functional code. I think it'd be pretty cool, and would certainly (not immediately, but as my research group ramps up) be interested in being a guinea pig user of this stuff.

Any thoughts? Am I crazy? Is this infeasible for some reason? Should I have patented this idea before mentioning it on the list? Has someone else already patented it? -- David Roundy Department of Physics Oregon State University
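[Editorial note: for concreteness, the unsafePerformIO encapsulation alluded to above might look like the sketch below. This is only a shape: Vec, copy, and plusEqual are hypothetical names taken from the post, and copy's reference check is exactly the missing RTS support under discussion.]

  import Foreign.ForeignPtr (ForeignPtr)
  import System.IO.Unsafe (unsafePerformIO)

  newtype Vec = Vec (ForeignPtr Double)

  -- hypothetical primitives from the post, not a real API:
  copy :: Vec -> IO Vec             -- copies, unless the GC says we hold
  copy = error "needs RTS support"  -- the only reference to the argument

  plusEqual :: Vec -> Vec -> IO ()  -- destructive elementwise addition
  plusEqual = error "low-level loop over the ForeignPtr contents"

  -- the pure interface, a la Data.ByteString's use of unsafePerformIO
  plus :: Vec -> Vec -> Vec
  plus a b = unsafePerformIO $ do
    a' <- copy a
    a' `plusEqual` b
    return a'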

On 3/8/07, David Roundy
I started wondering whether there's a solution that would allow us to write pretty high-level pure functional code, while the RTS can realize at run-time that we have the only reference to the input argument and that it is therefore safe to consume it destructively.
I think you're talking about uniqueness typing which is supported by the programming language Clean. -- Dan

Or possibly more generally copy-on-write, which requires one more level of indirection (handle instead of ptr). Since you are talking about using ForeignPtr, this is already within your power to prototype, I should think. Dan Dan Piponi wrote:
On 3/8/07, David Roundy
wrote: I started wondering whether there's a solution that would allow us to write pretty high-level pure functional code, while the RTS can realize at run-time that we have the only reference to the input argument and that it is therefore safe to consume it destructively.
I think you're talking about uniqueness typing which is supported by the programming language Clean. -- Dan

I'm thinking you're missing the point. The point is to copy without writing, and that requires some knowledge (whether static or runtime) of whether anyone else has a reference to my data--which copy-on-write won't give me. David On Thu, Mar 08, 2007 at 11:15:25AM -0800, Dan Weston wrote:
Or possibly more generally copy-on-write, which requires one more level of indirection (handle instead of ptr). Since you are talking about using ForeignPtr, this is already within your power to prototype, I should think.
Dan
Dan Piponi wrote:
On 3/8/07, David Roundy
wrote: I started wondering whether there's a solution that would allow us to write pretty high-level pure functional code, while the RTS can realize at run-time that we have the only reference to the input argument and that it is therefore safe to consume it destructively.
I think you're talking about uniqueness typing which is supported by the programming language Clean.

On Mar 8, 2007, at 14:21 , David Roundy wrote:
I'm thinking you're missing the point. The point is to copy without writing, and that requires some knowledge (whether static or runtime) of whether anyone else has a reference to my data--which copy-on-write won't give me.
Actually, I was thinking this sounded a lot like DiffArrays. You don't actually get a guarantee that anything else isn't holding a reference, but you do get destructive update with any old references being quietly changed to diffs against the head. -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
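[Editorial note: for readers who haven't used them, a minimal example of the behavior Brandon describes, using Data.Array.Diff as it shipped with GHC at the time:]

  import Data.Array.Diff

  demo :: (Double, Double)
  demo = (a0 ! 0, a1 ! 0)
    where
      a0 = listArray (0, 2) [1, 2, 3] :: DiffArray Int Double
      a1 = a0 // [(0, 10)]  -- updates the head in place; a0 silently
                            -- becomes a diff against a1
  -- demo == (1.0, 10.0): the old version stays valid, just slower to read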

On Thu, Mar 08, 2007 at 02:50:51PM -0500, Brandon S. Allbery KF8NH wrote:
On Mar 8, 2007, at 14:21 , David Roundy wrote:
I'm thinking you're missing the point. The point is to copy without writing, and that requires some knowledge (whether static or runtime) of whether anyone else has a reference to my data--which copy-on-write won't give me.
Actually, I was thinking this sounded a lot like DiffArrays. You don't actually get a guarantee that anything else isn't holding a reference, but you do get destructive update with any old references being quietly changed to diffs against the head.
Except that DiffArrays are slow and expensive in both space and time (compared with strict unboxed arrays). They necessarily hold boxed values, so you pay a factor of at least two in space cost (for arrays of Doubles) right off the top, and there's no way you could recover that. I'll admit that they could be useful for something, but I couldn't guess what. In my experience, arrays are always used with strict values, and updates change all the values in an array. -- David Roundy Department of Physics Oregon State University

Except that DiffArrays are slow and expensive in both space and time (compared with strict unboxed arrays). They necessarily hold boxed values, so you pay a factor of at least two in space cost (for arrays of Doubles) right off the top, and there's no way you could recover that. I'll admit that they could be useful for something, but I couldn't guess what. In my experience, arrays are always used with strict values, and updates change all the values in an array.
i never quite understood why DiffArrays are so slow at the moment, so perhaps this is a good opportunity to ask for enlightenment on this issue?-)

and i don't understand why they should necessarily hold boxed values only: the idea was to hold an efficient full version at the front, and slower differences at the back. what would stop the main version from being unboxed?

when i looked into this a while ago, it seemed that the implementation had a lot of overhead, mostly due to avoiding concurrency issues? perhaps the main version could be associated with a specific concurrent haskell thread, so that only other threads would have to pay the management fee, while the common pattern of single-threaded (pardon the pun;) access would be spared that overhead? to make sure that secondary references cannot access the main copy while it is being updated in place, perhaps each thread might need its own main copy (again, that extra price would not have to be paid in single-threaded practice)? or is that impossible? claus

On Thu, Mar 08, 2007 at 08:45:06PM -0000, Claus Reinke wrote:
Except that DiffArrays are slow and expensive in both space and time (compared with strict unboxed arrays). They necessarily hold boxed values, so you pay a factor of at least two in space cost (for arrays of Doubles) right off the top, and there's no way you could recover that. I'll admit that they could be useful for something, but I couldn't guess what. In my experience, arrays are always used with strict values, and updates change all the values in an array.
i never quite understood why DiffArrays are so slow at the moment, so perhaps this is a good opportunity to ask for enlightenment on this issue?-)
and i don't understand why they should necessarily hold boxed values only: the idea was to hold an efficient full version at the front, and slower differences at the back. what would stop the main version from being unboxed?
I guess if you do it in this way you'd still have worse memory use than with ordinary arrays, since you'd hang onto the old unboxed version indefinitely, and also store a bunch of boxed new versions. The real issue for me is that DiffArrays only make any sense at all if you often update a subset of your array, which I never expect to do... so even if they were efficiently implemented to the point of zero overhead, they would gain me nothing, and that's almost certainly overly optimistic. And to be honest, I'm doubtful that there are many applications in which you want to update only a very small subset of an array, so I'm not really sure what DiffArrays are intended to be used for. Once you've updated all (or most of) the elements, DiffArrays are as slow as their back end, and you may as well just use their back end without the DiffArray business. -- David Roundy Department of Physics Oregon State University

On Mar 8, 2007, at 16:27 , David Roundy wrote:
The real issue for me is that DiffArrays only make any sense at all if you often update a subset of your array, which I never expect to do... so even if they were efficiently implemented to the point of zero overhead, they would gain me nothing, and that's almost certainly overly optimistic.
But that's not my understanding of what's *supposed* to be happening: the point of DiffArrays is not optimizing partial updates, it's optimizing the head at the expense of any (by intent few or none) references that might be held elsewhere. As such, if there are no such references the DiffArray *should* get you cheap in-place (destructive) updates. It's possible that the current *implementation* is flawed in the way you describe; if so, that should probably be brought up on the libraries list, because the documentation and the intent seem to be saying otherwise. -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Thu, Mar 08, 2007 at 04:32:01PM -0500, Brandon S. Allbery KF8NH wrote:
On Mar 8, 2007, at 16:27 , David Roundy wrote:
The real issue for me is that DiffArrays only make any sense at all if you often update a subset of your array, which I never expect to do... so even if they were efficiently implemented to the point of zero overhead, they would gain me nothing, and that's almost certainly overly optimistic.
But that's not my understanding of what's *supposed* to be happening: the point of DiffArrays is not optimizing partial updates, it's optimizing the head at the expense of any (by intent few or none) references that might be held elsewhere. As such, if there are no such references the DiffArray *should* get you cheap in-place (destructive) updates.
Ah, I see. Yes, I misunderstood. When I read about DiffArrays (ages ago), I thought they stored the old array plus differences, a mistake which makes sense given that I wasn't yet comfortable with safe encapsulation of unsafePerformIO. You're right, but the cheap destructive updates still presumably involve creating and garbage-collecting O(N) thunks, which isn't particularly cheap, in my opinion. Or if the update function holds a reference to the original array, then you'll have to store two copies of the array in a non-dense format, and we won't have gained anything, as far as the temporary goes. -- David Roundy Department of Physics Oregon State University

i never quite understood why DiffArrays are so slow at the moment, so perhaps this is a good opportunity to ask for enlightenment on this issue?-)
it seems i should qualify that. some simple-minded testing shows that DiffArrays do become faster than Arrays if the arrays get big enough. for small arrays, though, just copying the Array seems faster than maintaining the DiffArray. copying then discarding old arrays can be cheaper than the overhead associated with avoiding copies.

so in the simple test attached, loop 10000 a has Array winning for n=10, and DiffArray winning for n=30, with DiffUArray always lagging behind DiffArray. and that is for the most expensive kind of update - updating every element, one element at a time (meaning one array copy for each element). for bulk updates (whole-array operations), the picture is likely to be even worse for DiffArrays, as the Array version needs only one copy, then a loop to fill it. in the test code, loop2 10000 (n=30) has Array winning, Diff(U)Array far behind. that is, provided i got the strictness right - otherwise, laziness could lead the code to access old versions of arrays, defeating the object of single-threaded access to only the newest version of the DiffArray..

so, your intuition seems to have been more or less correct: if one can do whole-array updates, DiffArrays don't seem to buy anything (the copying cost for Array disappears in the update loop, the cost of managing DiffArray is not offset by any gains); if one has to do partial updates, DiffArrays can be a win, provided the Array copying gets more expensive than the DiffArray management overhead. and unboxed DiffArrays, while they exist, don't seem to buy much (apart from making it easier to avoid the non-strictness pitfall).

but if one does whole-array updates, there isn't much to be gained by avoiding intermediate copies, is there? whether you fill a new memory area, or overwrite an old one, you still have to loop through the whole thing, writing and reading. claus
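[Editorial note: the attached test isn't reproduced in the archive; a reconstruction of what loop and loop2 presumably look like, based only on the description above:]

  import Data.Array.IArray

  -- element-at-a-time update: for Array, one full copy per element
  loop :: IArray a Int => Int -> a Int Int -> a Int Int
  loop 0 arr = arr
  loop n arr = loop (n - 1) (foldl upd arr (indices arr))
    where upd acc i = acc // [(i, acc ! i + 1)]

  -- bulk update: a single (//) covering every index, so Array copies once
  loop2 :: IArray a Int => Int -> a Int Int -> a Int Int
  loop2 0 arr = arr
  loop2 n arr = loop2 (n - 1) (arr // [(i, arr ! i + 1) | i <- indices arr])

  -- for DiffArray, each intermediate array must be forced (e.g. by
  -- demanding an element) to keep access single-threaded, per the
  -- strictness caveat above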

On Fri, Mar 09, 2007 at 04:02:04PM -0000, Claus Reinke wrote:
but if one does whole-array updates, there isn't much to be gained by avoiding intermediate copies, is there? whether you fill a new memory area, or overwrite an old one, you still have to loop through the whole thing, writing and reading.
I would gain a 20% reduction in memory use, which could come out to a few gigabytes for large problems. That's worth some degree of effort. Actually, it'd be a bit less than 20%, but for large problems it would approach 20%. -- David Roundy Department of Physics Oregon State University

On Fri, Mar 09, 2007 at 04:02:04PM -0000, Claus Reinke wrote:
but if one does whole-array updates, there isn't much to be gained by avoiding intermediate copies, is there? whether you fill a new memory area, or overwrite an old one, you still have to loop through the whole thing, writing and reading.
I would gain a 20% reduction in memory use, which could come out to a few gigabytes for large problems. That's worth some degree of effort. Actually, it'd be a bit less than 20%, but for large problems it would approach 20%.
ah, ok, i'm not used to thinking in such scales;-) (perhaps you should get in touch with those SAC people, after all - i don't know what their state of play is, but many years ago, they started in an office near mine, and they were definitely thinking about large arrays, even about how to distribute them, and computations on them; also, they used to bench against Fortran and vendor-optimised compilers, not against generic C)

but still, even for memory use: if you can run the algorithm in-place, that means that the total of memory in use never exceeds the size of one array, even if the memory manager might not notice. the slice of indices in the new array already written can never again be accessed in the old array; the slice of indices in the old array still valid cannot yet have been written in the new array.

assuming that GHC's GC isn't clever enough to reclaim parts of arrays, and that wasting even virtual memory on unused space slows down the code by driving it into the wrong level of the memory hierarchy, would it help to split your arrays into slices represented as separate arrays, so that each slice could be abandoned during GC as soon as the algorithm is done with it? to avoid having to deal with complex internal boundary conditions, one might try to define suitable abstractions, eg instead of Array (Int,Int) Int one could try Array Int (Array Int Int), and define all operations on that. or, if we are talking about a one-dimensional vector, perhaps use divMod to distribute the indices over a couple of sub-vectors (from 1xN to 2x(N/2))? (a sketch of this idea follows below.)

given these wild speculations, perhaps i should mention my reason: i've been a fan of reference-counting and the in-place update possibilities it opens, and i have often been annoyed by the persistent reluctance of well-written GC-based code to perform worse than RC-based counterparts. with such an apparent waste of resources and lack of precise knowledge, it seems unfair, but i had to admit it seems to work rather better than i expected (and when reality doesn't adhere to theory, reality is unlikely to give in first;-). whether this still holds for the level of number-crunching you are considering, i can't say - according to some of their publications, SAC seems to do quite well, at a performance level Haskell has yet to reach, at the price of lower expressiveness for non-array programs (but higher-level array operations).. claus
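[Editorial note: a rough sketch of the divMod distribution mentioned above, splitting one logical vector over two independently collectable arrays; the names and representation are invented for illustration:]

  import Data.Array.Unboxed

  -- one logical vector stored as two half-size arrays; once an algorithm
  -- is done with one half, the GC can reclaim that slice on its own
  data Sliced = Sliced (UArray Int Double) (UArray Int Double)

  idx :: Sliced -> Int -> Double
  idx (Sliced evens odds) i =
    case i `divMod` 2 of
      (q, 0) -> evens ! q
      (q, _) -> odds  ! q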

Hello Claus, Saturday, March 10, 2007, 4:36:22 AM, you wrote:
ah, ok, i'm not used to thinking in such scales;-) (perhaps you should get in touch with those SAC people, after all - i don't know what their state of play is, but many years ago, they started in an office near mine, and they were definitely thinking about large arrays, even about how to distribute them, and computations on them;
last days i learned the details of google's MapReduce system. it seems that this approach is very interesting for dealing with large arrays: files (arrays) are split into chunks, and operations are split into chunks, too. afaik, some C compilers are already able to automatically split vector operations across several threads? at least, it would be interesting to implement the same technique for GHC, maybe just in the form of a library, like google does -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
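[Editorial note: something like this chunked splitting can already be expressed at the library level with strategies from the parallel package; a sketch, not what Google or GHC actually do:]

  import Control.Parallel.Strategies

  -- evaluate a list in chunks of n elements, each chunk in parallel
  parMapChunked :: Int -> (a -> b) -> [a] -> [b]
  parMapChunked n f xs = map f xs `using` parListChunk n rseq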

bulat.ziganshin:
Hello Claus,
Saturday, March 10, 2007, 4:36:22 AM, you wrote:
ah, ok, i'm not used to thinking in such scales;-) (perhaps you should get in touch with those SAC people, after all - i don't know what their state of play is, but many years ago, they started in an office near mine, and they were definitely thinking about large arrays, even about how to distribute them, and computations on them;
last days i learned the details of google's MapReduce system. it seems that this approach is very interesting for dealing with large arrays: files (arrays) are split into chunks, and operations are split into chunks, too. afaik, some C compilers are already able to automatically split vector operations across several threads? at least, it would be interesting to implement the same technique for GHC, maybe just in the form of a library, like google does
See the data parallel arrays library: http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell -- Don

On 3/11/07, Donald Bruce Stewart
See the data parallel arrays library:
http://haskell.org/haskellwiki/GHC/Data_Parallel_Haskell
-- Don
(which, btw, is much more interesting than google's mapreduce stuff, since it does *nested* data parallelism) -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

Claus Reinke wrote:
i never quite understood why DiffArrays are so slow at the moment, so perhaps this is a good opportunity to ask for enlightenment on this issue?-)
it seems i should qualify that. some simple-minded testing shows that DiffArrays do become faster than Arrays if the arrays get big enough. for small arrays, though, just copying the Array seems faster than maintaining the DiffArray. copying then discarding old arrays can be cheaper than the overhead associated with avoiding copies.
Looking at the implementation of DiffArrays, there are some obvious optimisations that aren't done. UNPACKing everything in sight would be a good start. I guess this code has never really had a performance evaluation done on it. DiffUArray stores the array unboxed, but the difference elements are stored boxed, as far as I can tell. That means extra unnecessary allocation for each update. Also the indices are stored boxed - there's no reason to do that for either kind of DiffArray. All the operations are wrapped in unsafePerformIO, which is far from optimal (unsafePerformIO is not inlined, we should use inlinePerformIO if possible). Would someone like to shake this library into shape? I bet there's a significant performance win to be had. Cheers, Simon
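[Editorial note: the library's real constructors differ; this is only the shape of the UNPACKing Simon suggests, with DiffData and its fields invented here. Each diff stores an unboxed index and a strict element:]

  -- hypothetical reworked internals, for illustration only
  data DiffData a i e
    = Current !(a i e)                              -- the flat, current version
    | Diff {-# UNPACK #-} !Int !e (DiffData a i e)  -- one unboxed-index diff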

Hello Simon, Friday, March 9, 2007, 7:44:46 PM, you wrote:
Looking at the implementation of DiffArrays, there are some obvious optimisations that aren't done.
... and don't forget that it uses MVar instead of IORef to be thread-safe -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Simon,
Friday, March 9, 2007, 7:44:46 PM, you wrote:
Looking at the implementation of DiffArrays, there are some obvious optimisations that aren't done.
.... and don't forget that it uses MVar instead of IORef to be thread-safe
I don't see any obvious optimisations there - some kind of thread-safety is essential. We could try IORef and atomicModifyIORef, but I doubt it would be much quicker, if at all. The other optimisations I mentioned are easy wins, though. Cheers, Simon

Hello Simon, Tuesday, March 13, 2007, 1:00:46 PM, you wrote:
Looking at the implementation of DiffArrays, there are some obvious optimisations that aren't done.
.... and don't forget that it uses MVar instead of IORef to be thread-safe
I don't see any obvious optimisations there - some kind of thread-safety is essential.
for general implementation - yes. but we can, for example, make this a type parameter, so that two kinds of DiffArray will exist -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Looking at the implementation of DiffArrays, there are some obvious optimisations that aren't done.
.... and don't forget that it uses MVar instead of IORef to be thread-safe
I don't see any obvious optimisations there - some kind of thread-safety is essential. We could try IORef and atomicModifyIORef, but I doubt it would be much quicker, if at all. The other optimisations I mentioned are easy wins, though.
it does look as if the measures for thread-safety protect us from the case in which DiffArray shouldn't be used anyway, slowing down the intended use case. though one could probably come up with a program that accesses a DiffArray in a single-threaded manner from a sequence of different Haskell threads;) that does not seem a likely usage scenario, does it?

is there a way to make DiffArray access thread-specific, so that accessing a DiffArray from a different thread would make a copy specific to that thread, similar to how old versions are handled, while each thread has fast access to its own DAs? how big is the impact of those MVars anyway, compared to a version that wouldn't have to worry about threads? claus

Hello Claus, Tuesday, March 13, 2007, 2:10:49 PM, you wrote:
how big is the impact of those MVars anyway, compared to a version that wouldn't have to worry about threads?
on my 1GHz cpu, withMVar locking manages about 2*10^6 lock/unlock cycles per second, while an IORef operation is 10-100x faster -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

David Roundy wrote:
Actually, I was thinking this sounded a lot like DiffArrays.
Except that DiffArrays are slow and expensive in both space and time (compared with strict unboxed arrays). They necessarily hold boxed values so you pay a factor of at least two in space cost (for arrays of Doubles) right off the top, and there's no way you could recover that.

What about using a DiffArray with reasonably-sized (cache-friendly?) UArrays as the elements? That way, the cost of boxing can be amortized over more Doubles. Efficiency would depend on update patterns, of course.
-k
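[Editorial note: a minimal sketch of the layout Ketil suggests; the chunk size and names are made up. Chunked boxes one UArray per chunk, so an update copies only one chunk:]

  import Data.Array.Diff (DiffArray)
  import Data.Array.Unboxed (UArray)
  import Data.Array.IArray

  type Chunked = DiffArray Int (UArray Int Double)

  chunkSize :: Int
  chunkSize = 1024

  readC :: Chunked -> Int -> Double
  readC a i = (a ! q) ! r  where (q, r) = i `divMod` chunkSize

  -- updating one element copies a single chunk, amortizing the boxing
  -- cost Ketil mentions over chunkSize Doubles
  writeC :: Chunked -> Int -> Double -> Chunked
  writeC a i x = a // [(q, (a ! q) // [(r, x)])]
    where (q, r) = i `divMod` chunkSize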

I might be missing the point, but I think you are missing mine. The copy-on-write I am talking about means that it's no longer "your data", so you don't need any knowledge of who has access to it, because you don't own it or have a pointer to it. It is owned by some broker from which you request a read-only or write-access handle as needed. Requested changes to underlying data already shared by others trigger a copy and a reassignment of pointers to it for your handle alone. The copy cost appears only when there is more than one handle to the same data and one of them changes it.

All this can be wrapped up and hidden away. If you want to escape this broker business and steal back your data, just ask: the broker will duplicate shared data needed by others, change their pointers to it, then disown the pointer it returns to you. This is copying without writing (unnecessarily). Or am I missing something? (A sketch of such a broker follows the quoted text below.) Dan

David Roundy wrote:
I'm thinking you're missing the point. The point is to copy without writing, and that requires some knowledge (whether static or runtime) of whether anyone else has a reference to my data--which copy-on-write won't give me.
David
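[Editorial note: a rough sketch of the broker Dan describes, with all names hypothetical:]

  import Data.IORef

  -- a handle points at shared data plus a count of handles sharing it
  data Shared a = Shared { count :: IORef Int, payload :: IORef a }
  newtype Handle a = Handle (IORef (Shared a))

  newHandle :: a -> IO (Handle a)
  newHandle x = do
    c <- newIORef 1
    p <- newIORef x
    Handle `fmap` newIORef (Shared c p)

  -- duplicating a handle must go through the broker: it bumps the count
  dup :: Handle a -> IO (Handle a)
  dup (Handle h) = do
    s <- readIORef h
    modifyIORef (count s) (+ 1)
    Handle `fmap` newIORef s

  -- a write updates in place when we are the sole owner, else copies first
  write :: Handle a -> (a -> a) -> IO ()
  write (Handle h) f = do
    s <- readIORef h
    n <- readIORef (count s)
    if n == 1
      then modifyIORef (payload s) f
      else do
        modifyIORef (count s) (subtract 1)
        x  <- readIORef (payload s)
        c' <- newIORef 1
        p' <- newIORef (f x)
        writeIORef h (Shared c' p')

Note that nothing stops a caller from duplicating a Handle without going through dup, which is exactly the management problem David raises next.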

Ah, I was missing your point. I've heard something called copy-on-write, which wasn't what you describe (or I also misunderstood it when I heard it before).

I see. But how would one manage these handles? What's to keep me from accidentally copying a handle? It sounds like it'd require explicit memory management, in order to avoid ever copying a handle, if I were to implement this myself. Or are you suggesting that if the simons implemented a copy-on-write scheme in ghc's RTS, then I'd be all set?

In short, managing the reader count is exactly the problem that sounds hard, and I still don't have any idea how one would go about it. David On Thu, Mar 08, 2007 at 01:31:19PM -0800, Dan Weston wrote:
I might be missing the point, but I think you are missing mine.
The copy-on-write I am talking about means that it's no longer "your data", so you don't need any knowledge of who has access to it, because you don't own it or have a pointer to it. It is owned by some broker from which you request a read-only or write-access handle as needed. Requested changes to underlying data already shared by others trigger a copy and a reassignment of pointers to it for your handle alone.
The copy cost appears only when there is more than one handle to the same data and one of them changes it.
All this can be wrapped up and hidden away. If you want to escape this broker business and steal back your data, just ask: the broker will duplicate shared data needed by others, change their pointers to it, then disown the pointer it returns to you.
This is copying without writing (unnecessarily). Or am I missing something?
Dan

David Roundy
I see. But how would one manage these handles? What's to keep me from accidentally copying a handle? It sounds like it'd require explicit memory management, in order to avoid ever copying a handle, if I were to implement this myself.
Because you seem to write imperative code anyways: can't you simply use a specialized state monad with the array(s) hidden inside the monad as monad state? -Matthias -- Matthias Neubauer | Universität Freiburg, Institut für Informatik | tel +49 761 203 8060 Georges-Köhler-Allee 79, 79110 Freiburg i. Br., Germany | fax +49 761 203 8052
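[Editorial note: presumably Matthias means something like the standard ST encoding, sketched here with an in-place vector addition; the pure run function is where the "monad state" stays hidden:]

  import Control.Monad.ST
  import Data.Array.ST
  import Data.Array.Unboxed

  -- destructive a += b inside ST; the mutation never escapes
  plusEqualST :: UArray Int Double -> STUArray s Int Double -> ST s ()
  plusEqualST b a = do
    (lo, hi) <- getBounds a
    sequence_ [ do x <- readArray a i
                   writeArray a i (x + b ! i)
              | i <- [lo .. hi] ]

  -- the pure interface: one copy (the thaw), then in-place updates
  plus :: UArray Int Double -> UArray Int Double -> UArray Int Double
  plus a b = runSTUArray $ do
    a' <- thaw a
    plusEqualST b a'
    return a'

David's objection below still applies: the library's innards are imperative, and the thaw costs a copy precisely because ST cannot know whether the caller holds the only reference.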

On Thu, Mar 08, 2007 at 11:09:35PM +0100, Matthias Neubauer wrote:
David Roundy
writes: I see. But how would one manage these handles? What's to keep me from accidentally copying a handle? It sounds like it'd require explicit memory management, in order to avoid ever copying a handle, if I were to implement this myself.
Because you seem to write imperative code anyways: can't you simply use a specialized state monad with the array(s) hidden inside the monad as monad state?
No, the point is to avoid writing imperative code. My examples used imperative code, but that would be wrapped at the lowest level of the array library, and all the real code would be pure. -- David Roundy Department of Physics Oregon State University

David Roundy
No, the point is to avoid writing imperative code. My examples used imperative code, but that would be wrapped at the lowest level of the array library, and all the real code would be pure.
Still sounds like a state monad to me. Your 'array library', I mean. -Matthias -- Matthias Neubauer | Universität Freiburg, Institut für Informatik | tel +49 761 203 8060 Georges-Köhler-Allee 79, 79110 Freiburg i. Br., Germany | fax +49 761 203 8052

On Mar 8, 2007, at 17:09 , Matthias Neubauer wrote:
David Roundy
writes: I see. But how would one manage these handles? What's to keep me from accidentally copying a handle? It sounds like it'd require explicit memory management, in order to avoid ever copying a handle, if I were to implement this myself.
Because you seem to write imperative code anyways: can't you simply use a specialized state monad with the array(s) hidden inside the monad as monad state?
Err, wasn't the point that he wanted a pure way to get out of writing that imperative code? -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Have you looked into using STM (Software Transactional Memory)? This problem seems like some subset of concurrent programming. Dan David Roundy wrote:
Ah, I was missing your point, I've heard something called copy-on-write, which wasn't what you describe (or I also misunderstood it when I heard it before).
I see. But how would one manage these handles? What's to keep me from accidentally copying a handle? It sounds like it'd require explicit memory management, in order to avoid ever copying a handle, if I were to implement this myself.
Or are you suggesting that if the simons implemented a copy-on-write scheme in ghc's RTS, then I'd be all set?
In short, managing the reader count is exactly the problem that sounds hard, and I still don't have any idea how one would go about it.
David

Nothing is being done concurrently, so I don't see what STM would gain us. What is it that you're thinking we could gain from STM? David On Thu, Mar 08, 2007 at 03:04:43PM -0800, Dan Weston wrote:
Have you looked into using STM (Software Transactional Memory)? This problem seems like some subset of concurrent programming.
Dan

On 3/9/07, David Roundy
Nothing is being done concurrently, so I don't see what STM would gain us. What is it that you're thinking we could gain from STM?
It's shiny and new, so it will make your code look sexy? :-)

So what happened to linear types? I remember reading Once Upon A Type and Linear Types Can Change the World during my hons year (and that was more than 10 years ago). Mercury uses linear modes, which are much the same. One of the arguments Fergus Henderson made was that in the case of arrays, the update cost if you have to copy is too high, so you use linear types/modes to make copying impossible. While you can debate the pros and cons, the fact that a trivial "mistake" in your program could change it from using in-place update to using copying does lend weight to the argument for distinguishing the copying and non-copying forms at the type level.

cheers, T. -- Dr Thomas Conway drtomc@gmail.com

You are beautiful; but learn to work, for you cannot eat your beauty. -- Congo proverb

it seems we can almost do this now without adding any new API calls: just have 'thawArray' and 'freezeArray' perform the check, and behave like 'unsafeThawArray' or 'unsafeFreezeArray' when their argument is the only reference. The compiler may even be able to statically replace some calls to thawArray or freezeArray with the unsafe versions, given an extension of the rules syntax:

  {-# RULES forall a | unique a . freezeArray a = unsafeFreezeArray a #-}

this syntax actually came up in a previous discussion about wanting to fire rules only when the argument is a constant literal (though you don't care about the particular constant value it is). I imagine inferring the uniqueness of values shouldn't be that hard, as a form of it is already done for avoiding thunk-updates. John -- John Meacham - ⑆repetae.net⑆john⑈

On 3/8/07, John Meacham
it seems we can almost do this now without adding any new API calls, just have 'thawArray' and 'freezeArray' perform the check, and behave like 'unsafeThawArray' or 'unsafeFreezeArray' when they are the only reference.
The compiler may even be able to statically replace some calls to thawArray or freezeArray with the unsafe versions with an extension of the rules syntax
{-# RULES forall a | unique a . freezeArray a = unsafeFreezeArray a #-}
this syntax actually came up in a previous discussion about wanting to fire rules only when the argument is a constant literal (though, you don't care about the particular constant value it is)
I imagine inferring the uniqueness of values shouldn't be that hard, as a form of it is already done for avoiding thunk-updates.
You have to be careful here. Uniqueness typing is not the same as usage analysis, but the two are confusingly similar.

Imagine a function which takes, say, an array as its argument. If it says that the array is unique, then there must be only a single reference to that array, as the function probably updates it destructively. Compare this to the situation where we say that the array is only used once in the function. The array may have other references to it in this case; the function doesn't care. This boils down to the fact that usage analysis propagates information to the point where a value is created (should the thunk be updateable or not?), whereas uniqueness analysis propagates information to the point where it is used (is it OK to destructively update the value instead of copying it?).

I'm also not sure that inferring uniqueness types in the compiler would be the way to go. I think David wants to be certain that his arrays are not copied. Having to inspect the output of the compiler is not the nicest way to do this. Some mechanism which enables the programmer to tell the compiler what his intent is (such as the uniqueness type system a la Clean) would be much preferable.

Of course, that doesn't mean that your idea is useless. It could still, and probably would, be a very valuable optimization in the compiler. I just don't think that it will help David with his particular problem. Cheers, Josef

John Meacham wrote:
it seems we can almost do this now without adding any new API calls, just have 'thawArray' and 'freezeArray' perform the check, and behave like 'unsafeThawArray' or 'unsafeFreezeArray' when they are the only reference.
The compiler may even be able to statically replace some calls to thawArray or freezeArray with the unsafe versions with an extension of the rules syntax
{-# RULES forall a | unique a . freezeArray a = unsafeFreezeArray a #-}
this syntax actually came up in a previous discussion about wanting to fire rules only when the argument is a constant literal (though, you don't care about the particular constant value it is)
I imagine inferring the uniqueness of values shouldn't be that hard, as a form of it is already done for avoiding thunk-updates.
GHC doesn't have any kind of uniqueness analysis right now. It's pretty hard to do in general: imagine a function that takes an array as an argument and delivers an array as a result. It'll probably need two versions: one when the argument is unique, one for when it's not. If you can see all the call sites you might be able to throw away one of these versions, but with separate compilation that's the uncommon case. BTW, we don't do any update-avoidance at the moment. The old update analyser was slow and didn't give any significant benefit, so we dropped it. I think the right way forward is array fusion, which is what the NDP folks are working on. Cheers, Simon

Simon Marlow wrote:
GHC doesn't have any kind of uniqueness analysis right now. It's pretty hard to do in general: imagine a function that takes an array as an argument and delivers an array as a result. It'll probably need two versions: one when the argument is unique, one for when it's not.
What if the compiler might only create the version that requires uniqueness, and callers that don't already have uniqueness must make a copy? (or that could be a trivial wrapper function alongside the main uniqueness-requiring version). Would this be at all significantly worthwhile? Isaac

On Fri, Mar 09, 2007 at 10:24:14AM +0000, Simon Marlow wrote:
GHC doesn't have any kind of uniqueness analysis right now. It's pretty hard to do in general: imagine a function that takes an array as an argument and delivers an array as a result. It'll probably need two versions: one when the argument is unique, one for when it's not. If you can see all the call sites you might be able to throw away one of these versions, but with separate compilation that's the uncommon case.
Ah, yes. I keep on thinking this is used since I studied it carefully as a potential algorithm for jhc... (still undecided, my implementation is too buggy to use in production and jhc has enough bugs as is :) ) http://citeseer.ist.psu.edu/wansbrough98once.html
BTW, we don't do any update-avoidance at the moment. The old update analyser was slow and didn't give any significant benefit, so we dropped it.
I find a general problem when researching ghc is that a lot of past research projects used it as a base and declare things like 'we integrated this into the ghc 4.04 haskell compiler', but it is not clear whether the code actually made it back into the mainline or not. Perhaps a section of the wiki is in order that lists the most recent paper that describes various parts of what is actually used in the production ghc. perhaps something like

  type checker: boxy types and impredicativity paper + Wobbly type GADT inference paper
  optimizer/simplifier: secrets of the haskell inliner paper
  runtime: eval/apply vs push-enter paper
  garbage collector: non-stop collection for haskell paper
  fundep implementation: ?
  concurrency: STM papers + original concurrency paper

(are these accurate BTW?) John -- John Meacham - ⑆repetae.net⑆john⑈

On Mon, Mar 12, 2007 at 05:21:46PM -0700, John Meacham wrote:
type checker : boxy types and impredicativity paper + Wobbly type GADT inference paper
Both of those seem to take basic Hindley-Damas-Milner as a prerequisite ... While I've invented two closely related typechecking algorithms, and I'm pretty sure they're both close relatives of HDM, I can't seem to find a readable paper explaining the real HDM algorithm. A pointer to that would be very useful. Stefan

Hello John, Tuesday, March 13, 2007, 3:21:46 AM, you wrote:
garbage collector: non-stop collection for haskell paper
this is not implemented, and btw exists as one of the haskell SoC tasks. only *multithreading* for the old GC is implemented -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

| Perhaps a section of the wiki is in order that lists the most recent
| paper that describes various parts of what is actually used in the
| production ghc.
|
| perhaps something like
|
| type checker: boxy types and impredicativity paper + Wobbly type GADT inference paper
| optimizer/simplifier: secrets of the haskell inliner paper
| runtime: eval/apply vs push-enter paper
| garbage collector: non-stop collection for haskell paper
| fundep implementation: ?
| concurrency: STM papers + original concurrency paper

If you put up a draft, I will check it for veracity. (You might want a section for things that are described in the literature, but are *not* in the HEAD. Notably:
- Wansbrough's usage analysis
- Ennals's optimistic evaluation)

Simon

my line of work). This got me thinking about one of the largest problems with writing serious numerical code in Haskell, which is that of memory consumption and avoiding temporary variables. ... pretty high-level pure functional code, while the RTS can realize at run-time that we have the only reference to the input argument and that it is therefore safe to consume it destructively. My thought is to move some of the amazing rewriting streams stuff that's going on with Data.ByteString and Manuel's array fusion work onto the memory management level. If we can rewrite high-level array manipulations into a form in which the compiler can fuse the loops and write good low-level code, why not also allow automatic in-place updates (when possible)?
fusing loops over arrays already avoids temporaries, doesn't it? one thing that struck me when reading the rewriting strings paper was that it reduces structure fusion to loop-over-structure fusion, which would finally provide a framework for borrowing all that nice work on with-loop folding in SAC for Haskell arrays (all of those high-level array operations in SAC seem to be reducible to with-loops).

  Single Assignment C -- Efficient Support for High-level Array Operations in a Functional Setting.
  Sven-Bodo Scholz. Journal of Functional Programming 13(6), pp. 1005-1059, Cambridge University Press, 2003.
  http://www.sac-home.org/publications/sac-overview-jfp-03.pdf
  [section 4.2 in the paper is on with-loop folding]

  SAC home page, about SAC:
  http://www.sac-home.org/index.php?p=.%2F11_About%2F11_About_SAC

anyone for porting the array sublanguage (shape- and dimension-invariant operations on multi-dimensional arrays) of SAC to Haskell? how much of that is covered in the ongoing Haskell array work?
The key is that copy will ask the GC to check whether there is more than one reference to its input, and if there isn't, then it'll return its input without making a copy.
Obviously, we'd want to be careful in making calls to copy (since the GC overhead could overwhelm the potential (transient) memory savings).
SAC uses reference counting for that. as for using forced GC to get similar effects, i'm not optimistic, as that would seem to invalidate the advantages of GC-based memory management. btw, the nice thing about making a copy is that you know you now have the only reference to that copy. so you could make a copy, then update in place to your heart's content, until you pass on your reference to others. claus

On Mar 8, 2007, at 12:00 PM, David Roundy wrote:
Hi all,
I was just teaching a class on minimization methods, with a focus on conjugate gradient minimization in particular, and one of my main points was that with conjugate gradient minimization we only need three or four arrays of size N (depending on whether we use the Fletcher-Reeves or Polak-Ribiere variant), ... This got me thinking about one of the largest problems with writing serious numerical code in Haskell, which is that of memory consumption and avoiding temporary variables.
I've been following this discussion with interest, as I've been looking in some detail at conjugate gradient algorithms as part of my day job, and I've spent a good deal of time thinking about exactly the problems you raise. For those following along at home, here's a sample somewhat-imperative CG algorithm (this is the simplified stripped-down version):

  for j <- seq(1#cgit) do
    q = A p
    alpha = rho / (p DOT q)
    z += alpha p
    rho0 = rho
    r -= alpha q
    rho := r DOT r
    beta = rho / rho0
    p := r + beta p
  end

Here p, q, r, and z are vectors, A is the derivative of our function (in this case a sparse symmetric positive-definite matrix, but we can really think of it as a higher-order function of type Vector -> Vector) and the greek letters are scalars. The "answer" is z. In practice we'd not run a fixed number of iterations, but instead do a convergence test. All the hard work is in the line "q = A p", but the storage consumption is mostly in the surrounding code. On a parallel machine (where these sorts of programs are often run) this part of the algorithm has almost no parallelism---all those dot products and normalizations preclude it---but the A p step and the dot products themselves are of course quite parallelizable.

Sadly, many of the suggestions, while generally sound, just don't apply well to this particular use case.

* As others have noted, burying the updatable state in a monad just begs the question. The resulting code looks nothing at all like the mathematics, either, which is a big problem if you're trying to understand and maintain it (the above code is virtually isomorphic to the problem specification). I'm sure David is seeking a more-declarative version of the code than the spec from which we were working. Note that rather than embedding all the state in a special monad, we might as well be using update-in-place techniques (such as the += and -= operations you see above) with the monads we've already got; the result will at least be readable, but it will be too imperative.

* There are opportunities for loop fusion in CG algorithms---we can compute multiple dot products on the same array in a single loop---but these have the flavor of fusing multiple consumers of a data structure into a single consumer, which I believe is still an unsolved problem in equational frameworks like foldr/build or streams. It's a bit like fusing:

    n      = foldl' (+) 0 . map (const 1)    $ xs
    sum_xs = foldl' (+) 0                    $ xs
    sum_sq = foldl' (+) 0 . map (\x -> x*x)  $ xs

  into:

    (n, sum_xs, sum_sq) =
      foldl' (\(a0,b0,c0) (an,bn,cn) -> (a0+an, b0+bn, c0+cn)) (0,0,0) .
      map (\x -> (const 1 x, x, x*x)) $ xs

  which we understand how to do, but not equationally (or at least we didn't last I looked, though Andy Gill and I had both fantasized about possibilities for doing so). None of these fusion opportunities actually save space, which makes the problem trickier still.

* The algorithm as written already tries to express minimal storage. The only question is, do +=, -=, and := update their left-hand side in place, or do we think of them (in the Haskell model of the universe) as fresh arrays, and the previous values as newly-created garbage? My challenge to fellow Haskell hackers: find a way to express this such that it doesn't look so imperative.

* Even if we do a good job with += and -=, which is what David seems to be looking for, we still have those irritating := assignments---we'd like to throw out the old p and reuse the space in the last line. And we'd like to have one piece of storage to hold the q on each successive iteration. So even if we solve David's problem, we still haven't matched the space requirements of the imperative code.

* DiffArrays are too expensive to be acceptable here. It's not even a question of unboxing. Let's say we keep the current array in fast, unboxed storage; this lets us read its elements using a single load. Each update still needs to retrieve the old data from the array and add it to the old-versions lookup table; together these operations are much more expensive than the actual update to the current version of the table. And we need to do this even though no old versions exist! We should be able to avoid this work entirely. (And, if old versions do exist, a DiffArray is the pessimal representation for them, given that we're updating the whole array.)

* Linear or uniqueness types are almost what we want. I think Josef Svenningsson was the one who captured this the best: uniqueness types *require* that the *caller* of these routines make sure that it is not sharing the data. So we need two copies of our linear algebra library---one which takes unique arrays as arguments and updates in place, and one which takes non-unique arrays and allocates. And we have to pick the right one based on context. What we want, it seems to me, is one library and a compiler which can make informed judgments.

* We could imagine tracking uniqueness dynamically at run time, using something like reference counting for all our arrays. But we need to do the reference counting precisely---this is pretty much the most inefficient way possible of tracking the storage, and doesn't play well at all with using efficient GC elsewhere in our programs. The inefficiency might be worth it for arrays, but Haskell is polymorphic and in many cases we need to treat all our data the same way.

* Finally, I'll observe that we often want to use slightly different algorithms depending upon whether we're updating in place or computing into fresh storage. Often copying the data and then updating it in place is not actually a good idea.

I'd love to hear if anyone has insights / pointers to related work on any of the issues above; I'm especially keen to learn if there's work I didn't know about on fusion of multiple traversals. In my day job with Fortress we are looking at RULES-like approaches, but they founder quickly because the kind of problems David is trying to solve are 90% of what our programmers want to do. -Jan-Willem Maessen

I've been following this discussion with interest, as I've been looking in some detail at conjugate gradient algorithms as part of my .. I'd love to hear if anyone has insights / pointers to related work on any of the issues above; I'm especially keen to learn if there's work I didn't know about on fusion of multiple traversals. In my day job with Fortress we are looking at RULES-like approaches, but they founder quickly, because the kinds of problems David is trying to solve are 90% of what our programmers want to do.
you probably know about these two already, but still:

- Pascal R. Serrarens used conjugate gradient as a case study for Clean, comparing against C and Haskell (that was back in 1996/7, when Haskell arrays weren't competitive): Implementing the Conjugate Gradient Algorithm in a Functional Language, http://www.st.cs.ru.nl/papers/1997/serp97-cgfunctional.ps.gz

- in spite of its name, SAC is a functional language; the central with-loop construct unifies several forms of array comprehensions, and their main fusion is named with-loop folding; whether it does the specific kinds of fusions you are after, i don't know, perhaps the paper i mentioned earlier answers that?

claus

* The algorithm as written already tries to express minimal storage. The only question is, do +=, -=, and := update their left-hand side in place, or do we think of them (in the Haskell model of the universe) as fresh arrays, and the previous values as newly-created garbage? My challenge to fellow Haskell hackers: find a way to express this such that it doesn't look so imperative.
* Even if we do a good job with += and -=, which is what David seems to be looking for, we still have those irritating := assignments--- we'd like to throw out the old p and reuse the space in the last line. And we'd like to have one piece of storage to hold the q on each successive iteration. So even if we solve David's problem, we still haven't matched the space requirements of the imperative code.
i find it difficult to discuss performance issues without concrete code examples, so i decided to convert Jan-Willem's loop code into Haskell. at first, i just naively translated the loop to recursion over lists, then i tried to produce an update-in-place variant based on some form of arrays, and finally i added a variant based on strict lists (would be nice to have standard libraries for (head-/spine-)strict lists, btw).

both the array and strict list versions avoid some intermediate structures; for the arbitrarily invented, relatively small inputs i've tried, strict lists are the clear winner, thanks to lower memory traffic, but i'd like some feedback from the experts:

- are there any obvious inefficiencies in the array code?
- what does your real code look like, in view of scaling to much larger inputs?
- i tried to keep the computations equal, but seem to have introduced a small variance in the strict list version, which i can't seem to spot by staring at it. any ideas?
- i assume the memory overhead of binary strict lists is unacceptable for large inputs, but i'm still waiting for those polymorphic bytestrings..;-)

while playing with this code, it occurred to me that situations like that described by David (small numbers of huge structures of constant size, with no nested pointers) are less suited for compacting garbage collection than for slot-reusing reference counting. GC wins in the common situation of many variable-sized, frequently created/destroyed structures, by touching only those objects that are live when space runs out, while RC has a higher continuous overhead.

as i mentioned earlier, i'm not sure whether update-in-place vs recreate is so bad for whole-array updates, but the real win could be not having to copy the huge live arrays around at GC time (either from old to new space, or within one space, to create contiguous free slots while compacting the occupied slots). so instead of update-in-place in virtual memory space, which will still be copied at GC time, one would want to pin the virtual region in which the arrays are allocated to a real memory region, and tell the GC to keep its hands off it (none of that space will be released until the loop ends, all of it will be released after the loop ends and a copy of z has been made).

does this make sense to the number crunchers and memory managers out there? and is there a way to allocate Haskell arrays to such pinned memory regions?

claus
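Claus's actual code isn't reproduced in the archive; as a rough sketch of what the strict-list variant might look like (the type, names, and representation here are all guesses, not the thread's code):

  {-# LANGUAGE BangPatterns #-}

  -- head- and spine-strict list of doubles: the bangs force both
  -- the element and the tail whenever a cell is constructed
  data SList = Nil | Cons !Double !SList

  -- dot product as a strict accumulator loop
  dotS :: SList -> SList -> Double
  dotS = go 0
    where
      go !acc (Cons x xs) (Cons y ys) = go (acc + x * y) xs ys
      go !acc _           _           = acc

Because the spine is strict, building such a list costs one fully evaluated cell per element up front, which is also why the per-element memory overhead Claus worries about for large inputs is real.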

Hello Claus, Sunday, March 11, 2007, 10:03:59 PM, you wrote:
both the array and strict list versions avoid some intermediate structures; for the arbitrarily invented, relatively small inputs i've tried, strict lists are the clear winner, thanks to lower memory traffic, but i'd like some feedback from the experts:
- are there any obvious inefficiencies in the array code?
obviously, the arrays version should create no temporary cells. the problems were mainly due to 2 factors:

1) readArray m (i,j)
2) 'op' in 'l', which was passed as a real closure and was not inlined due to a weakness of the ghc optimizer

also, we should help the strictness analyzer by marking all the variables used in tight loops as strict. after that is done, we got 1000 times less temporary data allocated and 5x faster execution. now it's a bit faster than strict lists

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com
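Bulat's patch itself isn't shown in the thread; the two fixes he names, plus the strictness annotations, would look roughly like this in a dot-product inner loop (a sketch under assumed types — real code would presumably obtain the bounds from the array rather than passing n explicitly):

  {-# LANGUAGE BangPatterns #-}
  import Data.Array.Base (unsafeRead)
  import Data.Array.IO   (IOUArray)

  -- strict index and accumulator, and unsafeRead instead of readArray:
  -- no bounds check, and no lazy index computation left behind as a thunk
  dotA :: Int -> IOUArray Int Double -> IOUArray Int Double -> IO Double
  dotA n a b = go 0 0
    where
      go !i !s
        | i >= n    = return s
        | otherwise = do
            x <- unsafeRead a i
            y <- unsafeRead b i
            go (i + 1) (s + x * y)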

Hi Bulat,
obviously, the arrays version should create no temporary cells.
that's why the memory traffic surprised me. i knew there had to be something wrong.
the problems were mainly due to 2 factors: 1) readArray m (i,j)
yes, indeed. since we are dealing in bulk operations, we might as well take advantage of that, so dropping the repeated bounds-checks inside the loops makes a lot of sense.

moral/note to self: bulk array operations are your friend (i knew that!-), but you need to use that when defining them (unsafeRead et al are only for library writers, but library writers ought to use them; and i was writing a small inlined library)
2) 'op' in 'l', which was passed as a real closure and was not inlined due to a weakness of the ghc optimizer
it irked me having to write the same loop twice, but i didn't consider the consequences. an INLINE pragma on l almost seems sufficient to regain the loss, so i prefer that; but writing out the loop twice is still a tiny tick faster..
we should help the strictness analyzer by marking all the variables used in tight loops as strict.
ah, there were a few surprises in that one. i thought i had considered possible strictness problems, but obviously, i missed a few relevant possibilities. annotating everything is the safe option, and clearly documents the intentions, but i cannot help thinking about which annotations could be omitted:

- modArray: a and i are both used anyway
- i index in loop is definitely checked (but j isn't, and some others weren't, either; so better safe than sorry)
- some of the parameters need not be annotated in this example, but should be if one wanted to reuse the code elsewhere
- the one i really don't get yet is the one on the accumulators (!s in l, in dotA/matA); i thought i had covered that by using $! in applying the loop, but annotating the formal loop parameter, apart from being nicer, also seems faster..

moral/note to self: don't try to be clever, try to be clear..; strictness in formal parameters is better than strictness in actual parameters; bang-patterns are good;-)
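The modArray mentioned in the first item isn't defined anywhere in the thread; here is a plausible reconstruction, purely to make the annotation question concrete (the signature, element type, and body are all assumptions):

  {-# LANGUAGE BangPatterns #-}
  import Data.Array.Base (unsafeRead, unsafeWrite)
  import Data.Array.IO   (IOUArray)

  -- read-modify-write of a single slot; the bangs force the array and
  -- index before they are handed on to unsafeRead/unsafeWrite
  {-# INLINE modArray #-}
  modArray :: IOUArray Int Double -> Int -> (Double -> Double) -> IO ()
  modArray !a !i f = do
    x <- unsafeRead a i
    unsafeWrite a i (f x)

Whether the bangs here are redundant is exactly what the exchange below settles: "used anyway" only helps if the callee is known to be strict in those arguments.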
after that is done, we got 1000 times less temporary data allocated and 5x faster execution. now it's a bit faster than strict lists
is this now anywhere near what the number-crunchers use, when they use Haskell?-) Bulat, thanks for looking into it and for isolating the issues so quickly!-) claus

Hello Claus, Monday, March 12, 2007, 3:35:43 AM, you wrote:
the problems were mainly due to 2 factors: 1) readArray m (i,j)
yes, indeed. since we are dealing in bulk operations, we might as well take advantage of that, so dropping the repeated bounds-checks inside the loops makes a lot of sense.
no, i was talking here only about the memory leaks. of course, unsafeRead omits the bounds checking, but more important in this case is that readArray created temporary memory cells - the index calculation for the matrix turns out not to be strict. it was the biggest surprise for me - i burned a lot of time adding 'seq' here and there before i even tried to replace this call with unsafeRead
moral/note to self: bulk array operations are your friend (i knew that!-), but you need to use that when defining them (unsafeRead et al are only for library writers, but library writers ought to use them; and i was writing a small inlined library)
switching the array operations to unsafe ones raised performance but, aside from this matrix indexing, didn't change memory usage
2) 'op' in 'l', which was passed as a real closure and was not inlined due to a weakness of the ghc optimizer
it irked me having to write the same loop twice, but i didn't consider the consequences. an INLINE pragma on l almost seems sufficient to regain the loss, so i prefer that; but writing out the loop twice is still a tiny tick faster..
afaik, ghc can't inline recursive functions. it would be great if ghc could automatically make specialized versions of functions it can't inline. so i wonder whether INLINE really helps anything? ... well, after conducting tests i see that INLINE works, but manual inlining works even better :))
we should help the strictness analyzer by marking all the variables used in tight loops as strict.
ah, there were a few surprises in that one. i thought i had considered possible strictness problems, but obviously, i missed a few relevant possibilities.
the problem is that the arrays also need to be annotated as strict. ghc is not as smart as you think, so it really needs everything to be annotated as strict in order to make strict calls (or maybe you just don't take into account that haskell is a lazy language and everything is lazy by default, which definitely doesn't help a program run faster :)
annotating everything is the safe option, and clearly documents the intentions, but i cannot help thinking about which annotations could be omitted:
- modArray: a and i are both used anyway
sure? they are just passed into the unsafeRead routine. whether they are *definitely* used or not depends on the strictness of unsafeRead in these arguments
- i index in loop is definitely checked (but j isn't, and some others weren't, either; so better safe than sorry)
again, checking in the form of "unless" strictifies your operation only if we know that 'unless' is strict in its first argument. using a guard makes i definitely strict (and in fact guards were used before 6.6 to "declare" parameters as strict):

  l i | i<=n = ...

- some of the parameters need not be annotated in this example, but should be if one wanted to reuse the code elsewhere
- the one i really don't get yet is the one on the accumulators (!s in l, in dotA/matA); i thought i had covered that by using $! in applying the loop, but annotating the formal loop parameter, apart from being nicer, also seems faster..

they are very different things! when you use $! in a call, you ensure that the parameter will be calculated *before* the actual call is made - otherwise, 'l' would be called with a closure that calculates the actual value on demand. so this omits creating a closure and therefore conserves memory; it is all we need to make memory usage as small as possible.

but unfortunately, this doesn't change 'l' itself. it remains lazy in its argument, so our strict call will just add a box around the computed value before passing it to 'l'.

otoh, when ghc knows that 'l' is strict in its parameter (due to strictness analysis, or with the help of our annotation), it creates l', a strict version of l whose actual parameter has Int# type. all calls to 'l' just unbox their parameter and call l'. of course, this means that your entire computation of the new index/accumulator will be done using Int# values, and the result will be passed to l' without additional boxing and unboxing. you see the difference :)))

moral/note to self: don't try to be clever, try to be clear..; strictness in formal parameters is better than strictness in actual parameters; bang-patterns are good;-)

when i saw your code, i thought that we should add more strictness to fix the memory leaks, and unsafe* operations to become a bit faster. the memory leak on the matrix read was a surprise for me too

after that is done, we got 1000 times less temporary data allocated and 5x faster execution. now it's a bit faster than strict lists

is this now anywhere near what the number-crunchers use, when they use Haskell?-)

it would be interesting if someone like Don or Simon could further improve it; i've learned a lot of tricks from them :)

Bulat, thanks for looking into it and for isolating the issues so quickly!-) claus

it was a challenge! i, as moderator of the arrays wiki page, can't allow any other data structures to be faster :)))
--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

1) readArray m (i,j)

yes, indeed. since we are dealing in bulk operations, we might as well take advantage of that, so dropping the repeated bounds-checks inside the loops makes a lot of sense.
no, i was talking here only about the memory leaks. of course, unsafeRead omits the bounds checking, but more important in this case is that readArray created temporary memory cells - the index calculation for the matrix turns out not to be strict. it was the biggest surprise for me - i burned a lot of time adding 'seq' here and there before i even tried to replace this call with unsafeRead
i'm not so sure about that conclusion;) i admit that i more often optimise for time than for space, so unless there's a real space leak to get rid of, i tend to measure space performance indirectly, by its impact on time, which is perhaps not a good habit. but i did a step-by-step rerun of the modifications, to see their effect on (time) performance:

  time ./CG array 100000       : 33s
  time ./CG_Bulat array 100000 :  8s

  33s : baseline, my original code
  30s : strict formal parameters in l, in dotA/matA
  22s : inline l, for +*=/-*=
  14s : replace readArray m (i,j) by unsafeRead m (index .. (i,j)),
        replace index by unsafeIndex, eliminating the bounds-check
  12s : same for readArray/writeArray v
  12s : eliminating the tuple in readMatrix makes no difference
   8s : seq-ing all parameters in l, +*=, dotA, matA

to handle the 2d indexing, i replaced readArray m (i,j) by readMatrix m (i,j):

  {-# INLINE readMatrix #-}
  readMatrix m ij = unsafeRead m (unsafeIndex matrixBounds ij)

  matrixBounds :: ((Int,Int),(Int,Int))
  matrixBounds = ((1,1),(n,n))

so we're still dealing with pairs, we just got rid of the bounds-checks in readArray/index, and that alone brings us from 22s to 14s (12s if we do the same for vectors), a substantial improvement. eliminating the tuples, passing i and j directly into the computation, doesn't seem to make any further difference (shifting the indices to avoid the decrements might, but not by much, certainly not enough to justify extending the arrays;-), so just getting rid of the bounds-check had sufficiently exposed the index computation already.

  -- readMatrix m i j = unsafeRead m $! ((i-1)*n+(j-1))

ensuring strictness of all formal parameters in the basic vector/matrix operations, through bang-patterns or seq, brings us from 33s to 30s, and from 12s to 8s, so that helps a lot. the inline pragma on l brings us from 30s to 22s, so that helps a lot, too.
afaik, ghc can't inline recursive functions. it would be great if ghc could automatically make specialized versions of functions it can't inline. so i wonder whether INLINE really helps anything?
perhaps it can't unroll the (conceptually infinite) body of the loop, but it can bring copies of the definition to the places where the op parameters are known.
(let f x = .. in f $! par) vs (let f !x = .. in f par)
so the difference is between passing evaluated parameters into functions that don't expect them, and passing parameters to functions that expect them evaluated. that makes sense to me: apart from the boxing/unboxing of evaluated parameters, the function body itself might look different. thanks, claus
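To make the contrast concrete, a small illustrative pair (names invented; for a loop this simple GHC's strictness analyser will usually work this out unaided, which is why the effect tends to show up only in more complex code):

  {-# LANGUAGE BangPatterns #-}

  -- strictness at the call site: the accumulator is evaluated before
  -- the call, but 'go' itself is still lazy in it, so the evaluated
  -- value gets re-boxed on entry
  sumTo1 :: Int -> Int
  sumTo1 n = go 0 1
    where
      go s i | i > n     = s
             | otherwise = (go $! s + i) (i + 1)

  -- strictness in the formal parameter: GHC can build a worker that
  -- keeps the accumulator as an unboxed Int# for the whole loop
  sumTo2 :: Int -> Int
  sumTo2 n = go 0 1
    where
      go !s i | i > n     = s
              | otherwise = go (s + i) (i + 1)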

to handle the 2d indexing, i replaced readArray m (i,j) by readMatrix m (i,j):
{-# INLINE readMatrix #-}
readMatrix m ij = unsafeRead m (unsafeIndex matrixBounds ij)
matrixBounds :: ((Int,Int),(Int,Int))
matrixBounds = ((1,1),(n,n))
i'm still trying to understand why unsafeIndex is so much faster than index. to prepare for further experiments, i tried to include the default implementation of the method index in my source (which includes all the other optimisations discussed here), as myindex:

  {-# INLINE readMatrix #-}
  readMatrix m ij = unsafeRead m (index matrixBounds ij)

  myindex b i | inRange b i = unsafeIndex b i
              | otherwise   = error "Error in array index"

now, the measurement that confuses me is this:

  time ./CG array 100000
  readMatrix calls index  : 16s
  readMatrix calls myindex:  9s

so just calling an in-module copy of the default code for index, with bounds-check, is almost as fast as calling unsafeIndex, and almost twice as fast as calling the official index.. can anyone explain this effect?

claus

Hello Claus, Monday, March 12, 2007, 6:03:28 PM, you wrote:
readMatrix calls index  : 16s
readMatrix calls myindex:  9s
so just calling an in-module copy of the default code for index, with bounds-check, is almost as fast as calling unsafeIndex, and almost twice as fast as calling the official index..
moreover, look at the +RTS -s results. i'm pretty sure that you will find a huge difference in memory allocation :)

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Mar 09, 2007 at 10:58:43AM -0500, Jan-Willem Maessen wrote:
* Linear or Uniqueness types are almost what we want. I think Josef Svenningsson was the one who captured this the best: Uniqueness types *require* that the *caller* of these routines make sure that it is not sharing the data. So we need two copies of our linear algebra library---one which takes unique arrays as arguments and updates in place, and one which takes non-unique arrays and allocates. And we have to pick the right one based on context. What we want, it seems to me, is one library and a compiler which can make informed judgments.
Wansbrough's simple usage polymorphism paper details an analysis that infers types that are 'polymorphic' in their usage qualities. The paper brings up the possibility of generating specialized versions of functions ('usage specialization') based on these types, which seems to be exactly the sort of thing you would want out of a compiler for the above. http://citeseer.ist.psu.edu/292462.html

It would also seem possible to actually pass these usage types at run-time and choose operations based on them. This seems like it would be an ideal way to go: it would behave in a very predictable way for Haskell programmers used to type specialization and Haskell classes, and the middle-end work wouldn't be that much, as System F is already typed and these usage types would behave like other type variables in a lot of ways after their initial inference. Usage specialization could occur via the same RULES mechanism that type specialization uses now...
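A purely hypothetical sketch of the compile-time half of that idea — none of these functions exist in any library; 'unique' stands in for a marker that a usage-inference pass would insert, and the NOINLINE pragmas keep the rule from being bypassed by inlining:

  newtype Vec = Vec [Double]

  plus, plusInPlace :: Vec -> Vec -> Vec
  plus (Vec a) (Vec b) = Vec (zipWith (+) a b)  -- allocating version
  plusInPlace = plus  -- placeholder for an update-in-place version

  -- marker the usage analysis would wrap around arguments
  -- it has proven unshared
  unique :: Vec -> Vec
  unique = id

  {-# NOINLINE plus #-}
  {-# NOINLINE unique #-}
  {-# RULES
  "plus/unique" forall a b. plus (unique a) b = plusInPlace a b
    #-}

The run-time variant described above would instead dispatch on a uniqueness bit carried with the value, more like a class dictionary than a rewrite rule.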
I'd love to hear if anyone has insights / pointers to related work on any of the issues above; I'm especially keen to learn if there's work I didn't know about on fusion of multiple traversals. In my day job with Fortress we are looking at RULES-like approaches, but they founder quickly, because the kinds of problems David is trying to solve are 90% of what our programmers want to do.
I think the simple usage polymorphism paper (and the thesis that goes along with it) can provide some fertile ideas for exploration here..

John
--
John Meacham - ⑆repetae.net⑆john⑈
participants (18)
- Brandon S. Allbery KF8NH
- Bulat Ziganshin
- Claus Reinke
- Dan Piponi
- Dan Weston
- David Roundy
- dons@cse.unsw.edu.au
- Isaac Dupree
- Jan-Willem Maessen
- John Meacham
- Josef Svenningsson
- Ketil Malde
- Matthias Neubauer
- Sebastian Sylvan
- Simon Marlow
- Simon Peyton-Jones
- Stefan O'Rear
- Thomas Conway