DiffArray Performance

newer
Special Invitation :-) HC&A Report...

Adrian Hey

27 Oct 2003 27 Oct '03

11:25 a.m.

Hello, I've been trying to optimise the following code.. -- Search Tree data type newtype STree = STree (Array Int (STree,[Match])) -- Initial value for Search Tree sTree0 :: STree sTree0 = STree (array (0,9) [(n,(sTree0,[]))| n <- [0..9]]) -- Make the search tree from a list of words makeSTree :: [String] -> STree makeSTree ws = foldl' putWord sTree0 pairs where pairs = [let ps = packString w in ps `seq` (word2keys w, MatchW ps) | w<-ws] word2keys cs = [getKey (toUpper c) | c <- cs, c /= '"' , c /= '-' ] putWord stree (keys,m) = put keys stree where put [] _ = error "makeSTree: empty Keys" put [k] (STree a) = let (t,ms) = a ! k in STree (a // [(k,(t,m:ms))]) put (k:ks) (STree a) = let (t,ms) = a ! k t' = put ks t in t' `seq` STree (a // [(k,(t',ms))]) This seems to be taking about 4.8 seconds (of 5.1 seconds total program execution time) for the input I'm using. I thought using DiffArrays might be faster, but no such luck. Execution time rose to 9.5 *minutes*. Is this what I should expect to see? I'm using ghc 6.0, BTW. Thanks -- Adrian Hey

Show replies by date

Alastair Reid

27 Oct 27 Oct

3:17 p.m.

...

[Binary tree] seems to be taking about 4.8 seconds (of 5.1 seconds total program execution time) for the input I'm using. I thought using DiffArrays might be faster, but no such luck. Execution time rose to 9.5 *minutes*.

Is this what I should expect to see?

You don't mention _how_ you use diffarrays so I'll guess you're keeping the array in ascending order, finding the insertion point by binary search and inserting by copying bigger elements one place to the right. Using binary trees (the code you copied), the cost of N insertions will be O(N * log N) assuming the input is in random order. Using 'true' arrays (all operations constant time), the cost of copying values will dominate and inserting N elements will take O(N^2) time. So, even in ideal circumstances, I'd expect it to be much slower for large N. I think DiffArrays where most elements have been assigned to more than once have constant time update but linear time lookup. So the total cost of moving the elements when you insert is still O(N^2) but finding where to insert will cost something closer to O(N * N * log N) since the cost of a single binary search will be O(N * log N). Ignoring all the constant factors (a dubious thing to do) and choosing N=10^6, these relative costs look something like this: binary tree: 10^6 * 20 = 2 * 10^7 true array: 10^6 * 10^6 = 10^12 diff array: 10^6 * 10^6 * 20 = 2 * 10^13 (The constant overheads would have to be pretty large to mask differences like this.) -- Alastair Reid www.haskell-consulting.com

Adrian Hey

6:21 p.m.

On Monday 27 Oct 2003 3:17 pm, Alastair Reid wrote:

...

You don't mention _how_ you use diffarrays so I'll guess you're keeping the array in ascending order, finding the insertion point by binary search and inserting by copying bigger elements one place to the right.

...

Using binary trees (the code you copied), the cost of N insertions will be O(N * log N) assuming the input is in random order.

Thanks for your reply. Maybe I'm being dense, but I'm affraid I don't understand the relevance of your answer to my problem :-( I guess I should have been a bit clearer in my explanation of what the code is doing. Here's the code again..

...

-- Search Tree data type newtype STree = STree (Array Int (STree,[Match])) -- Initial value for Search Tree sTree0 :: STree sTree0 = STree (array (0,9) [(n,(sTree0,[]))| n <- [0..9]])

-- Make the search tree from a list of words makeSTree :: [String] -> STree makeSTree ws = foldl' putWord sTree0 pairs where pairs = [let ps = packString w in ps `seq` (word2keys w, MatchW ps) | w<-ws] word2keys cs = [getKey (toUpper c) | c <- cs, c /= '"' , c /= '-' ] putWord stree (keys,m) = put keys stree where put [] _ = error "makeSTree: empty Keys" put [k] (STree a) = let (t,ms) = a ! k in STree (a // [(k,(t,m:ms))]) put (k:ks) (STree a) = let (t,ms) = a ! k t' = put ks t in t' `seq` STree (a // [(k,(t',ms))])

This generates a 10 way (Decary?) search tree, each vertex of which is an ordinary Haskell98 Array of 10 elements. Each edge is annotated by any string which encode a particular sequence of decimal digits (keys), which give the path to the vertex in question. To find any words which encode a particular sequence of digits the corresponding search function (not included above) just looks down the relevant branches (1 for each digit), until no more digits are left. Whatever annotations are on the last edge are the words that encode the digit sequence (if any). The search is very fast, it's the construction of the search tree in the first place that seems a bit slow (extremely slow using DiffArrays). The test used DiffArray instead of a Haskell98 array..

...

-- Search Tree data type newtype STree = STree (DiffArray Int (STree,[Match]))

The code is otherwise identical, so any difference in execution time must be caused the difference between reading/writing the respective arrays. I wouldn't expect them to be identical but for DiffArrays to be over 100 times slower seems a bit strange (especially for a relatively small array of 10 elements). That's an O(n^2) difference somewhere (neglecting any constant factors). I'm just wondering if there's a bug in their implementation? or am I using them incorrectly? and generally seeking advice about any faster arrays I could try. Regards -- Adrian Hey

Adrian Hey

28 Oct 28 Oct

2:46 p.m.

Hello again, Another thought.. Could it be that sTree0 is cyclic that accounts for this dramatic slow down? I'm not to sure how DiffArray are implemented, but if it's how I would do it it seems you would end up with a massive chain of indirections. Actually, it's probably not a good idea to have a DiffArray as a top level CAF, cyclic or otherwise? Hmm.. I think that must be it. On Monday 27 Oct 2003 6:21 pm, Adrian Hey wrote:

...

...
-- Search Tree data type newtype STree = STree (Array Int (STree,[Match])) -- Initial value for Search Tree sTree0 :: STree sTree0 = STree (array (0,9) [(n,(sTree0,[]))| n <- [0..9]])

<snip>

...

The code is otherwise identical, so any difference in execution time must be caused the difference between reading/writing the respective arrays. I wouldn't expect them to be identical but for DiffArrays to be over 100 times slower seems a bit strange (especially for a relatively small array of 10 elements). That's an O(n^2) difference somewhere (neglecting any constant factors).

7925

Age (days ago)

7926

Last active (days ago)

List overview

Download

3 comments

2 participants

participants (2)

Adrian Hey
Alastair Reid