Laziness and Either

Back when I was working on the logic for the bin-packing solver that I added to MissingH (for use with datapacker), I had a design decision to make: do I raise runtime errors with the input using error, or do I use an Either type to return errors? Initially, for simplicity, I just used error. But when I did a simple refactoring to use Either, it occurred to me that this switch likely had a negative impact on laziness. In this particular algorithm, we cannot tell for sure that we have no errors until we have consumed all items of the input list. This is unlike, say, a safe version of "head" where you can tell whether you have an error just by whether you have an empty list or not. In the case of using "error", we can happily process the data assuming everything will be fine, and raise an error if and when it is encountered. By using Either, however, any pattern match on the Left/Right result is going to force the entire input to be evaluated so that we can know whether or not it had any error. Is this analysis sensible? If so, are there better solutions? BTW, here are links to the code I'm talking about: http://git.complete.org/datapacker?a=blob;f=Scan.hs;h=d4e8ac8d6e883f342096a4... line 32-43 -- version that uses error http://git.complete.org/datapacker?a=blob;f=Scan.hs;h=acd53739f5d871a3ce7ae6... line 32-46 -- version that uses Either http://git.complete.org/datapacker?a=commitdiff;h=0d4e3ee6c4e5c780009ad5c09d... -- diff between them -- John

On Apr 21, 2008, at 11:18 AM, John Goerzen wrote:
In the case of using "error", we can happily process the data assuming everything will be fine, and raise an error if and when it is encountered. By using Either, however, any pattern match on the Left/Right result is going to force the entire input to be evaluated so that we can know whether or not it had any error.
Is this analysis sensible? If so, are there better solutions?
In another thread back on March 12 ("Exception handling when using STUARrray") I read `Since the decision between Left and Right requires all parts to be evaluated, it's Either that might too strict for your application. What about a writer monad, where exceptions, or better say warnings, are written to the writer stream?' I don't believe I ever got around to evaluating that recommendation (lazy evaluation strategy, indeed.) I think I would use "error". Donn Cave, donn@u.washington.edu

On Mon, 21 Apr 2008, Donn Cave wrote:
On Apr 21, 2008, at 11:18 AM, John Goerzen wrote:
In the case of using "error", we can happily process the data assuming everything will be fine, and raise an error if and when it is encountered. By using Either, however, any pattern match on the Left/Right result is going to force the entire input to be evaluated so that we can know whether or not it had any error.
Is this analysis sensible? If so, are there better solutions?
In another thread back on March 12 ("Exception handling when using STUARrray") I read `Since the decision between Left and Right requires all parts to be evaluated, it's Either that might too strict for your application. What about a writer monad, where exceptions, or better say warnings, are written to the writer stream?'
I don't believe I ever got around to evaluating that recommendation (lazy evaluation strategy, indeed.) I think I would use "error".
Was this advise from me? I would advise the Writer monad again. 'error' should be used only for programming errors, not for unexpected data or situations from the outside world. http://www.haskell.org/haskellwiki/Exception http://www.haskell.org/haskellwiki/Error

But when I did a simple refactoring to use Either, it occurred to me that this switch likely had a negative impact on laziness.
Yes, probably.
Is this analysis sensible? If so, are there better solutions?
I notice that your Maybe-based code is written in a monadic style. If you also use Applicative Functors with an appropriately lazy instance for Maybe, you may be able to be both lazy (where needed) and strict (where needed) just by mixing and matching do-notation with apply- notation. The technique is explored in this paper about partial parsing: http://www-users.cs.york.ac.uk/~malcolm/partialparse.html Regards, Malcolm

John Goerzen wrote:
Back when I was working on the logic for the bin-packing solver that I added to MissingH (for use with datapacker), I had a design decision to make: do I raise runtime errors with the input using error, or do I use an Either type to return errors?
Initially, for simplicity, I just used error. But when I did a simple refactoring to use Either, it occurred to me that this switch likely had a negative impact on laziness.
In this particular algorithm, we cannot tell for sure that we have no errors until we have consumed all items of the input list. This is unlike, say, a safe version of "head" where you can tell whether you have an error just by whether you have an empty list or not.
In the case of using "error", we can happily process the data assuming everything will be fine, and raise an error if and when it is encountered. By using Either, however, any pattern match on the Left/Right result is going to force the entire input to be evaluated so that we can know whether or not it had any error.
Is this analysis sensible? If so, are there better solutions?
This reminds me of a problem I had in dataenc[1], when trying to decode large files I ran out of memory. It was simple to track that down to my decision to use Maybe for the return type. Originally I only had one function for decoding: decode :: String -> Maybe [Word8] decode = sequence . decode' In order to allow lazy decoding I ended up exporting decode' as well: decode' :: String -> [Maybe Word8] Unfortunately that pushes handling of decoding errors to the layer above. Not ideal maybe, but it was the easiest solution. Duncan mentioned that iconv took a similar route to provide laziness[2]. I'm curious to see whether Malcom's paper can offer some further insight. /M [1]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dataenc [2]: http://www.haskell.org/pipermail/haskell-cafe/2007-December/036282.html -- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus.therning@gmail.com http://therning.org/magnus What if I don't want to obey the laws? Do they throw me in jail with the other bad monads? -- Daveman

On Mon April 21 2008 3:26:04 pm Magnus Therning wrote:
In order to allow lazy decoding I ended up exporting decode' as well:
decode' :: String -> [Maybe Word8]
I take it that in a situation like this, you'd have either: [] -- success with empty result a list full of Just x -- success with valid results a list with 0 or more Just x, followed by one Nothing -- an error Makes sense to me. What impact does this have on performance? Also, I wonder if there is some call for tools in Data.Either to support this type of usage? For example: type EitherList a b = [Either a b] then some functions such as, say, mapEither or foldEither that act like try/catch: a special function to use for an exception, and otherwise they "unwrap" the Right side passing it along to others. -- John

John Goerzen wrote:
On Mon April 21 2008 3:26:04 pm Magnus Therning wrote:
In order to allow lazy decoding I ended up exporting decode' as well:
decode' :: String -> [Maybe Word8]
I take it that in a situation like this, you'd have either:
[] -- success with empty result
a list full of Just x -- success with valid results
a list with 0 or more Just x, followed by one Nothing -- an error
Makes sense to me. What impact does this have on performance?
I think that using [Maybe a] for this purpose is too fine-grained, I would use a custom list type data River a = a :< (River a) | Done | Error (I didn't want to call it Stream because that name is too overloaded already and PartialList is too long :) The three constructors correspond to the three cases you mention. In particular, Error takes the role of the last Nothing . In other words, we just replace the usual end of list [] with another data type. Thus, the general version is data River b a = a :< (River a) | End b Of course, this type is isomorphic to River b a ~ (b, [a]) The latter just puts the end result up front which is the original idea for lazy parsing: report the error b but also return a (partial) result [a] .
Also, I wonder if there is some call for tools in Data.Either to support this type of usage? For example:
type EitherList a b = [Either a b]
then some functions such as, say, mapEither or foldEither that act like try/catch: a special function to use for an exception, and otherwise they "unwrap" the Right side passing it along to others.
The River thing has the drawback that you have to rewrite all the standard list functions, but since you're ready to accept that for in the case of [Either a b] anyway, you can as well use the River thing. Regards, apfelmus

On Wed, Apr 23, 2008 at 01:12:16PM +0200, apfelmus wrote:
I think that using [Maybe a] for this purpose is too fine-grained, I would use a custom list type
data River a = a :< (River a) | Done | Error
(I didn't want to call it Stream because that name is too overloaded already and PartialList is too long :) The three constructors correspond to the three cases you mention. In particular, Error takes the role of the last Nothing .
That sounds like a good idea. But I'd call it Creek, because a river is present year-round, while a creek sometimes dries up in the summer. :) And I'd also vote for adding a String (or more generic parameter) to the Error type, so you can give some sort of reporting when something goes wrong. String is appealing, because then you could make Creek a monad, and fail could generate a nice Error message (assuming fail = Error). Of course, you could take the silly metaphor even further data Creek a = a :< (Creek a) | Ocean | Drought String :) -- David Roundy Department of Physics Oregon State University

On Mon, 2008-04-21 at 21:26 +0100, Magnus Therning wrote:
John Goerzen wrote:
Back when I was working on the logic for the bin-packing solver that I added to MissingH (for use with datapacker), I had a design decision to make: do I raise runtime errors with the input using error, or do I use an Either type to return errors?
Initially, for simplicity, I just used error. But when I did a simple refactoring to use Either, it occurred to me that this switch likely had a negative impact on laziness.
In this particular algorithm, we cannot tell for sure that we have no errors until we have consumed all items of the input list. This is unlike, say, a safe version of "head" where you can tell whether you have an error just by whether you have an empty list or not.
In the case of using "error", we can happily process the data assuming everything will be fine, and raise an error if and when it is encountered. By using Either, however, any pattern match on the Left/Right result is going to force the entire input to be evaluated so that we can know whether or not it had any error.
Is this analysis sensible? If so, are there better solutions?
This reminds me of a problem I had in dataenc[1], when trying to decode large files I ran out of memory. It was simple to track that down to my decision to use Maybe for the return type. Originally I only had one function for decoding:
decode :: String -> Maybe [Word8] decode = sequence . decode'
In order to allow lazy decoding I ended up exporting decode' as well:
decode' :: String -> [Maybe Word8]
Unfortunately that pushes handling of decoding errors to the layer above. Not ideal maybe, but it was the easiest solution.
Duncan mentioned that iconv took a similar route to provide laziness[2].
I'm curious to see whether Malcom's paper can offer some further insight.
/M
[1]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dataenc [2]: http://www.haskell.org/pipermail/haskell-cafe/2007-December/036282.html
Indeed, years ago this is exactly what I suggested as a possible solution. See http://web.archive.org/web/20061011010615/http://www.haskell.org/hawiki/Tyin... near the bottom.

On Mon, Apr 21, 2008 at 2:18 PM, John Goerzen
Back when I was working on the logic for the bin-packing solver that I added to MissingH (for use with datapacker), I had a design decision to make: do I raise runtime errors with the input using error, or do I use an Either type to return errors?
Initially, for simplicity, I just used error. But when I did a simple refactoring to use Either, it occurred to me that this switch likely had a negative impact on laziness.
In this particular algorithm, we cannot tell for sure that we have no errors until we have consumed all items of the input list. This is unlike, say, a safe version of "head" where you can tell whether you have an error just by whether you have an empty list or not.
In the case of using "error", we can happily process the data assuming everything will be fine, and raise an error if and when it is encountered. By using Either, however, any pattern match on the Left/Right result is going to force the entire input to be evaluated so that we can know whether or not it had any error.
Is this analysis sensible? If so, are there better solutions?
From my perspective, the version using Either is lazier. By delaying any further processing until the input is known to be correct, it avoids doing unnecessary work.
On the other hand, if space is more of a concern than time, then you
could combine the processing with a left fold that accumulates an
answer.
process :: [Input] -> Either Error [Output]
-- lazy
processAccum :: (seed -> Input -> seed) -> seed -> [Input] -> Either Error seed
-- thrifty
The type of 'processAccum' is essentially an encoding of the River
type apfelmus proposed. It's also related to Oleg's left-fold
enumerator pattern[1].
[1] http://okmij.org/ftp/Computation/Continuations.html#enumerator-stream
--
Dave Menendez
participants (9)
-
apfelmus
-
David Menendez
-
David Roundy
-
Derek Elkins
-
Donn Cave
-
Henning Thielemann
-
John Goerzen
-
Magnus Therning
-
Malcolm Wallace