Takusen and strictness

I'm still fiddling with simple database queries using Takusen. One question I have is regarding strictness. I've just been reading "Haskell IO for Imperative Programmers" and loved the idea that laziness (rather than monads) is what makes IO in Haskell so simple. What I'm not sure about is whether Takusen offers a way of lazily getting data from a database (and a meta-question - how do I work this out for myself?!) With the code

{-# OPTIONS -fglasgow-exts #-}

import Database.Oracle.Enumerator
import Database.Enumerator

runSql :: String -> IO [String]
runSql s = withSession (connect "USER" "PASSWD" "DB")
    ( do
        let iter (s::String) accum = result (s : accum)
        r <- doQuery (sql s) iter []
        return r
    )

main :: IO ()
main = do
    r <- runSql "select username from all_users"
    putStrLn (unlines r)

I can do a simple database query, process the results, and produce output. The structure of main is much like an interact loop, but with the input from a database rather than stdin. But will this read the database lazily, or will it get all the rows into memory at once? How will using result' instead of result (in runSql) affect this? And as I said above, how can I learn to work this out for myself? I know I could just ignore the issue until performance becomes a problem and then profile - but that ignores the question of whether I've got the idiom right. Thanks for any help, Paul.

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Paul Moore
But, will this read the database lazily, or will it get all the rows into memory at once? How will using result' instead of result (in runSql) affect this? And as I said above, how can I learn to work this out for myself?
runSql :: String -> IO [String]
runSql s = withSession (connect "USER" "PASSWD" "DB")
    ( do
        let iter (s::String) accum = result (s : accum)
        r <- doQuery (sql s) iter []
        return r
    )
The iteratee function:
let iter (s::String) accum = result (s : accum)
is fed one row at a time. Takusen can fetch rows from the DBMS one at a time (rather slow across a network) or fetch them in chunks (i.e. cached) and feed them to the iteratee one at a time, which is obviously much better from a network performance perspective. Currently, the default is to fetch rows in chunks of 100. So getting rows from the database is more-or-less "on-demand", but it's transparent from your point-of-view, because your iteratee only gets them one at a time.

What you're interested in, I think, is what the iteratee does with the data. In your case, it conses each username onto the front of a list, which is initially empty. Because you're using result (not result') this consing is lazy, but doQuery still builds a huge tree of unevaluated thunks, because it's operating in the IO monad (well, the DBM monad, but you get the idea). That big tree of unevaluated thunks may give you trouble if the number of rows in the result-set is large. This is why we recommend the result' function: it uses $! to force the cons to be strict. It may not matter so much for lists and other data structures, but for arithmetic you certainly do not want to build up a huge tree of thunks, which is what happens if you use result (non-strict) and "+" (say).

If you don't need the entire list at once, then push your processing into the iteratee. You are not obliged to return all of the data from doQuery in one big data structure, like a list. You can do IO in the iteratee, and even just return (). If you want to terminate the fetch early, you can return (Left <something>) in the iteratee, rather than (Right <something>). result and result' always return (Right <something>), so they process the result-set all the way through.

I'm sure someone asked Oleg and me about lazy result-set processing in Takusen (and why it's not done) a few months ago (a private email, I think), but right now I'm unable to find our response.

Alistair
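To make the result'/early-termination point concrete, here is a minimal sketch in the style of the runSql above: it accumulates strictly with result' and stops after a fixed number of rows by returning Left. It assumes the same Takusen API and -fglasgow-exts pragma as the original code; the exact shape of the early-return value may differ between Takusen versions, so treat it as an illustration rather than a definitive recipe.

{-# OPTIONS -fglasgow-exts #-}

import Database.Oracle.Enumerator
import Database.Enumerator

-- Fetch at most n rows, accumulating strictly (no thunk build-up),
-- and terminate the fetch early once we have enough.
runSqlFirstN :: Int -> String -> IO [String]
runSqlFirstN n s = withSession (connect "USER" "PASSWD" "DB")
    ( do
        let iter (u::String) acc
              | length acc >= n = return (Left acc)    -- stop fetching early
              | otherwise       = result' (u : acc)    -- strict cons via $!
        r <- doQuery (sql s) iter []
        return (reverse r)                             -- rows were consed in reverse order
    )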

On 02/03/07, Bayley, Alistair wrote:
What you're interested in, I think, is what the iteratee does with the data.
That's correct.
In your case, it conses each username onto the front of a list, which is initially empty. Because you're using result (not result') this consing is lazy, but doQuery still builds a huge tree of unevaluated thunks, because it's operating in the IO monad (well, the DBM monad, but you get the idea). That big tree of unevaluated thunks may give you trouble if the number of rows in the result-set is large. This is why we recommend the result' function: it uses $! to force the cons to be strict. It may not matter so much for lists and other data structures, but for arithmetic you certainly do not want to build up a huge tree of thunks, which is what happens if you use result (non-strict) and "+" (say).
That's the impression I got - originally, I was using result' but I thought I'd try result to see if it would improve the laziness. Then I realised that, short of profiling memory use and/or using a bigger query, I couldn't actually tell :-)
If you don't need the entire list at once, then push your processing into the iteratee.
Hmm, that's what I was trying to avoid. The article I mentioned made a strong point that laziness allows you to factor out processing from IO - so you can write (for example)

main = do
    s <- getContents
    let r = map processIt (lines s)
    putStr (unlines r)

and laziness means that IO is performed "on demand", so that the above code never has to read the whole input into memory. I was hoping to do something similar for database access, with runSql taking the place of getContents. Having to incorporate "processIt" into the database access code breaks this idiom.
You are not obliged to return all of the data from doQuery in one big data structure, like a list. You can do IO in the iteratee, and even just return ().
That's what my earlier code looked like, and I found it harder to understand than the getContents/process/put approach. I'm trying to explore ways of factoring data manipulation code out of database access functions, but maybe that's not the right way of doing it.
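One halfway house (a hedged sketch, assuming the same Takusen API and -fglasgow-exts pragma as the code at the top of the thread) is to keep the per-row processing as an ordinary function and pass it into the query runner, so the data manipulation is defined outside the database code even though it is applied inside the iteratee:

runSqlWith :: (String -> String) -> String -> IO [String]
runSqlWith processIt s = withSession (connect "USER" "PASSWD" "DB")
    ( do
        -- processIt is applied row by row inside the iteratee,
        -- but is supplied by the caller, away from the database code
        let iter (u::String) acc = result' (processIt u : acc)
        doQuery (sql s) iter []
    )

-- e.g. runSqlWith (map toUpper) "select username from all_users"
--      (map toUpper needs Data.Char in the calling module)

This doesn't make the fetch lazy - the whole result list is still built - but it does keep the row-processing function decoupled from the database access.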
I'm sure someone asked Oleg and I about lazy result-set processing in Takusen (and why it's not done) a few months ago (a private email, I think), but right now I'm unable to find our response.
That sounds like it's the same as what I'm trying to ask, so if you do find the response, I'd be interested. Thanks for the explanation. Paul.

From: Paul Moore [mailto:p.f.moore@gmail.com]
If you don't need the entire list at once, then push your processing into the iteratee.
Hmm, that's what I was trying to avoid. The article I mentioned made a strong point that laziness allows you to factor out processing from IO - so you can write (for example)
main = do
    s <- getContents
    let r = map processIt (lines s)
    putStr (unlines r)
and laziness means that IO is performed "on demand", so that the above code never has to read the whole input into memory. I was hoping to do something similar for database access, with runSql taking the place of getContents. Having to incorporate "processIt" into the database access code breaks this idiom.
There's a big difference between getContents and Takusen: getContents has a non-trivial implementation (using unsafeInterleaveIO) that allows it to return data lazily. Takusen has no such implementation. I'm not sure if it would be possible. I don't really understand how getContents works; is there any advice or guidelines as to how to use (or abuse) unsafeInterleaveIO? Some googling has found:

http://therning.org/magnus/archives/249
http://www.haskell.org/pipermail/haskell-cafe/2007-January/021373.html
http://www.haskell.org/pipermail/haskell-cafe/2007-January/021407.html
http://haskell.org/haskellwiki/IO_inside#unsafePerformIO_and_unsafeInterleaveIO
That's what my earlier code looked like, and I found it harder to understand than the getContents/process/put approach. I'm trying to explore ways of factoring data manipulation code out of database access functions, but maybe that's not the right way of doing it.
I don't think it's possible to pursue this style of programming with Takusen. If you do, you'll have to process the entire result-set into a data structure and then process it, which has obvious memory implications.

Alistair

On 02/03/07, Bayley, Alistair wrote:
There's a big difference between getContents and Takusen: getContents has a non-trivial implementation (using unsafeInterleaveIO) that allows it to return data lazily. Takusen has no such implementation.
... ie, there's deep dark magic involved in the seemingly simple getContents, which isn't easily available to mere mortals (or even semi-immortal library designers). That figures. It's a shame, but not totally surprising.
That's what my earlier code looked like, and I found it harder to understand than the getContents/process/put approach. I'm trying to explore ways of factoring data manipulation code out of database access functions, but maybe that's not the right way of doing it.
I don't think it's possible to pursue this style of programming with Takusen. If you do, you'll have to process the entire result-set into a data structure and then process it, which has obvious memory implications.
Oh, well. It's mostly irrelevant for me anyway, as the data sets I'm actually playing with are small enough that slurping them into memory isn't an issue - so I just choose between a simple and decoupled implementation or a more complex and scalable one, which is a fairly standard optimisation choice. Thanks for clarifying. Paul.

The "deep, dark, Aslan magic" of getContents is usually safe to use because it's a read-only operation. Some of the dangerous corners of getContents are: what happens if the file is altered while we read it lazily? This is the sort of question that the sequencing notion of the IO monad is supposed to solve by saying: we read it all at once, so no one can alter it while we do that. (getContents might lock the file--I don't know, but that would introduce its own corner cases). Can a database fetch have such magic? I think yes, it could probably be implemented the exact same way as getContents. However, whereas concurrent access to files on a filesystem is rare and hence a corner case, concurrent access to a database is usually the norm, and hence not a corner case. In that sense, runSqlLazily might be a helpful function that is often dangerous for A-Consistency-ID reasons. However, ACID-like concerns are not new for DB people, so if the DB server supports transactions, then perhaps Takusen could set up a server-maintained transaction and then read it on-demand. At this point, my arms have become weary from all of the hand-waiving, so I'll ask for discussion now... On 3/2/07, Paul Moore
On 02/03/07, Bayley, Alistair
wrote: There's a big difference between getContents and Takusen: getContents has a non-trivial implementation (using unsafeInterleaveIO) that allows it to return data lazily. Takusen has no such implementation.
... ie, there's deep dark magic involved in the seemingly simple getContents, which isn't easily available to mere mortals (or even semi-immortal library designers). That figures. It's a shame, but not totally surprising.
That's what my earlier code looked like, and I found it harder to understand than the getContents/process/put approach. I'm trying to explore ways of factoring data manipulation code out of database access functions, but maybe that's not the right way of doing it.
I don't think it's possible to pursue this style of programming with Takusen. If you do, you'll have to process the entire result-set into a data structure and then process it, which has obvious memory implications.
Oh, well. It's mostly irrelevant for me anyway, as the data sets I'm actually playing with are small enough that slurping them into memory isn't an issue - so I just choose between a simple and decoupled implementation or a more complex and scalable one, which is a fairly standard optimisation choice.
Thanks for clarifying. Paul.
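Here is the sketch referred to above. It is purely illustrative: fetchRow is a hypothetical "give me the next row, if any" action standing in for whatever low-level fetch a driver exposes (it is not a real Takusen function). The concurrency and transaction caveats discussed above still apply: the cursor and connection must stay usable for as long as the lazily produced list is being consumed.

import System.IO.Unsafe (unsafeInterleaveIO)

-- Stream rows on demand, getContents-style, from any "next row" action.
lazyRows :: IO (Maybe row) -> IO [row]
lazyRows fetchRow = unsafeInterleaveIO $ do
    mr <- fetchRow
    case mr of
      Nothing -> return []                    -- end of the result set
      Just r  -> do
        rest <- lazyRows fetchRow             -- deferred until demanded
        return (r : rest)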

On Fri, Mar 02, 2007 at 02:58:21PM +0000, Paul Moore wrote:
On 02/03/07, Bayley, Alistair
wrote: There's a big difference between getContents and Takusen: getContents has a non-trivial implementation (using unsafeInterleaveIO) that allows it to return data lazily. Takusen has no such implementation.
... ie, there's deep dark magic involved in the seemingly simple getContents, which isn't easily available to mere mortals (or even semi-immortal library designers). That figures. It's a shame, but not totally surprising.
I think I understand it ... here are some illustrative (I hope!) examples:

stefan@stefans:~$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.7.20070223, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :m + System.IO. System.IO.Error System.IO.Unsafe
Prelude> :m + System.IO.Unsafe
Prelude System.IO.Unsafe> foo <- unsafeInterleaveIO (putStr "foo")
-- note that IO does NOT happen immediately
Prelude System.IO.Unsafe> show foo
-- but forcing it causes the IO to happen (unsafely interleaved with printing (pun intentional))
"foo()"
Prelude System.IO.Unsafe> show foo
-- but now that it is in WHNF, forcing it again has no effect (laziness)
"()"
Prelude System.IO.Unsafe> -- a more interesting case is using unsafeInterleaveIO in list recursion
Prelude System.IO.Unsafe> let myGetContents = unsafeInterleaveIO $ do { ch <- getChar; chs <- myGetContents ; return (ch:chs) }
Prelude System.IO.Unsafe> -- simplified by omitting support for EOF handling and block reads
Prelude System.IO.Unsafe> print . map reverse . lines =<< myGetContents
f["oo?
?oof"Interrupted.
Prelude System.IO.Unsafe> mapM_ putStrLn . map reverse . lines =<< myGetContents
foo?
?oof
bar!
!rab
muahahaha.
.ahahahaum
^D^?
Interrupted.
Prelude System.IO.Unsafe>
That's what my earlier code looked like, and I found it harder to understand than the getContents/process/put approach. I'm trying to explore ways of factoring data manipulation code out of database access functions, but maybe that's not the right way of doing it.
I don't think it's possible to pursue this style of programming with Takusen. If you do, you'll have to process the entire result-set into a data structure and then process it, which has obvious memory implications.
Oh, well. It's mostly irrelevant for me anyway, as the data sets I'm actually playing with are small enough that slurping them into memory isn't an issue - so I just choose between a simple and decoupled implementation or a more complex and scalable one, which is a fairly standard optimisation choice.
Thanks for clarifying. Paul.
HTH Stefan

Paul Moore wrote:
... ie, there's deep dark magic involved in the seemingly simple getContents, which isn't easily available to mere mortals (or even semi-immortal library designers).
That's really not true. getContents looks simple from the outside, and it *can* be simple underneath, too. You can write a getContents on an arbitrary source of data with just one built-in action and two other primitives. The built-in is unsafeInterleaveIO, and the primitives from your data source are "get me the next item" and "am I done yet?".

getContents :: MyDataItem a => DataSource a -> IO [a]
getContents myDataSource = unsafeInterleaveIO $ do
    empty <- amIDoneYet myDataSource
    if empty
      then return []
      else do
        x  <- getMeTheNextItem myDataSource
        xs <- getContents myDataSource
        return (x:xs)

If you're providing this in a library, you have to consider what you should do for your consumer if amIDoneYet or getMeTheNextItem does "the wrong thing", but that's hardly unusual. You can make the implementation complex, so that it prefetches data, handles exceptions, and whatnot, but the basic idea isn't too terribly scary. You'll see this pattern in a number of Haskell libraries (System.IO, Data.ByteString.Lazy, Control.Concurrent.Chan, and more).
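As a concrete instance of that pattern, here is the same recursion specialised to a Handle, using hIsEOF as "am I done yet?" and hGetChar as "get me the next item". This is only a minimal sketch, not the actual System.IO implementation, which does buffered block reads.

import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

lazyHGetContents :: Handle -> IO String
lazyHGetContents h = unsafeInterleaveIO $ do
    eof <- hIsEOF h
    if eof
      then return []
      else do
        c  <- hGetChar h                 -- one character, read on demand
        cs <- lazyHGetContents h         -- rest deferred until forced
        return (c : cs)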

There's a big difference between getContents and Takusen: getContents has a non-trivial implementation (using unsafeInterleaveIO) that allows it to return data lazily. Takusen has no such implementation. I'm not sure if it would be possible. I don't really understand how getContents works; is there any advice or guidelines as to how to use (or abuse) unsafeInterleaveIO? Some googling has found:

http://therning.org/magnus/archives/249
http://www.haskell.org/pipermail/haskell-cafe/2007-January/021373.html
http://www.haskell.org/pipermail/haskell-cafe/2007-January/021407.html
http://haskell.org/haskellwiki/IO_inside#unsafePerformIO_and_unsafeInterleaveIO
I contributed to one of those threads; the code in my message http://www.haskell.org/pipermail/haskell-cafe/2007-January/021382.html has a useful example to compile and play with. And if you want generator co-routines that perform IO (such as with a database):
import Control.Monad.Cont
import System.IO.Unsafe

yield :: a -> ContT [a] IO ()
yield x = mapContT (fmap (x:)) (return ())

unsafeYield :: a -> ContT [a] IO ()
unsafeYield x = mapContT (fmap (x:) . unsafeInterleaveIO) (return ())

execGen :: ContT [a] IO v -> IO [a]
execGen m = m `runContT` \_ -> return []

test :: IO [Integer]
test = execGen $ mapM_ (\x -> liftIO (putStr $ "<" ++ show x ++ ">") >>
                              if even x then unsafeYield x else yield x) [1..]

main = do
    z <- test
    print (take 1 z)
    print (take 2 z)
    print (take 3 z)
    print (take 4 z)
    print (take 5 z)
When run:
<1><2>[1]
[1,2]
[1,2<3><4>,3]
[1,2,3,4]
[1,2,3,4<5><6>,5]
Note that test returns an infinite list of integers, but the even ones are returned lazily with unsafeInterleaveIO. The use of ContT simplifies the control flow, since one can put yield / unsafeYield statements in the middle of other operations. The computation is shown by the <> bracket numbers, and always computes until an even one is reached. In particular, both <5> and <6> are computed when returning 5. So I think this is a reasonable toy model where two numbers are fetched at a time from IO (standing in for a database), but only as the lazy list is demanded.

-- Chris

module Main where

import Data.Char
import System.IO
import System.IO.Unsafe

newtype Stream a = Stream {next :: IO (Maybe (a, Stream a))}

-- Run this "main" (e.g. in GHCI) and type several lines of text.
-- The program ends when a line of text contains 'q' for the second time
main = do
    hSetBuffering stdin NoBuffering
    hSetBuffering stdout NoBuffering
    print "Test of strict"
    opWith =<< strict untilQ
    print "Test of unsafeStrict"
    opWith $ unsafeStrict untilQ
    print "Test of lazy"
    opWith =<< lazy untilQ
    print "Test of unsafeLazy"
    opWith $ unsafeLazy untilQ

-- Shorthand for test above. Processing the input through toUpper
opWith = mapM_ print . lines . map toUpper

untilQ :: Stream Char
untilQ = Stream $ do
    c <- getChar
    if c == 'q'
      then return Nothing
      else return (Just (c, untilQ))

strict :: Stream a -> IO [a]
strict s = do
    mc <- next s
    case mc of
      Nothing      -> return []
      Just (c, s') -> do
        rest <- strict s'
        return (c : rest)

lazy :: Stream a -> IO [a]
lazy s = unsafeInterleaveIO $ do
    mc <- next s
    case mc of
      Nothing      -> return []
      Just (c, s') -> do
        rest <- lazy s'
        return (c : rest)

unsafeStrict :: Stream a -> [a]
unsafeStrict s = unsafePerformIO (strict s)

unsafeLazy :: Stream a -> [a]
unsafeLazy s = unsafePerformIO (lazy s)

Hello Alistair, Friday, March 2, 2007, 5:48:17 PM, you wrote:
sure if it would be possible. I don't really understand how getContents works; is there any advice or guidelines as to how to use (or abuse) unsafeInterleaveIO? Some googling has found:
i think i can explain this dark side of IO :) one can use unsafePerformIO (not unsafeInterleaveIO) to perform i/o operations not in a predefined order but on demand. for example, the following code:

do let c = unsafePerformIO getChar
   do_proc c

will perform the getChar i/o call only when the value of c is really required by the code, i.e. this call will be performed lazily like any usual Haskell computation.

now imagine the following code:

do let s = [unsafePerformIO getChar, unsafePerformIO getChar, unsafePerformIO getChar]
   do_proc s

the three chars inside this list will be computed on demand too, and this means that their values will depend on the order in which they are consumed. that is not what we usually need :)

unsafeInterleaveIO solves this problem - it performs i/o only on demand but allows you to define the exact *internal* execution order for parts of your data structure. it is why i wrote that unsafeInterleaveIO makes an illegal copy of the button :)

first, unsafeInterleaveIO has an (IO a) action as a parameter and returns a value of type 'a':

do str <- unsafeInterleaveIO myGetContents

second, unsafeInterleaveIO doesn't perform any action immediately. it only creates a box of type 'a' which, when this value is requested, will perform the action specified as a parameter.

third, this action by itself may compute the whole value immediately or... use unsafeInterleaveIO again to defer calculation of some sub-components:

myGetContents = do
    c <- getChar
    s <- unsafeInterleaveIO myGetContents
    return (c:s)

this code will be executed only at the moment when the value of str is really demanded. at this moment, getChar will be performed (with the result assigned to c) and one more lazy IO box will be created - for s. this box again contains a link to the myGetContents call. then a list cell is returned that contains the one char read and a link to the myGetContents call as a way to compute the rest of the list. only at the moment when the next value in the list is required will this operation be performed again.

as a final result, we get the inability to read the second char in the list before the first one, but lazy reading of the whole. bingo!

ps: of course, actual code should include EOF checking. also note that you can read many chars/records at each call:

myGetContents = do
    c <- replicateM 512 getChar
    s <- unsafeInterleaveIO myGetContents
    return (c++s)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (7)

- Bayley, Alistair
- Bryan O'Sullivan
- Bulat Ziganshin
- Chris Kuklewicz
- Nicolas Frisby
- Paul Moore
- Stefan O'Rear