Is it safe to use unsafePerformIO here?

newer
Re: [Haskell-cafe] How to generate...

Cristiano Paris

15 Sep 2009 15 Sep '09

6:36 p.m.

Hi Cafè, I've the following problem: I have a (possibly very long) list of files on disk. Each file contains some metadata at the beginning and continues with a (possibly very large) chunk of data. Now, the program I'm writing can be run in two modes: either read a specific file from the disk and show the whole chunk of data on screen, or read all the files' metadata, sort the file list based on the metadata, and display a summary of those without reading the chunk of data from each file. I've factored out the file access machinery in a single module so as to use it indifferently under the two scenarios. At first, I wrote a piece of code which, in spirit, works like the following reduced case: ------ module Main where import System.IO import Control.Applicative import Data.List import Data.Ord import Debug.Trace data Bit = Bit { index :: Integer, body :: String } readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ trace ("Read body from: " ++ fn) $ return b main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl ---- which is very expressive as it's written in applicative style. Each file is like the following: ---- file1.txt ---- 1 foo ---- I've created a separate IO action for reading the body in the hope that it wouldn't get executed when the file list is sorted. But, to my surprise, it didn't work as the trace message gets written for each file before the summary is displayed. Thinking about this, I came to the conclusion that the IO Monad is enforcing proper IO ordering so that the IO action for file1's body must be executed right before IO action for file2's index one. If this is true, the only solution that came into my mind was to wrap the IO action for reading the body in an unsafePerformIO call. I actually ran the program with this modification and it works properly. So, as using unsafePerformIO is a Great Evil (TM), I'm wondering if there's a different way to do this which doesn't rely on retyping body as an IO action returning a String, which would break my pure code manipulating the files. My opinion is that using unsafePerformIO here is like ensuring the compiler that there're no observable side effects in running the IO action for reading the body and that no other side effects would impact this IO action. Thank you for any thoughts. Cristiano

Show replies by date

Daniel Fischer

15 Sep 15 Sep

7:13 p.m.

Am Dienstag 15 September 2009 20:36:06 schrieb Cristiano Paris:

...

Hi Cafè,

I've the following problem: I have a (possibly very long) list of files on disk. Each file contains some metadata at the beginning and continues with a (possibly very large) chunk of data.

Now, the program I'm writing can be run in two modes: either read a specific file from the disk and show the whole chunk of data on screen, or read all the files' metadata, sort the file list based on the metadata, and display a summary of those without reading the chunk of data from each file. I've factored out the file access machinery in a single module so as to use it indifferently under the two scenarios.

At first, I wrote a piece of code which, in spirit, works like the following reduced case:

------ module Main where

import System.IO import Control.Applicative import Data.List import Data.Ord

import Debug.Trace

data Bit = Bit { index :: Integer, body :: String }

readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ trace ("Read body from: " ++ fn) $ return b

Still, the body should be read lazily. I'm not sure, but the tracing message may be output because of its position. With where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return (trace ("Read body from: " ++ fn) b) there's no tracing output.

...

main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl ----

which is very expressive as it's written in applicative style.

Each file is like the following:

---- file1.txt ---- 1 foo ----

I've created a separate IO action for reading the body in the hope that it wouldn't get executed when the file list is sorted. But, to my surprise, it didn't work as the trace message gets written for each file before the summary is displayed.

Thinking about this, I came to the conclusion that the IO Monad is enforcing proper IO ordering so that the IO action for file1's body must be executed right before IO action for file2's index one.

If this is true, the only solution that came into my mind was to wrap the IO action for reading the body in an unsafePerformIO call. I actually ran the program with this modification and it works properly.

So, as using unsafePerformIO is a Great Evil (TM), I'm wondering if there's a different way to do this which doesn't rely on retyping body as an IO action returning a String, which would break my pure code manipulating the files.

My opinion is that using unsafePerformIO here is like ensuring the compiler that there're no observable side effects in running the IO action for reading the body and that no other side effects would impact this IO action.

Thank you for any thoughts.

Cristiano

Daniel Fischer

7:29 p.m.

Am Dienstag 15 September 2009 21:13:02 schrieb Daniel Fischer:

...

Still, the body should be read lazily. I'm not sure, but the tracing message may be output because of its position.

With

where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return (trace ("Read body from: " ++ fn) b)

there's no tracing output.

Yes, tested with -rw-r--r-- 1 me users 243M 15. Sep 21:17 file1.txt -rw-r--r-- 1 me users 243M 15. Sep 21:18 file2.txt original: ./cparis2 +RTS -Sstderr Alloc Copied Live GC GC TOT TOT Page Flts bytes bytes bytes user elap user elap Read body: file1.txt Read body: file2.txt 2 3 427996 1408 18620 0.00 0.00 0.00 0.00 0 0 (Gen: 1) 4096 0.00 0.00 432,092 bytes allocated in the heap 1,408 bytes copied during GC 18,620 bytes maximum residency (1 sample(s)) 22,340 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.00s ( 0.00s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.00s ( 0.00s elapsed) %GC time 0.0% (8.3% elapsed) Alloc rate 432,092,000,000 bytes per MUT second Productivity 100.0% of total user, 0.1% of total elapsed moved trace: ./CParis +RTS -Sstderr Alloc Copied Live GC GC TOT TOT Page Flts bytes bytes bytes user elap user elap 2 3 426100 1408 18476 0.00 0.00 0.00 0.00 0 0 (Gen: 1) 4096 0.00 0.00 430,196 bytes allocated in the heap 1,408 bytes copied during GC 18,476 bytes maximum residency (1 sample(s)) 22,484 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.00s ( 0.00s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.00s ( 0.00s elapsed) %GC time 0.0% (9.3% elapsed) Alloc rate 107,549,000 bytes per MUT second Productivity 100.0% of total user, 230.3% of total elapsed

Cristiano Paris

8:08 p.m.

On Tue, Sep 15, 2009 at 9:29 PM, Daniel Fischer wrote:

...

Am Dienstag 15 September 2009 21:13:02 schrieb Daniel Fischer:

...
Still, the body should be read lazily. I'm not sure, but the tracing message may be output because of its position.

With

where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return (trace ("Read body from: " ++ fn) b)

there's no tracing output.

Yes, tested with -rw-r--r-- 1 me users 243M 15. Sep 21:17 file1.txt -rw-r--r-- 1 me users 243M 15. Sep 21:18 file2.txt

Ok, Daniel, I got the point: the IO action gets performed but there's no need to use unsafePerformIO as hGetContents is already lazy and the IO action is "ineffective" anyway when the result is not needed. Yet, I'm still confused as "seq b" should force the complete execution of hGetContents. So I decided to run a different test: I'm using this code: --- module Main where import System.IO import System.IO.Unsafe import Control.Applicative import Data.List import Data.Ord import Debug.Trace data Bit = Bit { index :: Integer, body :: String } readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = trace "In readBody" $ withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return $ trace ("Read body from: " ++ fn) b main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl putStrLn $ body $ head bl ---- (Hope this looks better than before). I ran this on a 115KB-long file1.txt file and traced with dtruss (OSX strace equivalent). You know what? Only the first 1024 bytes of file1.txt are read and actually displayed. So, it seems that "seq b" is completely ineffective and program is not correct. Cristiano

Cristiano Paris

8:11 p.m.

On Tue, Sep 15, 2009 at 10:08 PM, Cristiano Paris wrote:

...

... So, it seems that "seq b" is completely ineffective and program is not correct.

Correction: removing "seq b" results in nothing being displayed :) So, it's not "completely" effective. I suspect this is related to the fact that a String in Haskell is just a list of Char so we should use seq on every element of b. Let me try... Cristiano

Cristiano Paris

8:25 p.m.

On Tue, Sep 15, 2009 at 10:11 PM, Cristiano Paris wrote:

...

On Tue, Sep 15, 2009 at 10:08 PM, Cristiano Paris wrote:

...
... So, it seems that "seq b" is completely ineffective and program is not correct.

Correction: removing "seq b" results in nothing being displayed :)

So, it's not "completely" effective. I suspect this is related to the fact that a String in Haskell is just a list of Char so we should use seq on every element of b. Let me try...

Now it works as expected: --- module Main where import System.IO import System.IO.Unsafe import Control.Applicative import Data.List import Data.Ord import Debug.Trace data Bit = Bit { index :: Integer, body :: String } readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = trace "In readBody" $ withFile fn ReadMode $ \h -> do b <- hGetContents h let b' = foldr (\e a -> seq e (a ++ [e])) [] b seq b' $ return $ trace ("Read body from: " ++ fn) b' main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl putStrLn $ body $ head bl ---- Two points: 1 - I had to cut off file1.txt to be just above 1024 bytes otherwise the program becomes extremely slow even on a 100KB file with a line being output every 5 seconds and with my CPU being completely busy (I'm using a modern MacBook Pro). 2 - Omitting the last line in my program actually causes the body to be completely read even if it's not used: this is consistent with my hypotesis on seq which now works properly. :) Cristiano

Cristiano Paris

8:30 p.m.

On Tue, Sep 15, 2009 at 10:25 PM, Cristiano Paris wrote:

...

... Two points:

1 - I had to cut off file1.txt to be just above 1024 bytes otherwise the program becomes extremely slow even on a 100KB file with a line being output every 5 seconds and with my CPU being completely busy (I'm using a modern MacBook Pro).

GC in charge here. Using "foldl" resolves the problem. Maybe too many thunks are being created on the stack... Cristiano

Daniel Fischer

8:42 p.m.

Am Dienstag 15 September 2009 22:25:31 schrieb Cristiano Paris:

...

On Tue, Sep 15, 2009 at 10:11 PM, Cristiano Paris wrote:

...
On Tue, Sep 15, 2009 at 10:08 PM, Cristiano Paris wrote:

...
... So, it seems that "seq b" is completely ineffective and program is not correct.

Correction: removing "seq b" results in nothing being displayed :)

So, it's not "completely" effective. I suspect this is related to the fact that a String in Haskell is just a list of Char so we should use seq on every element of b. Let me try...

Now it works as expected:

--- module Main where

import System.IO import System.IO.Unsafe import Control.Applicative import Data.List import Data.Ord

import Debug.Trace

data Bit = Bit { index :: Integer, body :: String }

readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = trace "In readBody" $ withFile fn ReadMode $ \h -> do b <- hGetContents h let b' = foldr (\e a -> seq e (a ++ [e])) [] b

Aaawww. let b' = length b or let b' = foldl' seq () b or let b' = b `using` rnf if you want to force the whole file to be read. But then you should definitely be using ByteStrings.

...

seq b' $ return $ trace ("Read body from: " ++ fn) b'

main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl putStrLn $ body $ head bl ----

Two points:

1 - I had to cut off file1.txt to be just above 1024 bytes otherwise the program becomes extremely slow even on a 100KB file with a line being output every 5 seconds and with my CPU being completely busy (I'm using a modern MacBook Pro).

2 - Omitting the last line in my program actually causes the body to be completely read even if it's not used: this is consistent with my hypotesis on seq which now works properly.

:)

Cristiano

Cristiano Paris

9 p.m.

On Tue, Sep 15, 2009 at 10:42 PM, Daniel Fischer wrote:

...

.... Aaawww. let b' = length b or let b' = foldl' seq () b or let b' = b `using` rnf

if you want to force the whole file to be read. But then you should definitely be using ByteStrings.

Yep. But that doesn't solve the original problem of not reading the body at all when not needed. Unless using unsafePerformIO. Cristiano

Daniel Fischer

9:31 p.m.

Am Dienstag 15 September 2009 23:00:40 schrieb Cristiano Paris:

...

On Tue, Sep 15, 2009 at 10:42 PM, Daniel Fischer

wrote:

...
.... Aaawww. let b' = length b or let b' = foldl' seq () b or let b' = b `using` rnf

if you want to force the whole file to be read. But then you should definitely be using ByteStrings.

Yep. But that doesn't solve the original problem of not reading the body at all when not needed. Unless using unsafePerformIO.

Yeah, you do *not* want the whole file to be read here, except above for testing purposes. Still, ByteStrings are probably the better choice (if you want the body and that can be large). To avoid reading the body without unsafePerformIO: readBit fn = Control.Exception.bracket (openFile fn ReadMode) hClose (\h -> do l <- hGetLine h let i = read l bdy <- hGetContents h return $ Bit i bdy)

...

Cristiano

Cristiano Paris

17 Sep 17 Sep

7:07 p.m.

On Tue, Sep 15, 2009 at 11:31 PM, Daniel Fischer wrote:

...

... Yeah, you do *not* want the whole file to be read here, except above for testing purposes.

That's not true. Sometimes I want to, sometimes don't. But I want to use the same code for reading files and exploit laziness to avoid reading the body.

...

Still, ByteStrings are probably the better choice (if you want the body and that can be large).

That's not a problem by now.

...

To avoid reading the body without unsafePerformIO:

readBit fn = Control.Exception.bracket (openFile fn ReadMode) hClose (\h -> do l <- hGetLine h let i = read l bdy <- hGetContents h return $ Bit i bdy)

Same problem with the "withFile"-version: nothing gets printed if I try to print out the body: that's way I used seq. I'm starting to think that the only way to do this without using unsafePerformIO is to have the body being an IO action: simply, under Haskell assumption, that's not possible to write, because Haskell enforce safety above all. Cristiano

Daniel Fischer

8:01 p.m.

Am Donnerstag 17 September 2009 21:07:28 schrieb Cristiano Paris:

...

On Tue, Sep 15, 2009 at 11:31 PM, Daniel Fischer

wrote:

...
... Yeah, you do *not* want the whole file to be read here, except above for testing purposes.

That's not true. Sometimes I want to, sometimes don't.

The "for the case of sorting by metadata" was tacitly assumed :)

...

But I want to use the same code for reading files and exploit laziness to avoid reading the body.

...
Still, ByteStrings are probably the better choice (if you want the body and that can be large).

That's not a problem by now.

...
To avoid reading the body without unsafePerformIO:

readBit fn = Control.Exception.bracket (openFile fn ReadMode) hClose (\h -> do l <- hGetLine h let i = read l bdy <- hGetContents h return $ Bit i bdy)

Same problem with the "withFile"-version: nothing gets printed if I try to print out the body: that's way I used seq.

Ah, yes. The file is closed too soon.

...

I'm starting to think that the only way to do this without using unsafePerformIO is to have the body being an IO action: simply, under Haskell assumption, that's not possible to write, because Haskell enforce safety above all.

Well, what about readBit fn = do txt <- readFile fn let (l,_:bdy) = span (/= '\n') txt return $ Bit (read l) bdy ? With main = do args <- getArgs let n = case args of (a:_) -> read a _ -> 1000 bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl mapM_ (putStrLn . take 20 . drop n . body) bl ./cparis3 30 +RTS -sstderr 2 3 CCGGGCGCGGTGGCTCACGC CCGGGCGCGGTGGCTCACGC 408,320 bytes allocated in the heap 1,220 bytes copied during GC 34,440 bytes maximum residency (1 sample(s)) 31,096 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) ./cparis3 20000 +RTS -sstderr 2 3 AAAATTAGCCGGGCGTGGTG AAAATTAGCCGGGCGTGGTG 1,069,168 bytes allocated in the heap 105,700 bytes copied during GC 137,356 bytes maximum residency (1 sample(s)) 27,344 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) ./cparis3 2000000 +RTS -sstderr 2 3 CCTGGCCAACATGGTGAAAC CCTGGCCAACATGGTGAAAC 80,939,296 bytes allocated in the heap 8,925,240 bytes copied during GC 137,056 bytes maximum residency (2 sample(s)) 45,528 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) %GC time 38.5% (27.0% elapsed) Alloc rate 1,264,577,704 bytes per MUT second Productivity 61.5% of total user, 38.8% of total elapsed ./cparis3 20000000 +RTS -sstderr 2 3 CAGAGCGAGACTCCGTCTCA CAGAGCGAGACTCCGTCTCA 806,034,756 bytes allocated in the heap 76,775,944 bytes copied during GC 136,876 bytes maximum residency (2 sample(s)) 43,324 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 1536 collections, 0 parallel, 0.35s, 0.35s elapsed Generation 1: 2 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.53s ( 0.67s elapsed) GC time 0.35s ( 0.36s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.88s ( 1.02s elapsed) %GC time 40.0% (34.9% elapsed) Alloc rate 1,526,482,681 bytes per MUT second Productivity 60.0% of total user, 51.7% of total elapsed Seems to work as desired.

...

Cristiano

Cristiano Paris

8:20 p.m.

On Thu, Sep 17, 2009 at 10:01 PM, Daniel Fischer wrote:

...

... readBit fn = do txt <- readFile fn let (l,_:bdy) = span (/= '\n') txt return $ Bit (read l) bdy

?

With

main = do args <- getArgs let n = case args of (a:_) -> read a _ -> 1000 bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl mapM_ (putStrLn . take 20 . drop n . body) bl

Yes, it *seems* to work but... the files don't get closed (readFile is unfinished until body is read) so I think I'm going to have problems when the number of files to read is higher than the maximum number of open handles a process can have. That's a possibility I considered even if not directly using readFile. Cristiano

Daniel Fischer

10:41 p.m.

Am Donnerstag 17 September 2009 22:20:55 schrieb Cristiano Paris:

...

On Thu, Sep 17, 2009 at 10:01 PM, Daniel Fischer

wrote:

...
... readBit fn = do txt <- readFile fn let (l,_:bdy) = span (/= '\n') txt return $ Bit (read l) bdy

?

With

main = do args <- getArgs let n = case args of (a:_) -> read a _ -> 1000 bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl mapM_ (putStrLn . take 20 . drop n . body) bl

Yes, it *seems* to work but... the files don't get closed (readFile is unfinished until body is read) so I think I'm going to have problems when the number of files to read is higher than the maximum number of open handles a process can have.

Indeed. If the number of files is large, reading lazily with readFile is not so good. Eat the cake and have it. If you have a lot of files, want to read the metadata of all, select a (much) smaller number of files by some criterion on the set of metadata and then read the body of the selected files, it's hairy. Reading all bodies immediately is probably out due to memory restrictions. The clean approach would be to separate the reading of metadata and body. The drawback is that then you have a second entry into IO. Using unsafePerformIO, you can pretend that you don't reenter IO. Whether that is safe in your situation, I don't know. Probably not (rule of thumb: all nontrivial actions wrapped in unsafePerformIO aren't safe, though chances aren't bad that it works most of the time).

...

That's a possibility I considered even if not directly using readFile.

Cristiano

Ryan Ingram

18 Sep 18 Sep

2:06 a.m.

I am confused about why this thread is talking about unsafePerformIO at all. It seems like everything you all want to do can be accomplished with the much less evil unsafeInterleaveIO instead. (Which is still a bit evil; but it's the difference between stealing cookies from the cookie jar and committing genocide) I wrote this function recently for a quick'n'dirty script:

...

readFiles :: [FilePath] -> String readFiles [] = return "" readFiles (f:fs) = do f_data <- readFile f rest <- unsafeInterleaveIO (readFiles fs) return (f_data ++ rest)

It lazily reads from many files and concatenates all the input. But I probably wouldn't use it in a serious application. -- ryan On Thu, Sep 17, 2009 at 3:41 PM, Daniel Fischer wrote:

...

Am Donnerstag 17 September 2009 22:20:55 schrieb Cristiano Paris:

...
On Thu, Sep 17, 2009 at 10:01 PM, Daniel Fischer

wrote:

...
... readBit fn = do txt <- readFile fn let (l,_:bdy) = span (/= '\n') txt return $ Bit (read l) bdy

?

With

main = do args <- getArgs let n = case args of (a:_) -> read a _ -> 1000 bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl mapM_ (putStrLn . take 20 . drop n . body) bl

Yes, it *seems* to work but... the files don't get closed (readFile is unfinished until body is read) so I think I'm going to have problems when the number of files to read is higher than the maximum number of open handles a process can have.

Indeed. If the number of files is large, reading lazily with readFile is not so good. Eat the cake and have it. If you have a lot of files, want to read the metadata of all, select a (much) smaller number of files by some criterion on the set of metadata and then read the body of the selected files, it's hairy. Reading all bodies immediately is probably out due to memory restrictions. The clean approach would be to separate the reading of metadata and body. The drawback is that then you have a second entry into IO. Using unsafePerformIO, you can pretend that you don't reenter IO. Whether that is safe in your situation, I don't know. Probably not (rule of thumb: all nontrivial actions wrapped in unsafePerformIO aren't safe, though chances aren't bad that it works most of the time).

...
That's a possibility I considered even if not directly using readFile.

Cristiano

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Daniel Fischer

3:15 a.m.

Am Freitag 18 September 2009 04:06:11 schrieb Ryan Ingram:

...

I am confused about why this thread is talking about unsafePerformIO at all. It seems like everything you all want to do can be accomplished with the much less evil unsafeInterleaveIO instead. (Which is still a bit evil; but it's the difference between stealing cookies from the cookie jar and committing genocide)

I find that remark in rather bad taste.

...

I wrote this function recently for a quick'n'dirty script:

...
readFiles :: [FilePath] -> String readFiles [] = return "" readFiles (f:fs) = do f_data <- readFile f rest <- unsafeInterleaveIO (readFiles fs) return (f_data ++ rest)

It lazily reads from many files and concatenates all the input. But I probably wouldn't use it in a serious application.

-- ryan

But that does something completely different from what Cristiano wants to do. He wants to read many files files quasi-parallel. As far as I can tell, he needs to read a small chunk from the beginning of every file, then, depending on what he got from that, he needs to read the rest of some files. If he reads all the files lazily, he (maybe) runs into the open file limit (a semi-closed handle is still open from the OS' point of view, isn't it?). So he has to close the first files before he opens the Nth. But what if later he finds out that he has to read the body of a previously closed file? I would separate the reading of headers and bodies, reopening the files whose body is needed, for some (maybe compelling) reason he wants to do it differently.

Cristiano Paris

8:56 a.m.

On Fri, Sep 18, 2009 at 5:15 AM, Daniel Fischer wrote:

...

... But that does something completely different from what Cristiano wants to do. He wants to read many files files quasi-parallel. As far as I can tell, he needs to read a small chunk from the beginning of every file, then, depending on what he got from that, he needs to read the rest of some files. If he reads all the files lazily, he (maybe) runs into the open file limit (a semi-closed handle is still open from the OS' point of view, isn't it?).

Not quite. In one case I want to dumbly read all the files' metadata and printing them out sorted: reading the next file doesn't depend on what happened previously so they could be read in parallel indeed. In the other case, it reads a specific file completely and print everything out. In both cases, I want to use the same code for reading, exploiting laziness in order not to read the body of files if only metadata are requested.

...

I would separate the reading of headers and bodies, reopening the files whose body is needed, for some (maybe compelling) reason he wants to do it differently.

Yes, that's the way Haskell forces you to do that as it's the only way for you to go safe. But, if you know more about your code, you can use unsafe(Perform|Interleave)IO to assure the compiler that everything's right. Cristiano

Ben Franksen

19 Sep 19 Sep

4:53 p.m.

Cristiano Paris wrote:

...

Daniel Fischer wrote:

...
I would separate the reading of headers and bodies, reopening the files whose body is needed, for some (maybe compelling) reason he wants to do it differently.

Yes, that's the way Haskell forces you to do that as it's the only way for you to go safe.

I don't think it has anything to do with Haskell. How would you do this in C? You'd pass a flag indicating whether to read the whole file or just the header. You can do the same in Haskell, of course, no lazy IO needed. The body remains undefined if the flag indicates header only. Even better wrap the body in a Maybe.

...

But, if you know more about your code, you can use unsafe(Perform|Interleave)IO to assure the compiler that everything's right.

I have a hard time believing this is possible, if you demand that the files should not stay opened indefinitely. How is the runtime supposed to know whether to close the file or not? What you /can/ do is use unsafePerformIO to lazily re-open, read the body, and close the file, as soon as the body gets demanded. However, this is ugly and not advised. Cheers Ben

Cristiano Paris

20 Sep 20 Sep

9:35 a.m.

On Sat, Sep 19, 2009 at 6:53 PM, Ben Franksen wrote:

...

Cristiano Paris wrote:

...
Daniel Fischer wrote:

...
I would separate the reading of headers and bodies, reopening the files whose body is needed, for some (maybe compelling) reason he wants to do it differently.

Yes, that's the way Haskell forces you to do that as it's the only way for you to go safe.

I don't think it has anything to do with Haskell.

My sentence was to be understood in a positive way. I think this has a lot to do with Haskell, as it forbids you to write certain kinds of programs: this is a distinguished feature of all the static typed languages and more specifically in Haskell, which has a very expressive type system. In particular, you can't have a program that runs IO "out-of-order": if you really want to do that, you must use unsafe(Perform|Interleave)IO which is like cheating, in a way. Indeed, the only way to do that not using unsafePerformIO is to have a two stage reading, either making the body an IO action or having an intermediate data structure which represents a file whose body has not been read yet: in general this is safer, unless you know something about your program and can assure the compiler it won't behave badly with respect to side effects.

...

How would you do this in C? You'd pass a flag indicating whether to read the whole file or just the header. You can do the same in Haskell, of course, no lazy IO needed. The body remains undefined if the flag indicates header only. Even better wrap the body in a Maybe.

The difference is in the expressivity of the type system. In C I may use a NULL pointer indicating the body has not been read yet, but then the compiler won't enforce good uses of that pointer and can't assure you that in no way a NULL pointer would be ever dereferenced, leaving room for untested, uncaught bugs. In Haskell this is simply not possible.

...

... I have a hard time believing this is possible, if you demand that the files should not stay opened indefinitely. How is the runtime supposed to know whether to close the file or not? What you /can/ do is use unsafePerformIO to lazily re-open, read the body, and close the file, as soon as the body gets demanded. However, this is ugly and not advised.

Here in Cafè once I had a discussion about when is safe and advisable to use unsafePerformIO. I don't think this function is evil per-se and indeed THERE ARE situations where it's more elegant and clear implementing things using unsafePerformIO. I see it as way to tell the compiler "don't mind: I know what I'm doing". This is certainly true when using FFI to implement in C pure functions, but it's true in other situations, like mine, in which using unsafeInterleaveIO allows me to write my code cleanly and easily, separating IO from processing, and avoid having to read the files in a two-stage way. Of course, I can be wrong and I can't look forward to hearing from the wise people and argument againts my point.

...

Cheers Ben

Thanks. Cristiano

Cristiano Paris

18 Sep 18 Sep

8:49 a.m.

On Fri, Sep 18, 2009 at 4:06 AM, Ryan Ingram wrote:

...

I am confused about why this thread is talking about unsafePerformIO at all. It seems like everything you all want to do can be accomplished with the much less evil unsafeInterleaveIO instead. (Which is still a bit evil; but it's the difference between stealing cookies from the cookie jar and committing genocide)

I didn't read about unsafeIntervleaveIO but yesterday I rewrote my code exactly as you did here, and it allowed me to use the applicative style... Cristiano

Ross Mellgren

15 Sep 15 Sep

8:12 p.m.

Wouldn't seq b only force (at minimum) the first character of the file? -Ross On Sep 15, 2009, at 4:08 PM, Cristiano Paris wrote:

...

On Tue, Sep 15, 2009 at 9:29 PM, Daniel Fischer wrote:

...
Am Dienstag 15 September 2009 21:13:02 schrieb Daniel Fischer:

...
Still, the body should be read lazily. I'm not sure, but the tracing message may be output because of its position.

With

where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return (trace ("Read body from: " ++ fn) b)

there's no tracing output.

Yes, tested with -rw-r--r-- 1 me users 243M 15. Sep 21:17 file1.txt -rw-r--r-- 1 me users 243M 15. Sep 21:18 file2.txt

Ok, Daniel, I got the point: the IO action gets performed but there's no need to use unsafePerformIO as hGetContents is already lazy and the IO action is "ineffective" anyway when the result is not needed. Yet, I'm still confused as "seq b" should force the complete execution of hGetContents. So I decided to run a different test:

I'm using this code:

--- module Main where

import System.IO import System.IO.Unsafe import Control.Applicative import Data.List import Data.Ord

import Debug.Trace

data Bit = Bit { index :: Integer, body :: String }

readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = trace "In readBody" $ withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return $ trace ("Read body from: " ++ fn) b

main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl putStrLn $ body $ head bl ----

(Hope this looks better than before).

I ran this on a 115KB-long file1.txt file and traced with dtruss (OSX strace equivalent). You know what? Only the first 1024 bytes of file1.txt are read and actually displayed.

So, it seems that "seq b" is completely ineffective and program is not correct.

Cristiano _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Cristiano Paris

8:17 p.m.

On Tue, Sep 15, 2009 at 10:12 PM, Ross Mellgren wrote:

...

Wouldn't seq b only force (at minimum) the first character of the file?

I think it force the evaluation of the "Cons" in the String but not the characters therein. Cristiano

Ross Mellgren

8:20 p.m.

Ah yeah, that too. Control.Parallel.Strategies.rnf to the rescue? -Ross On Sep 15, 2009, at 4:17 PM, Cristiano Paris wrote:

...

On Tue, Sep 15, 2009 at 10:12 PM, Ross Mellgren wrote:

...
Wouldn't seq b only force (at minimum) the first character of the file?

I think it force the evaluation of the "Cons" in the String but not the characters therein.

Cristiano

Ross Mellgren

8:20 p.m.

Ack, IGNORE ME! Way too strict. -Ross On Sep 15, 2009, at 4:20 PM, Ross Mellgren wrote:

...

Ah yeah, that too. Control.Parallel.Strategies.rnf to the rescue?

-Ross

On Sep 15, 2009, at 4:17 PM, Cristiano Paris wrote:

...
On Tue, Sep 15, 2009 at 10:12 PM, Ross Mellgren wrote:

...
Wouldn't seq b only force (at minimum) the first character of the file?

I think it force the evaluation of the "Cons" in the String but not the characters therein.

Cristiano

Cristiano Paris

8:27 p.m.

On Tue, Sep 15, 2009 at 10:20 PM, Ross Mellgren wrote:

...

Ack, IGNORE ME! Way too strict.

Oh, well, I used foldr+seq to achieve the same result... I think, but I think that, if this is the solution, I'll use rnf as I did on other occasions. Thanks. Cristiano

Daniel Fischer

8:29 p.m.

Am Dienstag 15 September 2009 22:17:00 schrieb Cristiano Paris:

...

On Tue, Sep 15, 2009 at 10:12 PM, Ross Mellgren wrote:

...
Wouldn't seq b only force (at minimum) the first character of the file?

I think it force the evaluation of the "Cons" in the String but not the characters therein.

It evaluates the String far enough to know whether it's "" or (_:_), that is, to weak head normal form. It doesn't look at any character, but it forces at least one character to be read from the file.

...

Cristiano

Cristiano Paris

8:35 p.m.

On Tue, Sep 15, 2009 at 10:29 PM, Daniel Fischer wrote:

...

... It evaluates the String far enough to know whether it's "" or (_:_), that is, to weak head normal form. It doesn't look at any character, but it forces at least one character to be read from the file.

Yep, the head. That explains why only the first 1024 bytes are read and displayed... Cristiano

Cristiano Paris

7:29 p.m.

On Tue, Sep 15, 2009 at 9:13 PM, Daniel Fischer wrote:

...

... Still, the body should be read lazily. I'm not sure, but the tracing message may be output because of its position.

With

where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return (trace ("Read body from: " ++ fn) b)

there's no tracing output.

Impressive as I can't spot the difference. But my question is: is the chunk of data actually read or not? Consider the following code: ---- readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = trace "In readBody" $ withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ return $ trace ("Read body from: " ++ fn) b ---- The message "In readBody" gets written, while "Read body from..." don't. So? I think that the IO action is actually performed but the result is simply discarded. Am I right? Cristiano

Svein Ove Aas

7:16 p.m.

I have a number of suggestions, some of which conflict with each other, so I'll just throw them out here. Let's see.. First off, the IO monad does indeed enforce sequencing; that's its primary purpose. However, you can ask for it to run I/O out of order, specifically when the value the out-of-order action returns is actually forced (used); that's lazy I/O, and is implemented using unsafeInterleaveIO. You would not usually use unsafeInterleaveIO directly, though. Instead, you'd use an existing wrapper, such as hGetContents. (for Strings, or lazy bytestrings; the strict bytestring variant reasonably has a strict semantics) One thing to keep in mind about lazy I/O is that the I/O in question can run at any arbitrary time, or not at all; not more than once, though. You must make sure this is safe. For file input, that basically means the file should not change during the program's lifetime. hGetLine is not lazy in this way, but the hGetContents you use is. I'm not sure whether this means your program should work as-is, and I'm not going to examine it closely enough to tell - as you mentioned it's a mockup anyway. Besides.. Strings are also *slow*. What you want for I/O is, when reasonably possible, bytestrings. You'd then use parsec-bytestring, or if possible Data.Binary, to parse said bytestring; the latter is faster (..probably), if more limited in function. You could use the lazy bytestring hGetContents for this. However... There is also a bytestring-mmap package on hackage, which outsources the decision of what blocks to load into memory to the OS, and has the best performance overall. Use this. Oh. And unsafePerformIO is a trap that will kill you. See http://www.girlgeniusonline.com/comic.php?date=20070725 for details. On Tue, Sep 15, 2009 at 8:36 PM, Cristiano Paris wrote:

...

Hi Cafè,

I've the following problem: I have a (possibly very long) list of files on disk. Each file contains some metadata at the beginning and continues with a (possibly very large) chunk of data.

Now, the program I'm writing can be run in two modes: either read a specific file from the disk and show the whole chunk of data on screen, or read all the files' metadata, sort the file list based on the metadata, and display a summary of those without reading the chunk of data from each file. I've factored out the file access machinery in a single module so as to use it indifferently under the two scenarios.

At first, I wrote a piece of code which, in spirit, works like the following reduced case:

------ module Main where

import System.IO import Control.Applicative import Data.List import Data.Ord

import Debug.Trace

data Bit = Bit { index :: Integer, body :: String }

readBit fn = withFile fn ReadMode $ \h -> Bit <$> (hGetLine h >>= return . read) <*> readBody where readBody = withFile fn ReadMode $ \h -> do b <- hGetContents h seq b $ trace ("Read body from: " ++ fn) $ return b

main = do bl <- mapM readBit ["file1.txt","file2.txt"] mapM_ (putStrLn . show . index) $ sortBy (comparing index) bl ----

which is very expressive as it's written in applicative style.

Each file is like the following:

---- file1.txt ---- 1 foo ----

I've created a separate IO action for reading the body in the hope that it wouldn't get executed when the file list is sorted. But, to my surprise, it didn't work as the trace message gets written for each file before the summary is displayed.

Thinking about this, I came to the conclusion that the IO Monad is enforcing proper IO ordering so that the IO action for file1's body must be executed right before IO action for file2's index one.

If this is true, the only solution that came into my mind was to wrap the IO action for reading the body in an unsafePerformIO call. I actually ran the program with this modification and it works properly.

So, as using unsafePerformIO is a Great Evil (TM), I'm wondering if there's a different way to do this which doesn't rely on retyping body as an IO action returning a String, which would break my pure code manipulating the files.

My opinion is that using unsafePerformIO here is like ensuring the compiler that there're no observable side effects in running the IO action for reading the body and that no other side effects would impact this IO action.

Thank you for any thoughts.

Cristiano _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Svein Ove Aas

Cristiano Paris

7:34 p.m.

On Tue, Sep 15, 2009 at 9:16 PM, Svein Ove Aas wrote:

...

I have a number of suggestions, some of which conflict with each other, so I'll just throw them out here. Let's see..

...

First off, the IO monad does indeed enforce sequencing; that's its primary purpose.

So, unsafePerformIO can be thought as an escape to this strict rule.

...

However, you can ask for it to run I/O out of order, specifically when the value the out-of-order action returns is actually forced (used); that's lazy I/O, and is implemented using unsafeInterleaveIO.

I imagined that. But I hate to have to use unsafePerformIO when hGetContets is already using it.

...

You would not usually use unsafeInterleaveIO directly, though.

That's the point.

...

Instead, you'd use an existing wrapper, such as hGetContents. (for Strings, or lazy bytestrings; the strict bytestring variant reasonably has a strict semantics)

One thing to keep in mind about lazy I/O is that the I/O in question can run at any arbitrary time, or not at all; not more than once, though. You must make sure this is safe. For file input, that basically means the file should not change during the program's lifetime.

Ok.

...

hGetLine is not lazy in this way, but the hGetContents you use is. I'm not sure whether this means your program should work as-is, and I'm not going to examine it closely enough to tell - as you mentioned it's a mockup anyway. Besides..

Strings are also *slow*. What you want for I/O is, when reasonably possible, bytestrings. You'd then use parsec-bytestring, or if possible Data.Binary, to parse said bytestring; the latter is faster (..probably), if more limited in function.

Yes, that was only a first attempt, kind of prototype...

...

You could use the lazy bytestring hGetContents for this. However...

There is also a bytestring-mmap package on hackage, which outsources the decision of what blocks to load into memory to the OS, and has the best performance overall. Use this.

Oh. And unsafePerformIO is a trap that will kill you. See http://www.girlgeniusonline.com/comic.php?date=20070725 for details.

Thank you for the link. Cristiano

Svein Ove Aas

7:39 p.m.

As a general sort of warning, do not use hGetContents (or lazy i/o, in general) in combination with withFile. withFile closes the handle when the program lexically exits its scope. However, when using hGetContents, the file contents will not be read until after you do this, and will therefore fail to be read at all; I'm not sure whether this will produce a truncated string or an exception. Instead, use openFile directly. Handles have (ISTR) finalizers on them, and so should be automatically closed if you lose hold of one.. eventually. getContents of course closes it once it hits EOF, but that isn't exactly reliable. If that isn't satisfactory, use strict I/O. It's less convenient, but it's also easier to reason about. -- Svein Ove Aas

Cristiano Paris

7:41 p.m.

On Tue, Sep 15, 2009 at 9:39 PM, Svein Ove Aas wrote:

...

As a general sort of warning, do not use hGetContents (or lazy i/o, in general) in combination with withFile.

withFile closes the handle when the program lexically exits its scope. However, when using hGetContents, the file contents will not be read until after you do this, and will therefore fail to be read at all; I'm not sure whether this will produce a truncated string or an exception.

A truncated string. I already encountered such a scenario.

...

Instead, use openFile directly. Handles have (ISTR) finalizers on them, and so should be automatically closed if you lose hold of one.. eventually. getContents of course closes it once it hits EOF, but that isn't exactly reliable.

If that isn't satisfactory, use strict I/O. It's less convenient, but it's also easier to reason about.

Thank you. Cristiano

5769

Age (days ago)

5774

Last active (days ago)

List overview

Download

31 comments

7 participants

participants (7)

Ben Franksen
Cristiano Paris
Cristiano Paris
Daniel Fischer
Ross Mellgren
Ryan Ingram
Svein Ove Aas