operating on a hundred files at once

I have a series of NxM numeric tables I'm doing a quick mean/variance/t-test etcetera on. The cell t1 [i,j] corresponds exactly to the cells t2..N [i,j], and so it's perfectly possible to read one item at a time from each of the 100 files and compute the mean/variance etcetera on all cells that way. So what I propose to do is something along the lines of: openAndProcess filename = f <- readFile filename return (map (L.split ',') . lines $ f) main = do fs <- getArgs let items = map (map read) . map openAndProcess fs in do print . map (map $ mean) items print . map (map $ variance) items How close am I to doing the right thing here? As I understand it, this will result in one hundred IO [String] instances being returned by the call to (map openAndProcess $ filenames). Do I need to do something special to lift (read), (mean), and (variance), or even (map) into the IO monad so they can process the input as needed? Thanks in advance, -- Jeff

Hello Jefferson, Monday, April 9, 2007, 9:34:12 PM, you wrote: if you have enough memory available, the fastest way is to read file to memory using bytestring, convert it into array of doubles, repeating this step for all files. then perform your computations. if you will try to read 100 files simultaneously, this may lead to extensive disk seeking or cpu cache trashing ... even better, you should read one file, add its values to the accumulators, then read next file...
I have a series of NxM numeric tables I'm doing a quick mean/variance/t-test etcetera on. The cell t1 [i,j] corresponds exactly to the cells t2..N [i,j], and so it's perfectly possible to read one item at a time from each of the 100 files and compute the mean/variance etcetera on all cells that way. So what I propose to do is something along the lines of:
openAndProcess filename = f <- readFile filename return (map (L.split ',') . lines $ f)
main = do fs <- getArgs let items = map (map read) . map openAndProcess fs in do print . map (map $ mean) items print . map (map $ variance) items
How close am I to doing the right thing here? As I understand it, this will result in one hundred IO [String] instances being returned by the call to (map openAndProcess $ filenames). Do I need to do something special to lift (read), (mean), and (variance), or even (map) into the IO monad so they can process the input as needed?
Thanks in advance, -- Jeff
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Thanks for the advice. I'm not so much interested in performance here, as this is just a one-off. Disk thrashing or not, these files are only a few hundred K apiece, and I can't imagine that the whole computation will take more than a few minutes. My question is more about how to deal with the IO monad "pollution" of all the data in a situation where you have N instances of IO [a] at step 1, and you have M computations to perform on those instances, which are all monad-free. -- Jeff On Mon, 2007-04-09 at 22:24 +0400, Bulat Ziganshin wrote:
Hello Jefferson,
Monday, April 9, 2007, 9:34:12 PM, you wrote:
if you have enough memory available, the fastest way is to read file to memory using bytestring, convert it into array of doubles, repeating this step for all files. then perform your computations. if you will try to read 100 files simultaneously, this may lead to extensive disk seeking or cpu cache trashing
... even better, you should read one file, add its values to the accumulators, then read next file...
I have a series of NxM numeric tables I'm doing a quick mean/variance/t-test etcetera on. The cell t1 [i,j] corresponds exactly to the cells t2..N [i,j], and so it's perfectly possible to read one item at a time from each of the 100 files and compute the mean/variance etcetera on all cells that way. So what I propose to do is something along the lines of:
openAndProcess filename = f <- readFile filename return (map (L.split ',') . lines $ f)
main = do fs <- getArgs let items = map (map read) . map openAndProcess fs in do print . map (map $ mean) items print . map (map $ variance) items
How close am I to doing the right thing here? As I understand it, this will result in one hundred IO [String] instances being returned by the call to (map openAndProcess $ filenames). Do I need to do something special to lift (read), (mean), and (variance), or even (map) into the IO monad so they can process the input as needed?
Thanks in advance, -- Jeff
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mon, 2007-04-09 at 14:40 -0400, Jefferson Heard wrote:
Thanks for the advice. I'm not so much interested in performance here, as this is just a one-off. Disk thrashing or not, these files are only a few hundred K apiece, and I can't imagine that the whole computation will take more than a few minutes.
My question is more about how to deal with the IO monad "pollution" of all the data in a situation where you have N instances of IO [a] at step 1, and you have M computations to perform on those instances, which are all monad-free.
Perhaps you want one of these functions: sequence :: Monad m => [m a] -> m [a] sequence_ :: Monad m => [m a] -> m () for example in the case of IO it's: sequence :: [IO a] -> IO [a] sequence_ :: [IO a] -> IO () ie it takes a pure list of IO actions and sticks them together into one IO action, or to put it another way, it performs all the actions in sequence. Is this what you meant? Duncan

It is indeed! Is that to be found in Control.Monad, I take it? On Tue, 2007-04-10 at 08:50 +1000, Duncan Coutts wrote:
On Mon, 2007-04-09 at 14:40 -0400, Jefferson Heard wrote:
Thanks for the advice. I'm not so much interested in performance here, as this is just a one-off. Disk thrashing or not, these files are only a few hundred K apiece, and I can't imagine that the whole computation will take more than a few minutes.
My question is more about how to deal with the IO monad "pollution" of all the data in a situation where you have N instances of IO [a] at step 1, and you have M computations to perform on those instances, which are all monad-free.
Perhaps you want one of these functions:
sequence :: Monad m => [m a] -> m [a]
sequence_ :: Monad m => [m a] -> m ()
for example in the case of IO it's:
sequence :: [IO a] -> IO [a] sequence_ :: [IO a] -> IO ()
ie it takes a pure list of IO actions and sticks them together into one IO action, or to put it another way, it performs all the actions in sequence.
Is this what you meant?
Duncan

On Mon, 2007-04-09 at 21:24 -0400, Jefferson Heard wrote:
It is indeed! Is that to be found in Control.Monad, I take it?
Yes. Other common derivatives in that module include: mapM f as = sequence (map f as) mapM_ f as = sequence_ (map f as) forM_ = flip mapM_ forM = flip mapM however, personally, I think it's better style to use sequence/sequence_ directly since it shows more nicely how the IO is only going on at the top level whereas forM_ starts to feel very imperative. A nice style, imho, is to use sequence and list comprehensions, eg, reformulating your original program: main = do files <- getArgs sequence_ [ print . map (mean . map read . split ',') . lines =<< readFile file | file <- files ] sequence_ [ print . map (variance . map read . split ',') . lines =<< readFile file | file <- files ] Note, that like in your original we read each file twice, once for the mean and once for the variance. In your program you share the action to do the reading, but not the result of the action, so you do execute those file reading actions twice. Let's try fixing that: main = do files <- getArgs items <- sequence [ return . map (map read . split ',') . lines =<< readFile file | file <- files ] sequence_ [ print (map mean item) | item <- items ] sequence_ [ print (map variance item) | item <- items ] The next problem to note is that we read all the files before calculating the info we want from each file. So opening 100 files might be ok but 100,000 might not, so lets refactor again: main = do files <- getArgs mvs <- sequence [ return . calc . map (map read . split ',') . lines =<< readFile file | file <- files ] let (ms, vs) = unzip mvs mapM_ print ms mapM_ print vs calc items = (map mean items, map variance items) However this doesn't quite cut it either since nothing forces the mean and variance to be calculated until the very end where they get printed. This means the file will need to be held open until then too. We need to force the calculation of the mean and variance which will consume the file contents and allow the file to be closed. We could do this by printing the mean and variance out earlier (ie as soon as we've read the file) but that's changing the order of the lines in the output of your program (you print all the means followed by all the variance lines) so instead lets add a bit of strictness: import Control.Exception (evaluate) main = do files <- getArgs mvs <- sequence [ evaluate . calc . map (map read . split ',') . lines =<< readFile file | file <- files ] let (ms, vs) = unzip mvs mapM_ print ms mapM_ print vs calc items = m `seq` v `seq` (m, v) where m = map mean items v = map variance items We've changed two bits. We make sure that calc calculates the two values before returning the pair and we force the evaluation of calc itself using evaluate. Both are necessary. The evaluate action forces a value as an IO action, so we can be sure about when the forcing happens in relation to other IO actions. Duncan

On Tue, 2007-04-10 at 13:16 +1000, Duncan Coutts wrote:
Note, that like in your original we read each file twice, once for the mean and once for the variance.
As an aside, you can calculate both mean and variance in one pass (and constant space) by calculating the sum of elements 'x', the sum of squared elements 'x2', and keeping track of the number of elements 'n'. mean = x/n var = (x2-mean*mean*n)/(n-1) If you track the sum of cubed elements (x3) and the powers of four (x4), you also get kurtosis and skew in a similar manner. -k

Thanks, Ketil. I knew I could calcuate the mean in constant space, but I didn't think about the variance. Much appreciated. On Tue, 2007-04-10 at 08:30 +0200, Ketil Malde wrote:
On Tue, 2007-04-10 at 13:16 +1000, Duncan Coutts wrote:
Note, that like in your original we read each file twice, once for the mean and once for the variance.
As an aside, you can calculate both mean and variance in one pass (and constant space) by calculating the sum of elements 'x', the sum of squared elements 'x2', and keeping track of the number of elements 'n'.
mean = x/n var = (x2-mean*mean*n)/(n-1)
If you track the sum of cubed elements (x3) and the powers of four (x4), you also get kurtosis and skew in a similar manner.
-k
participants (5)
-
Bulat Ziganshin
-
Duncan Coutts
-
Jefferson Heard
-
Ketil Malde
-
Stefan O'Rear