populating a bloom filter; stymied by ST monad

I have a potentially very large number of values to feed into a bloom filter. They're coming from IO: getValues :: (v -> k -> k) -> k -> IO k getValues update initial = go initial =<< gen where go v [] = return v go v (f:fs) = do x <- val f go (maybe v (`update` v) x) fs gen = undefined val f = undefined This streams lazily, but if I build a list (getValues (:) []), the laziness is lost; the whole list is constructed and returned before any of it can be used. Which uses too much memory of course. So I can't use Data.BloomFilter.fromListB. Seems I need to do something like this: let filter = newMB (cheapHashes 13) 33554432 --- sized for 1 million items filter' <- unsafeFreezeMB <$> getValues insertMB filter Except this won't work, the mutable bloom filter uses the ST monad. And I cannot see a way to combine the three "MB" functions into a single computation in the same ST monad. Perhaps stToIO to is what I need, but I can't work out how to use it. Help? -- see shy jo

I mostly figured this out. stToIO allows converting (ST s a) to (IO a) and so I can update the bloom filter in IO. I do still wonder if there's a better way, avoiding modifying the filter inside IO and keeping it in ST. -- see shy jo

getValues update initial = go initial =<< gen where go v [] = return v go v (f:fs) = do You say that this stream lazily, so I deduce that gen produce a lazy IO list. So you should be able to use gen in conjunction with easyList to get your bloom filter lazily. I'm not sure what the problem is ? How exactly do you get the elements of your bloom filter from gen input ? -- Jedaï

Chaddaï Fouché wrote:
getValues update initial = go initial =<< gen where go v [] = return v go v (f:fs) = do x <- val f
You say that this stream lazily, so I deduce that gen produce a lazy IO list. So you should be able to use gen in conjunction with easyList to get your bloom filter lazily. I'm not sure what the problem is ? How exactly do you get the elements of your bloom filter from gen input ?
gen produces a lazy list, but it's then transformed using another IO operation. I added the relevant line back above. I did it that way to preserve laziness. An alternate, simpler getvalues suitable for easyList[1] would be the following, but due to the sequencing done by mapM, the list does not stream out lazily. getValues :: IO [v] getValues update initial = mapM val =<< gen -- see shy jo [1] If easyList didn't also destroy laziness by running length, anyway..

Hi,
You can probably use some unsafeInterleaveIO to keep things lazier. Note
that I am not actually suggesting this, an explicit streamy (iteratee,
enumerator, conduit, pipes, ..) solution as suggested by others is probably
superior. However I find that area too complicated for now and waiting for
libraries to settle a bit.
I'll give an example. How to incorporate this into your problem is up to
you, if you ever want to do that.
Try the following:
import Control.Applicative
import System.IO.Unsafe
-- f is a weird function. it prints the result of a computation before
returning it.
f :: Show a => IO a -> IO a
f i = do j <- i; print j; return j
-- let's have an IO [Int] list to use in tests.
xs :: IO [Int]
xs = return [1..10]
ghci> mapM f xs -- prints numbers from 1 to 10, and returns a list: [1..10]
ghci> take 3 <$> mapM f xs -- prints numbers from 1 to 10, and returns a
list: [1..3]
What if we want to be 'lazier', only print those numbers that are in the
output list? Albeit unsafe in certain cases, one way is the following:
g :: Show a => IO a -> IO a
g = unsafeInterleaveIO . f
ghci> take 3 <$> mapM g xs -- prints numbers from 1 to 3 and returns a list
[1..3]
HTH,
Ozgur
On 13 March 2012 15:16, Joey Hess
Chaddaï Fouché wrote:
getValues update initial = go initial =<< gen where go v [] = return v go v (f:fs) = do x <- val f
You say that this stream lazily, so I deduce that gen produce a lazy IO list. So you should be able to use gen in conjunction with easyList to get your bloom filter lazily. I'm not sure what the problem is ? How exactly do you get the elements of your bloom filter from gen input ?
gen produces a lazy list, but it's then transformed using another IO operation. I added the relevant line back above. I did it that way to preserve laziness. An alternate, simpler getvalues suitable for easyList[1] would be the following, but due to the sequencing done by mapM, the list does not stream out lazily.
getValues :: IO [v] getValues update initial = mapM val =<< gen
-- see shy jo
[1] If easyList didn't also destroy laziness by running length, anyway..
participants (3)
-
Chaddaï Fouché
-
Joey Hess
-
Ozgur Akgun