How to select n random words from a file ...

Hi, I'm clearly new to haskell, and I suppose those is a basic question, but after a bit of searching I've been unable to figure out the "best" way to do this. I was first trying to find out how to, say, get a random element from a list, but I'm starting to think that may not be the best way (because list access isn't constant time, among other reasons). So, I wonder if there is a more direct way to just get a random word from a wordfile (i.e. one word per line) ... can anyone suggest a method? Ideally I'd like to be able to adapt this to get n random words from several files (i.e. randomly select n words from all files; i.e. just read them into one merged array and choose from there?) Thanks for any advice ... -- Noon Silk Fancy a quantum lunch? https://sites.google.com/site/quantumlunch/ "Every morning when I wake up, I experience an exquisite joy — the joy of being this signature."

Hi,
Look up "reservoir sampling", it will most probably help.
On Sun, Jun 10, 2012 at 6:21 AM, Noon Silk
Hi,
I'm clearly new to haskell, and I suppose those is a basic question, but after a bit of searching I've been unable to figure out the "best" way to do this. I was first trying to find out how to, say, get a random element from a list, but I'm starting to think that may not be the best way (because list access isn't constant time, among other reasons). So, I wonder if there is a more direct way to just get a random word from a wordfile (i.e. one word per line) ... can anyone suggest a method?
Ideally I'd like to be able to adapt this to get n random words from several files (i.e. randomly select n words from all files; i.e. just read them into one merged array and choose from there?)
Thanks for any advice ...
-- Noon Silk
Fancy a quantum lunch? https://sites.google.com/site/quantumlunch/
"Every morning when I wake up, I experience an exquisite joy — the joy of being this signature."
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Eugene Kirpichov Principal Engineer, Mirantis Inc. http://www.mirantis.com/ Editor, http://fprog.ru/

On Sun, Jun 10, 2012 at 6:21 AM, Noon Silk
Hi,
I'm clearly new to haskell, and I suppose those is a basic question, but after a bit of searching I've been unable to figure out the "best" way to do this. I was first trying to find out how to, say, get a random element from a list, but I'm starting to think that may not be the best way (because list access isn't constant time, among other reasons). So, I wonder if there is a more direct way to just get a random word from a wordfile (i.e. one word per line) ... can anyone suggest a method?
My preferred option, assuming the file sizes make it amenable, is to use Iteratees to fold the wordlists into an IntMap of words. You can then take the size of the map and choose n unique Ints in the range. Note that any algorithm is going to include these basic steps. (Get the size, pick randoms, access the words at the key/line number). The most direct approach would basically be imperative and in the IO monad, using things like openFile and hSeek. This approach gets harder if you want to pull words out of multiple wordlists.

An interesting related problem is if you are only allowed one pass through the data how would you randomly choose one word. -- -- Regards, KC

Pick the i-th word (replacing the previously chosen word, if any) with
probability 1/i? (numbering of words starts from 1 instead of 0).
On 11 June 2012 11:13, KC
An interesting related problem is if you are only allowed one pass through the data how would you randomly choose one word.
-- -- Regards, KC
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

KC:
An interesting related problem is if you are only allowed one pass through the data how would you randomly choose one word.
Let's choose n items. You must know the length of the sequence, of course, otherwise the 'probability' loses its sense. So, for lists it might not be "just one pass"... Suppose the length of the sequence be m. Suppose you have a random generator called rg, just a simple function which transforms: seed -> seed' , between 0 and 1 Make n and m real to make the typechecker happy. Then the most straightforward solution for lists is: nran l n = nr l m n seed where m = fromIntegral(length(l)) nr [] _ _ _ = [] nr (x:q) m n seed = let seed'=rg seed in if seed' < n/m then x:nr q (m-1) (n-1) seed' else nr q (m-1) n seed' -- ========= Now, you may make it tail-recursive, use a different random generation protocol, or whatever. I believe that this solution is known for years... Jerzy Karczmarczuk
participants (6)
-
Aditya Manthramurthy
-
Alexander Solla
-
Eugene Kirpichov
-
Jerzy Karczmarczuk
-
KC
-
Noon Silk