
So anyway, today I found myself wanting to build a program that generates some test data. I managed to make it *work* without too much difficulty, but it wasn't exactly "elegant". I'm curious to know how higher-order minds would approach this problem. It's not an especially "hard" problem, and I'm sure there are several good solutions possible.

I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?] I want to end up with a file that contains a randomly-chosen selection of words. Actually, I'd like the end result to be a LaTeX file, containing sections, subsections, paragraphs and sentences. (Although obviously the actual sentences will be gibberish.) I'd like to be able to select how big the file should be, to within a few dozen characters anyway. Exact precision is not required. How would you do this?

The approach I came up with is to slurp up the words like so:

  raw <- readFile "words.txt"
  let ws = words raw
  let n  = length ws
  let wa = listArray (1,n) ws

(I actually used lazy ByteStrings of characters.) So now I have an array of packed ByteStrings, and I can pick array indexes at random and use "unwords" to build my gibberish "sentences". The fun part is making a sentence come out the right size. There are two obvious possibilities:

- Assume that all words are approximately N characters long, and estimate how many words a sentence therefore needs to contain to have the required length.
- Start with an empty list and *actually count* the characters as you add each word. (You can prepend them rather than append them for extra efficiency.)

I ended up taking the latter approach - at least at the sentence level. What I actually did was to write a function that builds lists of words. There is then another function that builds several sublists of approximately the prescribed lengths, inserts some commas, capitalises the first letter and appends a full stop. This therefore generates a "sentence". After that, there's a function that builds several sentences of random size with random numbers of commas and makes a "paragraph" out of them. Next a function gathers several paragraphs and inserts a randomly-generated subsection heading. A similar function takes several subsections and adds a random section heading.

In my current implementation, all of this is in the IO monad (so I can pick things randomly). Also, the whole thing looks very repetitive; really, if I thought about it harder, it ought to be possible to factor it out into something smaller and more elegant.

The clause function builds clauses of "approximately" the right size, and each function above that (sentence, paragraph, subsection, section) becomes progressively less accurate in its sizing. On the other hand, the final generated text always has, for example, 4 subsections to each section, and they always look visually the same size. I'd like to make it more random, but all the code works by estimating "roughly" how big a particular construct is, and therefore how many of them are required to fill N characters. For larger and larger N, actually counting this stuff would seem somewhat inefficient, so I'm estimating. But that makes it hard to add more randomness without losing overall size control. The final file can end up being a fair way different in size from what I requested, and has an annoyingly regular internal structure. But essentially, it works.
It would be nice to modify the code to generate HTML - but since it's coded so simple-mindedly, that would be a copy & paste job. Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
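A minimal sketch of the character-counting approach described above might look like the following (plain Strings and System.Random rather than lazy ByteStrings; names like loadWordArray, pickWord and buildSentence are illustrative, not the actual code):

  import Data.Array (Array, listArray, bounds, (!))
  import System.Random (randomRIO)

  -- Load the word list into an array for O(1) random access.
  loadWordArray :: FilePath -> IO (Array Int String)
  loadWordArray path = do
    raw <- readFile path
    let ws = words raw
    return (listArray (1, length ws) ws)

  -- Pick one word uniformly at random.
  pickWord :: Array Int String -> IO String
  pickWord wa = do
    i <- randomRIO (bounds wa)
    return (wa ! i)

  -- Keep adding random words until the character budget is spent,
  -- counting characters as we go (the "actually count" strategy).
  buildSentence :: Array Int String -> Int -> IO [String]
  buildSentence wa target = go 0 []
    where
      go len acc
        | len >= target = return (reverse acc)
        | otherwise     = do
            w <- pickWord wa
            go (len + length w + 1) (w : acc)   -- +1 for the space

Capitalising the first word and appending a full stop turns the [String] into a "sentence", and the same pattern, with coarser counting, scales up to paragraphs, subsections and sections.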

Andrew Coppin writes:
Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
If you want text that roughly resembles English, you're better off getting a corpus of real English text and running it through a Markov chain. Mark Dominus has written a few blog posts about this topic recently, see http://blog.plover.com/lang/finnpar.html. G.

Gregory Collins wrote:
Andrew Coppin writes:
Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
If you want text that roughly resembles English, you're better off getting a corpus of real English text and running it through a Markov chain. Mark Dominus has written a few blog posts about this topic recently, see http://blog.plover.com/lang/finnpar.html.
This is probably overkill for my purposes. However, if you can find me a source that explains what a "Markov chain" actually *is*, I'd be quite interested. [I've seen it mentioned several times in relation to data compression, but Wikipedia's article is too cryptic for me to comprehend. I think Wikipedia is just a poor way to learn completely new concepts; Wikipedia's description of digital filters is also utterly incomprehensible, but it turns out the subject isn't actually that hard.]

On Jun 4, 2008, at 3:50 PM, Andrew Coppin wrote:
However, if you can find me a source that explains what a "Markov chain" actually *is*, I'd be quite interested.
In a non-rigorous nutshell: You have the word "star". You want to pick another word to follow it. It turns out that, based on analyzing a corpus ("corpus" here means "bunch of text you extract information from"), the following word pairs occur:

  star wars (20% of the time)
  star cluster (5% of the time)
  star dust (10% of the time)
  star system (25% of the time)
  star -end-of-sentence- (5% of the time)
  ...

So you use those occurrence statistics to pick a feasible next word (let's choose "system", since it's the highest probability here -- in practice you'd probably choose one randomly based on a weighted likelihood). Then you look for all the word pairs which start with "system", and choose the next word in the same fashion. Repeat for as long as you want.

Those word-pair statistics, when you have them for all the words in your vocabulary, comprise the first-level Markov data for your corpus. When you extend it to word triplets, it's second-level Markov data (and it will generate more reasonable fake text). You can build higher and higher Markov levels if you'd like.

And, ultimately, though the example is about text, you can use this method to generate "realistic" sequences of any sequential data.

Hope that helps.

-johnnnnnnn
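A rough Haskell sketch of that idea, for the curious: build the first-level word-pair counts from a corpus, then walk them with a weighted random choice. (Names like buildStats and pickNext are made up for illustration; this is not from any particular package.)

  import qualified Data.Map as Map
  import Data.Map (Map)
  import System.Random (randomRIO)

  -- For each word, count how often each successor follows it.
  type Stats = Map String (Map String Int)

  buildStats :: [String] -> Stats
  buildStats ws =
    Map.fromListWith (Map.unionWith (+))
      [ (a, Map.singleton b 1) | (a, b) <- zip ws (drop 1 ws) ]

  -- Pick a successor with probability proportional to its count.
  pickNext :: Stats -> String -> IO (Maybe String)
  pickNext stats w =
    case Map.lookup w stats of
      Nothing    -> return Nothing
      Just succs -> do
        r <- randomRIO (1, sum (Map.elems succs))
        return (Just (choose r (Map.toList succs)))
    where
      choose r ((s, c) : rest)
        | r <= c    = s
        | otherwise = choose (r - c) rest
      choose _ [] = error "counts exhausted"

  -- Emit the seed word followed by up to n successors.
  generate :: Stats -> String -> Int -> IO [String]
  generate _     w 0 = return [w]
  generate stats w n = do
    next <- pickNext stats w
    case next of
      Nothing -> return [w]
      Just w' -> (w :) <$> generate stats w' (n - 1)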

On Wed, 4 Jun 2008, John Melesky wrote:
So you use those occurrence statistics to pick a feasible next word (let's choose "system", since it's the highest probability here -- in practice you'd probably choose one randomly based on a weighted likelihood). Then you look for all the word pairs which start with "system", and choose the next word in the same fashion. Repeat for as long as you want.
"Markov chain" means, that you have a sequence of random experiments, where the outcome of each experiment depends exclusively on a fixed number (the level) of experiments immediately before the current one.
Those word-pair statistics, when you have them for all the words in your vocabulary, comprise the first-level Markov data for your corpus.
When you extend it to word triplets, it's second-level Markov data (and it will generate more reasonable fake text). You can build higher and higher Markov levels if you'd like.
If the level is too high, you will just reproduce the training text.

Henning Thielemann wrote:
"Markov chain" means, that you have a sequence of random experiments, where the outcome of each experiment depends exclusively on a fixed number (the level) of experiments immediately before the current one.
Right. So a "Markov chain" is actually a technical way of describing something that's intuitively pretty obvious? (E.g., PPM compression works by assuming that the input data is some sort of Markov chain with as-yet unknown transition probabilities.)
If the level is too high, you will just reproduce the training text.
Yeah, I can see that happening! ;-) The key, I think, is for the training set to be much larger than what you want to produce...

G'day.
Quoting Andrew Coppin
Right. So a "Markov chain" is actually a technical way of describing something that's intuitively pretty obvious? (E.g., PPM compression works by assuming that the input data is some sort of Markov chain with as-yet unknown transition probabilities.)
Yes. In fact, DMC compression (which has been proven to be the same thing as PPM up to isomorphism) explicitly uses a Markov model. If you're curious, I recently put some code for building dynamic Markov models here; it's not pretty, but you might find it useful: http://andrew.bromage.org/darcs/dynamicmarkov/

Cheers,
Andrew Bromage

Hello Andrew, Thursday, June 5, 2008, 12:50:28 AM, you wrote:
However, if you can find me a source that explains what a "Markov chain" actually *is*, I'd be quite interested.
*afaik*, the idea is simple: an order-1 Markov chain is just text generated with respect to the probability of each char, and order-n is text generated using the probability of each char in the context of the previous n-1 chars. You just need to gather stats from previous papers :) You can use this idea both to generate new natural-looking words and to gather stats about the words themselves to generate natural-looking sentences.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
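A sketch of those character-level, order-n statistics (charStats is a made-up name): map every (n-1)-character context to counts of the character seen immediately after it.

  import qualified Data.Map as Map
  import Data.Map (Map)
  import Data.List (tails)

  -- For an order-n model: map every (n-1)-character context to counts
  -- of the character observed immediately after it in the training text.
  charStats :: Int -> String -> Map String (Map Char Int)
  charStats n text =
    Map.fromListWith (Map.unionWith (+))
      [ (ctx, Map.singleton c 1)
      | chunk <- tails text
      , let (ctx, rest) = splitAt (n - 1) chunk
      , c : _ <- [rest]          -- skips suffixes too short to have a following char
      ]

Generation then works just like the word-level example earlier in the thread: look up the last n-1 characters emitted and pick the next character with probability proportional to its count.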

Gregory Collins wrote:
Andrew Coppin writes:
Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
If you want text that roughly resembles English, you're better off getting a corpus of real English text and running it through a Markov chain. Mark Dominus has written a few blog posts about this topic recently, see http://blog.plover.com/lang/finnpar.html.
If you run one over obscure academic papers, you can even generate publishable results. I don't have a link ready, but there was a fun incident involving this. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

On Wed, 4 Jun 2008, Achim Schneider wrote:
Gregory Collins wrote:
Andrew Coppin writes:
Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
If you want text that roughly resembles English, you're better off getting a corpus of real English text and running it through a Markov chain. Mark Dominus has written a few blog posts about this topic recently, see http://blog.plover.com/lang/finnpar.html.
If you run one over obscure academic papers, you can even generate publishable results. I don't have a link ready, but there was a fun incident involving this.
A famous paper generator is http://pdos.csail.mit.edu/scigen/

Hello Andrew, Wednesday, June 4, 2008, 10:33:00 PM, you wrote:
I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?] I want to end up with a file that contains a randomly-chosen selection of words.
Was this letter generated automatically? :) Good work! :)))

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Andrew Coppin writes:
I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?]
Looking at /usr/share/dict/words, I'm assured that the proper separator is \n.
Clearly, what I *should* have done is think more about a good abstraction before writing miles of code. ;-) So how would you guys do this?
Generate a Map Int [String] map, with the latter list being an infinite list of words of that particular size.

Now assume that you want a 100-character sentence. You start by checking whether you have any 100-character word; if yes, that's your sentence; if not, you divide the length in half (maybe offset by a weighted random factor [1]) and start over again.

You can then specify your whole document along the lines of

  (capitalise $ words 100) ++ ". " ++ (capitalise $ words 10) ++ "?" ++ (capitalise $ words 20) ++ "oneone1!"

[1] Random midpoint displacement is a very interesting topic by itself.

-- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.
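A rough sketch of that scheme (sizeMap and phraseOfLength are invented names, and the weighted random offset is left out): group a pre-shuffled word list by length, then recursively split the character budget until words of the right sizes exist.

  import qualified Data.Map as Map
  import Data.Map (Map)

  -- Group an (ideally pre-shuffled) word list by word length.  'cycle'
  -- makes each bucket effectively infinite, as suggested above.
  sizeMap :: [String] -> Map Int [String]
  sizeMap ws = Map.map cycle (Map.fromListWith (++) [ (length w, [w]) | w <- ws ])

  -- Build a phrase of roughly n characters: if a word of exactly that
  -- length exists, take one; otherwise split the budget in half and
  -- recurse on both halves.  A real version would also thread the map
  -- through so consecutive calls consume different words per bucket.
  phraseOfLength :: Map Int [String] -> Int -> [String]
  phraseOfLength m n
    | n <= 0    = []
    | otherwise =
        case Map.lookup n m of
          Just (w : _) -> [w]
          _            -> let half = n `div` 2
                          in  phraseOfLength m half
                           ++ phraseOfLength m (n - half - 1)  -- -1 for the joining space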

Achim Schneider wrote:
Andrew Coppin wrote:
I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?]
Looking at /usr/share/dict/words, I'm assured that the proper separator is \n.
Thanks. I did look around trying to find this, but ultimately failed. (Is it a standard component, or is it installed as part of some specific application?) As I understand it, Haskell's "words" function will work on any kind of white space - spaces, line feeds, carriage returns, tabs, etc. - so it should be fine. ;-) Since I'm developing on Windows, what I actually did was have Google find me a file online that I could download. [Remember my post a while back? "GHC panic"? Apparently GHC doesn't like it if you try to represent the entire 400 KB file as a single [String]...]
Generate a Map Int [String] map, with the latter list being an infinite list of words with that particular size.
Now assume that you want a 100-character sentence. You start by checking whether you have any 100-character word; if yes, that's your sentence; if not, you divide the length in half (maybe offset by a weighted random factor [1]) and start over again.
You can then specify your whole document along the lines of
(capitalise $ words 100) ++ ". " ++ (capitalise $ words 10) ++ "?" ++ (capitalise $ words 20) ++ "oneone1!"
[1] Random midpoint displacement is a very interesting topic by itself.
I'm not following your logic, sorry...

Andrew Coppin wrote:
Achim Schneider wrote:
Andrew Coppin wrote:
I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?]
Looking at /usr/share/dict/words, I'm assured that the proper separator is \n.
Thanks. I did look around trying to find this, but ultimately failed. (Is it a standard component, or is it installed as part of some specific application?)
ksf@solaris ~ % equery b /usr/share/dict/words
[ Searching for file(s) /usr/share/dict/words in *... ]
sys-apps/miscfiles-1.4.2 (/usr/share/dict/words)

ksf@solaris ~ % eix miscfiles
[I] sys-apps/miscfiles
    Available versions:  1.4.2 {minimal}
    Installed versions:  1.4.2(18:27:27 02/14/07)(-minimal)
    Homepage:            http://www.gnu.org/directory/miscfiles.html
    Description:         Miscellaneous files
Generate a Map Int [String] map, with the latter list being an infinite list of words with that particular size.
Now assume that you want a 100-character sentence. You start by checking whether you have any 100-character word; if yes, that's your sentence; if not, you divide the length in half (maybe offset by a weighted random factor [1]) and start over again.
You can then specify your whole document along the lines of
(capitalise $ words 100) ++ ". " ++ (capitalise $ words 10) ++ "?" ++ (capitalise $ words 20) ++ "oneone1!"
[1] Random midpoint displacement is a very interesting topic by itself.
I'm not following your logic, sorry...
That's probably because I just described the points and not the rest of the morphisms... imagine some plumbing and tape between my sentences.

Midpoint displacement is a great way to achieve randomness while still keeping a uniform appearance. In the defining paper, which I don't have to hand right now, an example was shown where a realistic outline of Australia was generated from ten or so data points: if you display it next to the actual outline, only a geographer could tell which one is the fake.

-- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.
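For reference, a one-dimensional sketch of midpoint displacement (hypothetical code, not from the thread): recursively split an interval and nudge each midpoint by a random amount that shrinks with depth.

  import System.Random (randomRIO)

  -- Displace the midpoint of the segment (a, b) by a random offset,
  -- then recurse on both halves with a smaller allowed offset.
  -- Returns the sequence of values along the jagged result.
  midpointDisplace :: Int -> Double -> Double -> Double -> IO [Double]
  midpointDisplace 0     _         a b = return [a, b]
  midpointDisplace depth roughness a b = do
    offset <- randomRIO (-roughness, roughness)
    let mid = (a + b) / 2 + offset
    left  <- midpointDisplace (depth - 1) (roughness / 2) a mid
    right <- midpointDisplace (depth - 1) (roughness / 2) mid b
    return (left ++ drop 1 right)   -- avoid duplicating the midpoint

Applied to the text generator, the same trick could carve a section's character budget into sub-lengths that vary but still sum to roughly the requested total.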

Achim Schneider wrote:
Andrew Coppin wrote:
Achim Schneider wrote:
Andrew Coppin wrote:
I have a file that contains several thousand words, separated by white space. [I gather that on Unix there's a standard location for this file?]
Looking at /usr/share/dict/words, I'm assured that the proper separator is \n.
Thanks. I did look around trying to find this, but ultimately failed. (Is it a standard component, or is it installed as part of some specific application?)
ksf@solaris ~ % equery b /usr/share/dict/words
[ Searching for file(s) /usr/share/dict/words in *... ]
sys-apps/miscfiles-1.4.2 (/usr/share/dict/words)

ksf@solaris ~ % eix miscfiles
[I] sys-apps/miscfiles
    Available versions:  1.4.2 {minimal}
    Installed versions:  1.4.2(18:27:27 02/14/07)(-minimal)
    Homepage:            http://www.gnu.org/directory/miscfiles.html
    Description:         Miscellaneous files
On Ubuntu (and supposedly Debian):

ben@sarun[1]: ~ > dpkg -S /usr/share/dict/words
dictionaries-common: /usr/share/dict/words

Cheers
Ben

On Wed, 4 Jun 2008, Andrew Coppin wrote:
How would you do this?
The approach I came up with is to slurp up the words like so:
raw <- readFile "words.txt"
let ws = words raw
let n  = length ws
let wa = listArray (1,n) ws
(I actually used lazy ByteStrings of characters.) So now I have an array of packed ByteStrings, and I can pick array indexes at random and use "unwords" to build my gibberish "sentences".
Sounds like a generator for scientific articles. :-) Maybe http://hackage.haskell.org/cgi-bin/hackage-scripts/package/markov-chain can be of help for you. It's also free of randomIO.
In my current implementation, all of this is in the IO monad (so I can pick things randomly).
Do you know of http://www.haskell.org/pipermail/haskell-cafe/2006-December/020005.html ?
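For illustration, a small sketch of picking random words without IO, threading a StdGen by hand (pickWords and demo are invented names):

  import Data.Array (Array, bounds, (!))
  import System.Random (StdGen, mkStdGen, randomR)

  -- Pick n random words purely, threading the generator explicitly.
  pickWords :: Array Int String -> Int -> StdGen -> ([String], StdGen)
  pickWords wa = go
    where
      go 0 g = ([], g)
      go k g = let (i, g')   = randomR (bounds wa) g
                   (ws, g'') = go (k - 1) g'
               in  (wa ! i : ws, g'')

  -- A completely deterministic, reproducible "random" sentence.
  demo :: Array Int String -> String
  demo wa = unwords (fst (pickWords wa 8 (mkStdGen 42)))

Control.Monad.State can hide the generator threading if it gets noisy, and System.Random.split lets independent subcomputations have their own generators.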

Henning Thielemann wrote:
Sounds like a generator for scientific articles. :-) Maybe http://hackage.haskell.org/cgi-bin/hackage-scripts/package/markov-chain can be of help for you. It's also free of randomIO.
That certainly looks interesting. Presumably if I train it right, it'll figure out that sentences need to start uppercase and end with a full-stop, and maybe have a few other punctuation marks thrown in. I'm not sure I trust it to generate valid LaTeX markup, but I can give it a try! ;-) At any rate, thanks for the link - the possibilities for this look very interesting. (All sorts of data to be generated. And hey, maybe if I read the source I can even learn some of the theory behind it too...)

You might want to skim Shannon's 'A Mathematical Theory of Communcations'. Part 1, Section 2 and 3 are almost exactly your topic. (Or at least the direction you've headed in. :)

http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

-ljr

Andrew Coppin wrote:
Henning Thielemann wrote:
Sounds like a generator for scientific articles. :-) Maybe
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/markov-chain can be of help for you. It's also free of randomIO.
That certainly looks interesting. Presumably if I train it right, it'll figure out that sentences need to start uppercase and end with a full-stop, and maybe have a few other punctuation marks thrown in. I'm not sure I trust it to generate valid LaTeX markup, but I can give it a try! ;-)
At any rate, thanks for the link - the possibilities for this look very interesting. (All sorts of data to be generated. And hey, maybe if I read the source I can even learn some of the theory behind it too...)

On Wed, Jun 4, 2008 at 3:23 PM, Lanny Ripple wrote:
You might want to skim Shannon's 'A Mathematical Theory of Communcations'. Part 1, Section 2 and 3 are almost exactly your topic. (Or at least the direction you've headed in. :)
http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
Rob Pike's "Practice of Programming" has a few chapters with a markov generator in several different languages as a case study. None of them are functional, but it's a pretty clear description of the idea. The "plan9" in the url reminded me :)

Wow. Did I really quote the article as "Communcations"? (A new, little-known work by Shannon! :/) The link is correct anyway.

-ljr

Evan Laforge wrote:
On Wed, Jun 4, 2008 at 3:23 PM, Lanny Ripple wrote:
You might want to skim Shannon's 'A Mathematical Theory of Communcations'. Part 1, Section 2 and 3 are almost exactly your topic. (Or at least the direction you've headed in. :)
http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
Rob Pike's "Practice of Programming" has a few chapters with a markov generator in several different languages as a case study. None of them are functional, but it's a pretty clear description of the idea.
The "plan9" in the url reminded me :)

On Wed, 4 Jun 2008, Andrew Coppin wrote:
Henning Thielemann wrote:
Sounds like a generator for scientific articles. :-) Maybe http://hackage.haskell.org/cgi-bin/hackage-scripts/package/markov-chain can be of help for you. It's also free of randomIO.
That certainly looks interesting. Presumably if I train it right, it'll figure out that sentences need to start uppercase and end with a full-stop, and maybe have a few other punctuation marks thrown in.
If you use different encodings for the periods in abbreviations and the periods at sentence ends, this is guaranteed.
I'm not sure I trust it to generate valid LaTeX markup, but I can give it a try! ;-)
You may not want to construct a Markov chain over characters but over larger objects, say words and LaTeX commands. However, when it comes to correctly matched parentheses, a Markov chain might not be the appropriate tool; a grammar would be.
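To illustrate the grammar point: if the generator builds a small document tree and only then renders it, brace and environment balance is correct by construction. A toy sketch (the Doc type and render are invented for illustration):

  -- A tiny document grammar: rendering it can only ever produce
  -- balanced braces and \begin/\end pairs, unlike a flat token stream.
  data Doc
    = Text String
    | Section String [Doc]   -- \section{title} followed by its body
    | Env String [Doc]       -- \begin{name} ... \end{name}

  render :: Doc -> String
  render (Text s) = s ++ "\n"
  render (Section title body) =
    "\\section{" ++ title ++ "}\n" ++ concatMap render body
  render (Env name body) =
    "\\begin{" ++ name ++ "}\n" ++ concatMap render body ++ "\\end{" ++ name ++ "}\n"

A random generator then only has to choose which constructors to build (with sizes controlled as elsewhere in the thread), and the rendering step guarantees well-formed LaTeX.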

Hello Henning, Thursday, June 5, 2008, 12:55:19 AM, you wrote:
Sounds like a generator for scientific articles. :-) Maybe
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/markov-chain can be of help for you. It's also free of randomIO.
Once I used something like this to generate this brilliant text:

can INR-s O, by thenion, Laertes, These at these haste is defeling and mornione. KING CLAUDIUS Heat killowers, and our vowst' GUILDENSTERN Ay, magamovinob vow; And lo leave they come, and he sole tobly all you, And, or spomir: So, wherefore hiacriful at ast, my lord, I have no orason; Frous stonbitter; at I'll I think himbers, seempended body jooke! That would be tracious arest, Rt: sor 'tis timeake or bifuchol gugnme: If thoughRqublest they are gave with this cercric if, for the no POLONIUS] KING CLAUDIUS I will violect of the queen with Dity. ROSENCRANTZ 'Stive to came fat: the ne further then, which he his worldy, it is means me.

I've put a full copy at http://haskell.org/bz/not_exactly_hamlet.txt

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Henning Thielemann wrote:
Sounds like a generator for scientific articles. :-) Maybe http://hackage.haskell.org/cgi-bin/hackage-scripts/package/markov-chain can be of help for you. It's also free of randomIO.
I once invented this, though ungeneralised, for the map generator of an RTS... the river always went from left to right, at approximately the middle of the map, its direction depending on its current offset from that middle, its width (that is, a wide river can bend upwards and downwards by more than one tile) and a random factor. You can also express this using a Markov chain.

-- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.
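A sketch of that river idea as a random walk biased back towards the middle (riverPath is an invented name): each next row depends only on the current one, which is exactly a first-order Markov process.

  import System.Random (randomRIO)

  -- Walk across the map from left to right, producing one row index per
  -- column.  Each step adds a random nudge plus a pull back towards the
  -- middle row, so the river meanders without wandering off the map.
  riverPath :: Int -> Int -> IO [Int]
  riverPath width middle = go width middle
    where
      go 0 _ = return []
      go n y = do
        nudge <- randomRIO (-1, 1)
        let pull = signum (middle - y)
            y'   = y + nudge + pull
        (y :) <$> go (n - 1) y'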
participants (11)
- Achim Schneider
- ajb@spamcop.net
- Andrew Coppin
- Ben Franksen
- Bulat Ziganshin
- Evan Laforge
- Gregory Collins
- Henning Thielemann
- John Melesky
- Jules Bean
- Lanny Ripple