
On Mon, Jul 19, 2010 at 9:24 AM, David Virebayre wrote:
A minor point: instead of removing the punctuation, you should perhaps convert it to whitespace.
Otherwise, in texts like "there was a quick,brown fox" (notice the missing space after the comma), you'll get the single word "quickbrown" instead of the two words "quick" and "brown".
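
To make the difference concrete, here is a minimal sketch in Python (the thread does not show the original code, so the language and the helper names words_removing and words_spacing are assumptions made purely for illustration). It treats every character in string.punctuation as punctuation, which is also an assumption.

```python
import string

# Hypothetical helpers, not taken from the original thread.
# Translation table that deletes every ASCII punctuation character.
_DELETE = str.maketrans("", "", string.punctuation)
# Translation table that maps every ASCII punctuation character to a space.
_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def words_removing(text):
    """Split into words after deleting punctuation outright."""
    return text.translate(_DELETE).split()

def words_spacing(text):
    """Split into words after turning punctuation into whitespace."""
    return text.translate(_TO_SPACE).split()

sample = "there was a quick,brown fox"
print(words_removing(sample))  # ['there', 'was', 'a', 'quickbrown', 'fox']
print(words_spacing(sample))   # ['there', 'was', 'a', 'quick', 'brown', 'fox']
```

Deleting the comma glues its neighbours together, while mapping it to a space keeps them as separate words.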
If you remove punctuation you

 - run the risk of joining two valid words into one invalid word:
   "quick,brown" -> "quickbrown"
 - run the risk of converting one word into a different word:
   "can't" -> "cant"
   "won't" -> "wont"

If you split at punctuation you create more semi-words:

   "can't" -> "can", "t"
   "shouldn't" -> "shouldn", "t"

Might it be better to regard in-word apostrophes as letters in this case?

-- 
Dougal Stanton
dougal@dougalstanton.net // http://www.dougalstanton.net
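
P.S. A rough sketch of the "apostrophes as letters" idea, written in Python only for illustration (the pattern WORD and the function words are hypothetical names, and the sketch assumes ASCII letters and the plain ' character only). An apostrophe counts as part of a word only when it has letters on both sides; everything else acts as a separator.

```python
import re

# Hypothetical pattern: a run of letters, optionally followed by one or
# more groups of "apostrophe + letters". Assumes ASCII text only.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def words(text):
    """Return lowercase words, keeping in-word apostrophes."""
    return WORD.findall(text.lower())

print(words("A quick,brown fox can't jump; it won't 'try'."))
# ['a', 'quick', 'brown', 'fox', "can't", 'jump', 'it', "won't", 'try']
```

With this rule "quick,brown" still splits into two words, "can't" and "won't" survive intact, and quotation marks around a word are dropped because they are not between letters.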