
On Fri, Jul 1, 2011 at 2:52 PM, Dmitri O. Kondratiev wrote:
Is there any Haskell word tokenizer other than 'toktok' that compiles and works? I need something like: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctT...
I don't think this exists out of the box, but since it appears to be a basic regex tokenizer, you could use Data.List.Split to create one (or one of the regex libraries may be able to do this more simply).

If you go the Data.List.Split route, I suspect you'll want to create a Splitter based on the whenElt Splitter: http://hackage.haskell.org/packages/archive/split/0.1.1/doc/html/Data-List-S... which takes a function from an element to a Bool. You can implement that function however you wish, possibly with a regular expression, although it will have to be pure.

If you want something like a maxent tokenizer, then you're currently out of luck :( (as far as I know).

--Rogan
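For illustration, here is a minimal sketch of a word/punctuation tokenizer built on a pure predicate over elements, as suggested above. It is an assumption-laden stand-in rather than a port of the NLTK class: it uses only groupBy from base instead of the whenElt Splitter, it approximates the regex \w+|[^\w\s]+ with isAlphaNum plus underscore, and the names CharClass, classify and wordPunctTokenize are made up for this example.

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.Function (on)
import Data.List (groupBy)

-- Hypothetical three-way character classification (not from any library).
data CharClass = WordChar | SpaceChar | PunctChar
  deriving (Eq, Show)

-- Pure predicate-style classifier: an approximation of \w, \s and "other".
classify :: Char -> CharClass
classify c
  | isAlphaNum c || c == '_' = WordChar
  | isSpace c                = SpaceChar
  | otherwise                = PunctChar

-- Group maximal runs of same-class characters, then drop whitespace runs.
-- groupBy never yields empty groups, so the (c:_) pattern always matches.
wordPunctTokenize :: String -> [String]
wordPunctTokenize s =
  [ g | g@(c:_) <- groupBy ((==) `on` classify) s, classify c /= SpaceChar ]
```

In GHCi, wordPunctTokenize "Hello, world! It's $3.50..." evaluates to ["Hello",",","world","!","It","'","s","$","3",".","50","..."]. The same classify function could in principle be split into simple Bool predicates and handed to a Data.List.Split Splitter instead.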