
On 7/07/2011, at 7:04 AM, Dmitri O. Kondratiev wrote:
I am looking for a Haskell implementation of a sentence tokenizer such as the one described by Tibor Kiss and Jan Strunk in “Unsupervised Multilingual Sentence Boundary Detection”, as implemented in NLTK.
That method is multilingual, but it relies on the text being written using fairly modern Western conventions, and it tackles the problem of "too many dots": not knowing which periods are abbreviation points and which are full stops. I don't suppose anyone knows of something that might help with the opposite problem of too few dots? Run-on sentences are one example.
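To make the "too many dots" problem concrete, here is a minimal sketch (my own illustration, not the Kiss–Strunk algorithm) of the naive baseline that Punkt-style detection improves on: split on every sentence-final punctuation mark. The function name `naiveSplit` is hypothetical.

```haskell
-- Naive baseline: treat every '.', '!' or '?' as a sentence boundary.
-- Abbreviation points defeat this, as the usage example below shows.
naiveSplit :: String -> [String]
naiveSplit = go []
  where
    go acc [] = [reverse acc | not (null acc)]
    go acc (c:cs)
      | c `elem` ".!?" = reverse (c:acc) : go [] (dropWhile (== ' ') cs)
      | otherwise      = go (c:acc) cs
```

Running `naiveSplit "Dr. Smith arrived. He left."` yields `["Dr.","Smith arrived.","He left."]`, wrongly breaking after the abbreviation "Dr." — exactly the ambiguity that an unsupervised boundary detector has to resolve from corpus statistics.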
I've been working for the past year or so on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
One of the issues I've had with a POS tagger I've been using is that it makes some really bad decisions that could be patched up with a few simple rules; but since it's distributed as a .jar file, I cannot add those rules.