
On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
Hi, Continuing my search of Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling them does not help): 1) End of Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
Depending on what you mean, this is either fairly trivial (for English) or an ill-defined problem. Determining whether a "." character is intended as a full stop or as part of an abbreviation is trivial.
I disagree. It's not trivial in the sense of being solved. It is trivial only in the sense that, usually, one would use a list of known abbreviations and just compare. This, however, just says that the most common approach is trivial, not that the problem is. There are cases where, for example, an abbreviation and a full stop coincide (a sentence ending in "etc.", say). In these cases, you'll often need full-blown parsing or at least a well-trained maxent classifier. There are other problems: ordinals, acronyms which themselves contain periods, weird names (like Yahoo!) and initials, to name a few. This is only for English and similar languages, mind you.
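To make the "list of known abbreviations" approach concrete, here is a minimal sketch in Haskell using only base. The names `abbreviations`, `endsSentence` and `splitSentences` are made up for illustration and do not come from any existing library, and the abbreviation list is a tiny stand-in for a real one.

```haskell
import Data.Char (toLower)

-- Hypothetical, deliberately tiny abbreviation list (lower-cased).
-- A real list would be much larger and language-specific.
abbreviations :: [String]
abbreviations = ["mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc.", "vs."]

-- A token closes a sentence if it ends in '.', '!' or '?'
-- and is not one of the known abbreviations.
endsSentence :: String -> Bool
endsSentence tok =
  not (null tok)
    && last tok `elem` ".!?"
    && map toLower tok `notElem` abbreviations

-- Split whitespace-separated text into sentences.
-- Tokens of the current sentence are accumulated in reverse order.
splitSentences :: String -> [String]
splitSentences = go [] . words
  where
    go acc [] = [unwords (reverse acc) | not (null acc)]
    go acc (t : ts)
      | endsSentence t = unwords (reverse (t : acc)) : go [] ts
      | otherwise = go (t : acc) ts
```

With this sketch, `splitSentences "Dr. Smith arrived. He was late."` gives `["Dr. Smith arrived.","He was late."]`. It also shows the failure modes: `splitSentences "We bought apples, pears, etc. Then we left."` returns a single sentence because "etc." is treated as an abbreviation only, and a name like "Yahoo!" in the middle of a sentence causes a spurious break.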
But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.
As long as you have one routine and stick to it, you don't need a formal definition every linguist will agree on. Computational linguists (and their tools), more often than not, just need a dependable solution, not a correct one.
2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
There are numerous approaches to this problem; do you care about the solution, or will any one of them suffice?
I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
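For readers unfamiliar with HMM-based tagging, the following is a minimal sketch of first-order (bigram) Viterbi decoding in Haskell. It is not the tagger described above, which is unreleased. The tag set, the probabilities, and the names `initial`, `transition`, `emission` and `viterbi` are all invented for illustration.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

type Tag = String
type Prob = Double

-- Toy tag set; every number below is made up for illustration.
tags :: [Tag]
tags = ["DET", "NOUN", "VERB"]

-- P(tag at sentence start)
initial :: Tag -> Prob
initial "DET" = 0.6
initial "NOUN" = 0.3
initial _ = 0.1

-- P(next tag | previous tag); first argument is the previous tag.
transition :: Tag -> Tag -> Prob
transition "DET" "NOUN" = 0.9
transition "DET" _ = 0.05
transition "NOUN" "VERB" = 0.7
transition "NOUN" "NOUN" = 0.2
transition "NOUN" _ = 0.1
transition "VERB" "DET" = 0.6
transition "VERB" "NOUN" = 0.3
transition _ _ = 0.1

-- P(word | tag), with a crude constant for unseen pairs.
emission :: Tag -> String -> Prob
emission "DET" "the" = 0.9
emission "NOUN" "dog" = 0.5
emission "NOUN" "barks" = 0.1
emission "VERB" "barks" = 0.6
emission _ _ = 0.001

-- Viterbi decoding: for each tag, keep the best-scoring path ending in
-- that tag (stored in reverse), then extend all paths word by word.
-- Raw probabilities are used for brevity; a real tagger would work in
-- log space to avoid underflow on long sentences.
viterbi :: [String] -> [Tag]
viterbi [] = []
viterbi (w : ws) = reverse (snd (maximumBy (comparing fst) final))
  where
    start = [(initial t * emission t w, [t]) | t <- tags]
    final = foldl step start ws
    step prev word =
      [ maximumBy (comparing fst)
          [ (p * transition prevTag t * emission t word, t : path)
          | (p, path@(prevTag : _)) <- prev
          ]
      | t <- tags
      ]
```

On the toy model, `viterbi ["the", "dog", "barks"]` yields `["DET","NOUN","VERB"]`. Online and n-best tagging, as mentioned above, need more machinery than this single best-path sketch.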
I'm very interested in your progress! Keep us posted :-)

Regards,
Aleks