Re: [Haskell-cafe] NLP libraries and tools?

On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
> On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
>> On 7/6/11 9:27 AM, Dmitri O.Kondratiev wrote:
>>> Hi! Continuing my search for Haskell NLP tools and libs, I wonder if
>>> the following Haskell libraries exist (googling for them does not
>>> help): 1) End-of-Sentence (EOS) detection: break text into a
>>> collection of meaningful sentences.
>> Depending on what you mean, this is either fairly trivial (for
>> English) or an ill-defined problem. For things like determining
>> whether the "." character is intended as a full stop or as part of an
>> abbreviation, it's trivial.
> I disagree. It's not trivial in the sense of being solved; it is
> trivial in the sense that, usually, one would use a list of known
> abbreviations and just compare. That, however, only says that the most
> common approach is trivial, not that the problem is.
Perhaps. I recall David Yarowsky suggesting it was considered solved (for
English, as I qualified earlier). The solution I use is just to look at a
window around the point and run a standard feature-based machine learning
algorithm over it [1]. Memorizing known abbreviations is actually quite
fragile, for the reasons you mention. This approach will give you
accuracy in the high 90s, though I forget the exact numbers.

[1] With obvious features like: whether the following word is
capitalized, whether the preceding word is capitalized, the length of the
preceding word, whether there's another period after the next word, ...
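To make the windowing idea concrete, here is a toy Haskell sketch of the
feature extraction (the feature names and the length-3 threshold are
illustrative only, not taken from any particular implementation):

    import Data.Char (isUpper)

    -- Features for an EOS candidate at position i in a token list, in
    -- the spirit of [1]. These would feed a standard classifier
    -- (maxent, decision tree, ...) trained on labelled data.
    eosFeatures :: [String] -> Int -> [(String, Bool)]
    eosFeatures toks i =
      [ ("next-capitalized",  maybe False startsUpper (toks !? (i + 1)))
      , ("prev-capitalized",  maybe False startsUpper (toks !? (i - 1)))
      , ("prev-short",        maybe False ((<= 3) . length) (toks !? (i - 1)))
      , ("period-after-next", maybe False ('.' `elem`) (toks !? (i + 1)))
      ]
      where
        startsUpper w = not (null w) && isUpper (head w)
        xs !? n
          | n < 0 || n >= length xs = Nothing
          | otherwise               = Just (xs !! n)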
>> But for general sentence breaking, how do you intend to deal with
>> quotations? What about when news articles quote someone uttering a few
>> sentences before the end-quote marker? So far as I'm aware, there's no
>> satisfactory definition of what the solution should be in all
>> reasonable cases. A "sentence" isn't really very well-defined in
>> practice.
> As long as you have one routine and stick to it, you don't need a
> formal definition every linguist will agree on. Computational linguists
> (and their tools), more often than not, just need a dependable
> solution, not a correct one.
But the problem is that what constitutes an appropriate solution for
computational needs is still very ill-defined. For example, the treatment
of quotations will depend on the grammar theory used in the tagger,
parser, translator, etc. The quality of the output is often quite
susceptible to EOS being meaningfully [2] distributed; thus, what
constitutes a "dependable" solution often varies with the task in
question [3].

Also, a lot of the tools in this area assume there's some sort of
punctuation marking the end of sentences, even if it's unreliable as an
EOS indicator. That works well enough for languages with European-like
orthographic traditions, but it falls apart quite rapidly when moving to
East Asian languages (e.g., Burmese, Thai, ...). And languages like
Japanese or Arabic can have "sentences" that go on forever but are best
handled by chunking them into clauses.

[2] In a statistical sense, relative to the structure of the model.

[3] Personally, I think the idea of having a single EOS type is the bulk
of the problem. If we allowed for different kinds of EOS in grammars,
then the upstream tools could handle sentence fragments better, which
would make it easier to make fragment breaking reliable.
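For what it's worth, footnote [3] could be made concrete with something
like the following (the constructor names are invented; this is purely a
sketch of the idea, not a proposal for a specific library):

    -- Typed segment boundaries, so downstream grammars can treat
    -- fragments and clause breaks differently from full stops.
    data Boundary
      = FullStop       -- complete sentence with terminal punctuation
      | ClauseBreak    -- e.g. chunking long Japanese/Arabic sentences
      | FragmentBreak  -- quoted fragments, headlines, list items
      deriving (Eq, Show)

    data Segment = Segment
      { segTokens   :: [String]
      , segBoundary :: Boundary
      } deriving (Eq, Show)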
>> I've been working over the last year+ on an optimized HMM-based POS
>> tagger/supertagger with online tagging and anytime n-best tagging. I'm
>> planning to release it this summer (i.e., by the end of August),
>> though there are a few things I'd like to polish up before doing so.
>> In particular, I want to make the package less monolithic. When I
>> release it I'll make announcements here and on the nlp@ list.
> I'm very interested in your progress! Keep us posted :-)
Will do :)

-- 
Live well,
~wren
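P.S. For readers new to the area: this says nothing about my package's
internals, but the core of HMM tagging is plain Viterbi decoding, which
fits in a screenful of naive Haskell (floor probabilities instead of real
smoothing, no beam search or n-best; all names here are mine):

    import qualified Data.Map.Strict as M
    import Data.List (maximumBy)
    import Data.Ord (comparing)

    type Tag   = String
    type Token = String

    data HMM = HMM
      { initP  :: M.Map Tag Double          -- P(tag at sentence start)
      , transP :: M.Map (Tag, Tag) Double   -- P(next tag | tag)
      , emitP  :: M.Map (Tag, Token) Double -- P(token | tag)
      , tagset :: [Tag]
      }

    -- Floor unseen events instead of doing real smoothing.
    prob :: Ord k => M.Map k Double -> k -> Double
    prob m k = M.findWithDefault 1e-10 k m

    -- Viterbi decoding in log space: per position and tag, keep the
    -- best score and the (reversed) path that achieves it.
    viterbi :: HMM -> [Token] -> [Tag]
    viterbi _   []     = []
    viterbi hmm (w:ws) = reverse (snd (maximumBy (comparing fst) final))
      where
        final = map snd (foldl step start ws)
        start = [ (t, (log (prob (initP hmm) t)
                       + log (prob (emitP hmm) (t, w)), [t]))
                | t <- tagset hmm ]
        step cells w' =
          [ (t', maximumBy (comparing fst)
                   [ (p + log (prob (transP hmm) (t, t'))
                        + log (prob (emitP hmm) (t', w')), t' : path)
                   | (t, (p, path)) <- cells ])
          | t' <- tagset hmm ]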

It's actually a shame we're discussing this on -cafe and not on -nlp.
Then again, maybe it will prompt somebody to join -nlp. I'm gonna CC it
there, because some folks over there might be interested but not read
-cafe.

On Wed, Jul 06, 2011 at 07:22:41PM -0700, wren ng thornton wrote:
> Perhaps. I recall David Yarowsky suggesting it was considered solved
> (for English, as I qualified earlier). The solution I use is just to
> look at a window around the point and run a standard feature-based
> machine learning algorithm over it [1]. Memorizing known abbreviations
> is actually quite fragile, for the reasons you mention. This approach
> will give you accuracy in the high 90s, though I forget the exact
> numbers.
That is indeed one of the best ways to do it (for Indo-European
languages, anyway). You mentioned Arabic as producing sentences that go
on for ages; you don't really need to go that far. I have had the dubious
pleasure of reading Kant and Hegel in their original versions. In German,
it is sometimes still considered good style to write huge sentences. I
once made it a point, just to stick it to a Kant-loving person, to
produce a sentence that spanned two whole pages (A4). It wasn't even
difficult.

I sometimes think we should just adopt a notion of "span" similar to what
rhetorical structure theorists use. In that case, you're not segmenting
sentences but discourse atoms; those are even more ill-defined, however.
> But the problem is that what constitutes an appropriate solution for
> computational needs is still very ill-defined.
Well, yes, and, well, no. Tokens are ill-defined. There's no good
consensus on how you should parse tokens either (i.e., is "in spite of"
one token or three?), so you just pick a definition that works for you.
The same goes for sentence boundaries: they're sometimes also
ill-defined, but who says you need to define them well? Maybe a
purpose-driven definition, one that people can agree on for the task at
hand, is enough. My purpose is either tagging, or parsing, or
NE detection, or computational semantics; in all cases, I'm choosing the
definition my tools can use. Not because it's "correct", but then I don't
really need it to be, do I?

I'm very much a "works for me" person in these matters, mostly because
I'm tired of linguists fighting each other over trivial matters. Give me
something I can work with already!

Regards,
Aleks
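P.S. To make "purpose-driven" concrete, here's a toy sketch (the
multiword list and all names are made up) of a tokenizer that simply
decrees "in spite of" to be one token:

    import Data.List (isPrefixOf)

    -- Multiword units we *choose* to treat as single tokens.
    multiwords :: [[String]]
    multiwords = [["in", "spite", "of"], ["as", "well", "as"]]

    -- Greedily merge known multiword expressions after whitespace
    -- tokenization. Not "correct", just dependable.
    tokenize :: String -> [String]
    tokenize = merge . words
      where
        merge [] = []
        merge ts@(t:rest) =
          case [ mw | mw <- multiwords, mw `isPrefixOf` ts ] of
            (mw:_) -> unwords mw : merge (drop (length mw) ts)
            []     -> t : merge rest

    -- tokenize "he succeeded in spite of everything"
    --   ==> ["he","succeeded","in spite of","everything"]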