
It's actually a shame we're discussing this on -cafe and not on -nlp. Then again, maybe it's going to prompt somebody to join -nlp. I'm gonna CC it there, because some folks over there might be interested but not read -cafe.

On Wed, Jul 06, 2011 at 07:22:41PM -0700, wren ng thornton wrote:
On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
On 7/6/11 9:27 AM, Dmitri O. Kondratiev wrote:
Hi,

Continuing my search for Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling for them does not help):

1) End of Sentence (EOS) Detection: break text into a collection of meaningful sentences.
Depending on how you mean, this is either fairly trivial (for English) or an ill-defined problem. Determining whether the "." character is intended as a full stop or as part of an abbreviation, for instance, is trivial.
I disagree. It's not trivial in the sense that it is solved; it is trivial in the sense that, usually, one would use a list of known abbreviations and just compare. That, however, only says that the most common approach is trivial, not that the problem is.
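(For concreteness, a minimal sketch in Haskell of that list-and-compare approach; the abbreviation set and the splitSentences name are made up for illustration, not taken from any existing library:)

import qualified Data.Set as Set

-- Toy list of known abbreviations; a real one would be much longer and
-- domain-dependent, which is exactly where this approach gets fragile.
knownAbbrevs :: Set.Set String
knownAbbrevs = Set.fromList ["Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc."]

-- Group whitespace-separated tokens into sentences: a token ending in
-- '.', '!' or '?' closes a sentence unless it is a known abbreviation.
splitSentences :: [String] -> [[String]]
splitSentences = go []
  where
    go acc []     = [reverse acc | not (null acc)]
    go acc (t:ts)
      | endsSentence t = reverse (t:acc) : go [] ts
      | otherwise      = go (t:acc) ts
    endsSentence t = not (null t)
                  && last t `elem` ".!?"
                  && t `Set.notMember` knownAbbrevs

(So splitSentences (words "He met Dr. Smith. Then he left.") gives two sentences, but the first abbreviation missing from the list breaks it.)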
Perhaps. I recall David Yarowsky suggesting it was considered solved (for English, as I qualified earlier).
The solution I use is just to look at a window around the point and run a standard feature-based machine learning algorithm over it[1]. Memorizing known abbreviations is actually quite fragile, for reasons you mention. This approach will give you accuracy in the high 90s, though I forget the exact numbers.
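(A sketch, in Haskell, of what such window features might look like; the feature set is my own guess at typical choices, not wren's actual setup, and the classifier itself is left to whatever off-the-shelf learner you prefer:)

import Data.Char (isDigit, isUpper)

-- Features describing a candidate boundary: the token carrying the '.'
-- plus a one-token lookahead window.
data BoundaryFeatures = BoundaryFeatures
  { tokLen       :: Int   -- short tokens are often abbreviations
  , tokHasVowel  :: Bool  -- "Dr.", "Mrs." and friends tend to be vowel-poor
  , tokIsInitial :: Bool  -- single capital plus '.', e.g. a middle initial
  , nextCapital  :: Bool  -- following token starts upper-case
  , nextIsDigit  :: Bool  -- following token starts with a digit
  } deriving (Show)

-- For every token ending in '.', report its position and features so a
-- trained model can label it as sentence-final or not.
boundaryCandidates :: [String] -> [(Int, BoundaryFeatures)]
boundaryCandidates toks =
  [ (i, features t next)
  | (i, (t, next)) <- zip [0..] (zip toks nexts)
  , not (null t), last t == '.'
  ]
  where
    nexts = map Just (drop 1 toks) ++ [Nothing]

features :: String -> Maybe String -> BoundaryFeatures
features t next = BoundaryFeatures
  { tokLen       = length t
  , tokHasVowel  = any (`elem` "aeiouAEIOU") t
  , tokIsInitial = length t == 2 && isUpper (head t)
  , nextCapital  = maybe False (startsWith isUpper) next
  , nextIsDigit  = maybe False (startsWith isDigit) next
  }
  where
    startsWith p s = not (null s) && p (head s)

(The point being that the boundary decision is learned from local context rather than looked up in a fixed abbreviation list.)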
That is indeed one of the best ways to do it (for Indo-European languages, anyway). As for mentioning Arabic for producing sentences that go on for ages: you don't really need to go that far. I have had the dubious pleasure of reading Kant and Hegel in their original versions. In German, it is sometimes still considered good style to write huge sentences. I once made it a point, just to stick it to a Kant-loving person, to produce a sentence that spanned two whole pages (A4). It wasn't even difficult.

I sometimes think that we should just adopt a notion of "span" similar to what rhetorical structure theorists use. In that case, you're not segmenting sentences but discourse atoms; those are even more ill-defined, however.
But the problem is that what constitutes an appropriate solution for computational needs is still very ill-defined.
Well, yes and, well, no. Tokens are ill-defined. There's no good consensus on how you should parse tokens (e.g., is "in spite of" one token or three?) either, so you just pick a definition that works for you. Same for sentence boundaries: they're sometimes also ill-defined, but who says you need to define them well? Maybe a purpose-driven definition, one that people can agree on, is enough. My purpose is either tagging, or parsing, or NE detection, or computational semantics; in all cases, I'm choosing the definition my tools can use. Not because that's "correct," but then I don't really need it to be, do I?

I'm very much a "works for me" person in these matters, mostly because I'm tired of linguists fighting each other over trivial matters. Give me something I can work with already!

Regards,
Aleks