
It's actually a shame we're discussing this on -cafe and not on -nlp. Then again, maybe it's going to prompt somebody to join -nlp. I'm gonna CC it there, because some folks over there might be interested but not read -cafe.

On Wed, Jul 06, 2011 at 07:22:41PM -0700, wren ng thornton wrote:
On 7/6/11 5:58 PM, Aleksandar Dimitrov wrote:
On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
On 7/6/11 9:27 AM, Dmitri O. Kondratiev wrote:
Hi,

Continuing my search for Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling for them does not help):

1) End of Sentence (EOS) Detection: break text into a collection of meaningful sentences.
Depending on how you mean, this is either fairly trivial (for English) or an ill-defined problem. Determining whether the "." character is intended as a full stop or as part of an abbreviation, for instance, is trivial.
I disagree. It's not trivial in the sense that it is solved; it is trivial in the sense that, usually, one would use a list of known abbreviations and just compare. That, however, only says that the most common approach is trivial, not that the problem is.
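(For concreteness, a minimal sketch in Haskell of that list-and-compare approach; the abbreviation set and the splitSentences name are made up for illustration, not taken from any existing library:)

import qualified Data.Set as Set

-- Toy list of known abbreviations; a real one would be much longer and
-- domain-dependent, which is exactly where this approach gets fragile.
knownAbbrevs :: Set.Set String
knownAbbrevs = Set.fromList ["Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc."]

-- Group whitespace-separated tokens into sentences: a token ending in
-- '.', '!' or '?' closes a sentence unless it is a known abbreviation.
splitSentences :: [String] -> [[String]]
splitSentences = go []
  where
    go acc []     = [reverse acc | not (null acc)]
    go acc (t:ts)
      | endsSentence t = reverse (t:acc) : go [] ts
      | otherwise      = go (t:acc) ts
    endsSentence t = not (null t)
                  && last t `elem` ".!?"
                  && t `Set.notMember` knownAbbrevs

(So splitSentences (words "He met Dr. Smith. Then he left.") gives two sentences, but the first abbreviation missing from the list breaks it.)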
Perhaps. I recall David Yarowsky suggesting it was considered solved (for English, as I qualified earlier).
The solution I use is just to look at a window around the point and run a standard feature-based machine learning algorithm over it[1]. Memorizing known abbreviations is actually quite fragile, for reasons you mention. This approach will give you accuracy in the high 90s, though I forget the exact numbers.
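(A sketch, in Haskell, of what such window features might look like; the feature set is my own guess at typical choices, not wren's actual setup, and the classifier itself is left to whatever off-the-shelf learner you prefer:)

import Data.Char (isDigit, isUpper)

-- Features describing a candidate boundary: the token carrying the '.'
-- plus a one-token lookahead window.
data BoundaryFeatures = BoundaryFeatures
  { tokLen       :: Int   -- short tokens are often abbreviations
  , tokHasVowel  :: Bool  -- "Dr.", "Mrs." and friends tend to be vowel-poor
  , tokIsInitial :: Bool  -- single capital plus '.', e.g. a middle initial
  , nextCapital  :: Bool  -- following token starts upper-case
  , nextIsDigit  :: Bool  -- following token starts with a digit
  } deriving (Show)

-- For every token ending in '.', report its position and features so a
-- trained model can label it as sentence-final or not.
boundaryCandidates :: [String] -> [(Int, BoundaryFeatures)]
boundaryCandidates toks =
  [ (i, features t next)
  | (i, (t, next)) <- zip [0..] (zip toks nexts)
  , not (null t), last t == '.'
  ]
  where
    nexts = map Just (drop 1 toks) ++ [Nothing]

features :: String -> Maybe String -> BoundaryFeatures
features t next = BoundaryFeatures
  { tokLen       = length t
  , tokHasVowel  = any (`elem` "aeiouAEIOU") t
  , tokIsInitial = length t == 2 && isUpper (head t)
  , nextCapital  = maybe False (startsWith isUpper) next
  , nextIsDigit  = maybe False (startsWith isDigit) next
  }
  where
    startsWith p s = not (null s) && p (head s)

(The point being that the boundary decision is learned from local context rather than looked up in a fixed abbreviation list.)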
That is indeed one of the best ways to do it (for Indo-European languages, anyway). As for mentioning Arabic for producing sentences that go on for ages: you don't really need to go that far. I have had the dubious pleasure of reading Kant and Hegel in their original versions. In German, it is sometimes still considered good style to write huge sentences. I once made it a point, just to stick it to a Kant-loving person, to produce a sentence that spanned two whole pages (A4). It wasn't even difficult.

I sometimes think that we should just adopt a notion of "span" similar to what rhetorical structure theorists use. In that case, you're not segmenting sentences but discourse atoms; those are even more ill-defined, however.
But the problem is that what constitutes an appropriate solution for computational needs is still very ill-defined.
Well, yes and, well, no. Tokens are ill-defined. There's no good consensus on how you should parse tokens (e.g., is "in spite of" one token or three?) either, so you just pick a definition that works for you. Same for sentence boundaries: they're sometimes also ill-defined, but who says you need to define them well? Maybe a purpose-driven definition, one that people can agree on, is enough. My purpose is either tagging, or parsing, or NE detection, or computational semantics; in all cases, I'm choosing the definition my tools can use. Not because that's "correct," but then I don't really need it to be, do I?

I'm very much a "works for me" person in these matters, mostly because I'm tired of linguists fighting each other over trivial matters. Give me something I can work with already!

Regards,
Aleks