
On 7/6/11 9:27 AM, Dmitri O. Kondratiev wrote:
Hi, continuing my search for Haskell NLP tools and libraries, I wonder whether the following Haskell libraries exist (googling for them does not help): 1) End-of-Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
Depending on what you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs. part of an abbreviation, that's trivial. But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.
2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
There are numerous approaches to this problem; do you care about the solution, or will any one of them suffice? I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
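(To give a flavour of the HMM approach, here is a minimal sketch of bigram Viterbi decoding. This is not the package described above; the model shape and all names are illustrative, and a real tagger would train its probabilities on a corpus and compute in log space to avoid underflow.)

    import Data.List (maximumBy)
    import Data.Ord  (comparing)

    type Tag   = String
    type Token = String

    -- Toy bigram HMM; illustrative only.
    data HMM = HMM
      { tags  :: [Tag]
      , start :: Tag -> Double          -- P(tag at sentence start)
      , trans :: Tag -> Tag -> Double   -- P(current tag | previous tag)
      , emit  :: Tag -> Token -> Double -- P(token | tag)
      }

    -- Viterbi decoding: the single most probable tag sequence.
    viterbi :: HMM -> [Token] -> [Tag]
    viterbi _   []     = []
    viterbi hmm (w:ws) = reverse (snd (best (foldl step initial ws)))
      where
        -- For each tag, the best (probability, reversed path) ending in it.
        initial = [ (t, (start hmm t * emit hmm t w, [t])) | t <- tags hmm ]
        step prev w' =
          [ (t, maximumBy (comparing fst)
                  [ (p * trans hmm t' t * emit hmm t w', t : path)
                  | (t', (p, path)) <- prev ])
          | t <- tags hmm ]
        best = maximumBy (comparing fst) . map snd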
3) Chunking. Analyze each tagged token within a sentence and assemble compound tokens that express logical concepts. Define a custom grammar.
4) Extraction. Analyze each chunk and further tag the chunks as named entities, such as people, organizations, locations, etc.
Any ideas where to look for similar Haskell libraries?
I don't know of any work in these areas in Haskell (though I'd love to hear about it). You should try asking on the nlp@ list where the other linguists and NLPers are more likely to see it. -- Live well, ~wren

On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton wrote:
On 7/6/11 9:27 AM, Dmitri O. Kondratiev wrote:
Hi, continuing my search for Haskell NLP tools and libraries, I wonder whether the following Haskell libraries exist (googling for them does not help): 1) End-of-Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
Depending on what you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs. part of an abbreviation, that's trivial.
But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.
I am looking for a Haskell implementation of a sentence tokenizer such as the one described by Tibor Kiss and Jan Strunk in "Unsupervised Multilingual Sentence Boundary Detection", which is implemented in NLTK: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
There are numerous approaches to this problem; do you care about the solution, or will any one of them suffice?
I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
I am looking for an existing, working POS tagging framework that can be customized for different pidgin languages.
3) Chunking. Analyze each tagged token within a sentence and assemble compound tokens that express logical concepts. Define a custom grammar.
4) Extraction. Analyze each chunk and further tag the chunks as named entities, such as people, organizations, locations, etc.
Any ideas where to look for similar Haskell libraries?
I don't know of any work in these areas in Haskell (though I'd love to hear about it). You should try asking on the nlp@ list where the other linguists and NLPers are more likely to see it.
I will, though nlp@projects.haskell.org looks very quiet...

On Wed, Jul 06, 2011 at 11:04:30PM +0400, Dmitri O. Kondratiev wrote:
On Wed, Jul 6, 2011 at 8:32 PM, wren ng thornton wrote: [...]
I will, though nlp@projects.haskell.org looks very quiet...
Quiet, yes, but, hey, we all start out... never mind, humans start out loud. Anyhow: it's quiet, but it's gotta start somewhere. I wouldn't hold my breath for a full-scale Haskell-native solution to your problem just yet, though.

Here's what I'm doing: I use external programs to do the heavy lifting where there aren't Haskell programs, Haskell (where applicable) to do the logic, and shell scripts to glue everything together. So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags, named entities and whatnot, spit out some annotated format, read it in with Haskell, and do the logic/magic there.

Complicated, yes, but it gets me around having to code too much in Java. That's a gain if I've ever seen one. Regards, Aleks
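(To illustrate the glue: a minimal sketch of shelling out to an external annotator from Haskell. The command name and flag below are placeholders, not a real UIMA or OpenNLP invocation; substitute whatever your pipeline actually provides.)

    import System.Process (readProcess)

    -- Feed raw text to a hypothetical command-line annotator and get
    -- back one annotated token per line. "my-annotator" and "--tokens"
    -- are stand-ins for your actual tool and its flags.
    annotate :: String -> IO [String]
    annotate rawText = fmap lines (readProcess "my-annotator" ["--tokens"] rawText)

    main :: IO ()
    main = do
      annotated <- annotate "Dr. Smith arrived. He was late."
      mapM_ putStrLn annotated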

On Wed, Jul 6, 2011 at 3:03 PM, Aleksandar Dimitrov wrote:
So you'd use, say, UIMA+OpenNLP to do sentence boundaries, tokens, tags, named entities and whatnot, spit out some annotated format, read it in with Haskell, and do the logic/magic there.
Have you used that particular combination yet? I'd like to know the details of how you hooked everything together if that's something you can share. (We're working on a similar Frankenstein at the moment.) --Rogan

On Wed, Jul 06, 2011 at 03:14:07PM -0700, Rogan Creswick wrote:
Have you used that particular combination yet? I'd like to know the details of how you hooked everything together if that's something you can share. (We're working on a similar Frankenstein at the moment.)
These Frankensteins, as you so dearly call them, are always very task-specific. Here's a setup I've used:

- Take whatever corpus you want to work with and annotate it with, say, Java tools. This will probably require you to massage the input corpus into something your tools can read, and then call the tools to process it.
- Have your Java stuff write everything to disk in a format you can easily read in with Haskell. If your annotations are non-interleaving, you're lucky: you can probably use a word-per-line format with columns for the markup, which is trivial to read in with Haskell (see the sketch after this message). More complicated structures should probably be handled XML-fashion. I like HXT for reading XML, but it's slow (as are its competitors), although it's been a while since I've used it; maybe it supports Text or ByteStrings by now.
- Advanced mode: instead of dumping to files, use named pipes or TCP sockets to transfer the data. Good luck!

Shell scripting comes in *very* handy here for automating the process. Mind you, everything I've done so far is only *research*, not a finished product that end users can poke at on their desktop and expect to work interactively. For that, it might be useful to have some sort of standing server architecture: multiple annotation servers (one running in Java, one running in Haskell) that pass the data between them. At that point, though, the benefits might be outweighed by the drawbacks; my love for Haskell only goes so far.

One hint, if you ever find yourself reading quantitative linguistic data with Haskell: forget lazy IO. Forget strict IO too, unless your documents are never bigger than a few hundred megs. And if you're not keeping the whole document in memory but are keeping parts of it around, don't keep them as ByteStrings; use Text or SmallString instead (a ByteString slice retains its entire underlying buffer, so you'll invariably leak space in this scenario). Learn how to use iteratees, and use them judiciously.

Keep in touch on the Haskell NLP list :-) Regards, Aleks
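(As a concrete example of the word-per-line case: a minimal sketch that strictly reads a tab-separated token/POS/NE file with Data.Text. The three-column layout and the file name are assumptions; adjust both to whatever your annotator emits. For files too big for memory, you'd switch to the streaming approach mentioned above.)

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.Text    as T
    import qualified Data.Text.IO as TIO

    -- One annotated token: surface form, POS tag, named-entity label.
    -- The three-column layout is an assumption; adjust to your annotator.
    data Row = Row { form :: T.Text, pos :: T.Text, ne :: T.Text }
      deriving Show

    -- Parse one tab-separated line; Nothing for malformed lines.
    parseRow :: T.Text -> Maybe Row
    parseRow line = case T.splitOn "\t" line of
      [w, p, n] -> Just (Row w p n)
      _         -> Nothing

    main :: IO ()
    main = do
      contents <- TIO.readFile "annotated.tsv"  -- strict read of the whole file
      let rows = [ r | Just r <- map parseRow (T.lines contents) ]
      print (length rows)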

On 7/07/2011, at 7:04 AM, Dmitri O. Kondratiev wrote:
I am looking for a Haskell implementation of a sentence tokenizer such as the one described by Tibor Kiss and Jan Strunk in "Unsupervised Multilingual Sentence Boundary Detection", which is implemented in NLTK.
That method is multilingual, but it relies on the text being written using fairly modern Western conventions, and it tackles the problem of "too many dots": not knowing which periods are abbreviation points and which are full stops. I don't suppose anyone knows of something that might help with the opposite problem, too few dots? Run-on sentences are one example.
I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
One of the issues I've had with a POS tagger I've been using is that it makes some really stupid decisions which can be patched up with a few simple rules, but since it's distributed as a .jar file I cannot add those rules.
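(That sort of patching is easy when the tagger's output is plain data you can post-process. A minimal sketch of the idea; the rules here are made-up examples, not fixes for any particular tagger:)

    type Token = String
    type Tag   = String

    -- Made-up fix-up rules: each one may override a single token's tag.
    fixTag :: (Token, Tag) -> (Token, Tag)
    fixTag (w, t)
      | w `elem` ["Yahoo!", "Amazon.com"] = (w, "NNP")  -- force proper noun
      | w == "that" && t == "WDT"         = (w, "IN")   -- illustrative override
      | otherwise                         = (w, t)

    -- Run the rules over a whole tagged sentence.
    postCorrect :: [(Token, Tag)] -> [(Token, Tag)]
    postCorrect = map fixTag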

On Wed, Jul 06, 2011 at 09:32:27AM -0700, wren ng thornton wrote:
On 7/6/11 9:27 AM, Dmitri O. Kondratiev wrote:
Hi, continuing my search for Haskell NLP tools and libraries, I wonder whether the following Haskell libraries exist (googling for them does not help): 1) End-of-Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
Depending on what you mean, this is either fairly trivial (for English) or an ill-defined problem. For things like determining whether the "." character is intended as a full stop vs. part of an abbreviation, that's trivial.
I disagree. It's not trivial in the sense that it is solved; it is trivial in the sense that, usually, one would use a list of known abbreviations and just compare. That, however, just says that the most common approach is trivial, not that the problem is. There are cases where, for example, an abbreviation and a full stop coincide, and there you'll often need full-blown parsing, or at least a well-trained maxent classifier. There are other problems too: ordinals, acronyms that themselves contain periods, weird names (like Yahoo!), and initials, to name a few. And this is only for English and similar languages, mind you.
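(The common list-based approach, as a minimal Haskell sketch. The abbreviation list is a tiny stand-in, and the heuristic fails on exactly the cases listed above:)

    import qualified Data.Set as S
    import Data.Char (isUpper)

    -- A tiny stand-in list; real systems use far larger ones, or learn
    -- the list from the corpus as Kiss & Strunk do.
    abbreviations :: S.Set String
    abbreviations = S.fromList ["Dr.", "Mr.", "Mrs.", "etc.", "e.g.", "i.e."]

    -- Is this token a sentence boundary? A final '.', '!' or '?' counts
    -- only when the token is not a known abbreviation and the next
    -- token (if any) starts with an uppercase letter.
    endsSentence :: String -> String -> Bool
    endsSentence tok next =
      not (null tok)
        && last tok `elem` ".!?"
        && not (tok `S.member` abbreviations)
        && (null next || isUpper (head next))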
But for general sentence breaking, how do you intend to deal with quotations? What about when news articles quote someone uttering a few sentences before the end-quote marker? So far as I'm aware, there's no satisfactory definition of what the solution should be in all reasonable cases. A "sentence" isn't really very well-defined in practice.
As long as you have one routine and stick to it, you don't need a formal definition that every linguist will agree on. Computational linguists (and their tools), more often than not, just need a dependable solution, not a correct one.
2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
There are numerous approaches to this problem; do you care about the solution, or will any one of them suffice?
I've been working over the last year+ on an optimized HMM-based POS tagger/supertagger with online tagging and anytime n-best tagging. I'm planning to release it this summer (i.e., by the end of August), though there are a few things I'd like to polish up before doing so. In particular, I want to make the package less monolithic. When I release it I'll make announcements here and on the nlp@ list.
I'm very interested in your progress! Keep us posted :-) Regards, Aleks
participants (5):
- Aleksandar Dimitrov
- Dmitri O. Kondratiev
- Richard O'Keefe
- Rogan Creswick
- wren ng thornton