NLP libraries and tools?

Hi,
Please advise on NLP libraries similar to the Natural Language Toolkit (www.nltk.org). First of all I need:
- tools to construct a 'bag of words' (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words in the article
- tools to prune common words, such as prepositions and conjunctions, as well as extremely rare words, such as the ones with typos
- stemming tools
- a Naive Bayes classifier
- an SVM classifier
- k-means clustering
Thanks!

On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev wrote:
> Hi, Please advise on NLP libraries similar to the Natural Language Toolkit
There is a (slowly?) growing NLP community for Haskell over at: http://projects.haskell.org/nlp/ The nlp mailing list may be a better place to ask for details. To the best of my knowledge, most of the NLTK / OpenNLP capabilities have yet to be implemented/ported to Haskell, but there are some packages to take a look at on Hackage.
> First of all I need:
> - tools to construct 'bag of words' (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words in the article.
This is trivially implemented if you have a natural language tokenizer you're happy with. Toktok might be worth looking at: http://hackage.haskell.org/package/toktok but I *think* it takes a pretty simple view of tokens (assuming it is the tokenizer I've been using within GF). Eric Kow (?) has a tokenizer implementation, which I can't seem to find at the moment - if I recall correctly, it is also very simple, but it would be a great place to implement a more complex tokenizer :)
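As a sketch of how little is needed once tokenization is settled, a bag of words in Haskell might look like this (the whitespace-and-punctuation tokenizer below is a placeholder assumption, not toktok's behaviour):

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Map as Map

-- A bag of words is a multiset: token -> count.
type Bag = Map.Map String Int

-- Placeholder tokenizer: split on whitespace, keep alphabetic
-- characters, lowercase. Swap in a real tokenizer as needed.
bagOfWords :: String -> Bag
bagOfWords text =
  Map.fromListWith (+)
    [ (tok, 1)
    | w <- words text
    , let tok = map toLower (filter isAlpha w)
    , not (null tok) ]
```

For example, `bagOfWords "The cat saw the dog."` counts "the" twice and every other word once.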
> - tools to prune common words, such as prepositions and conjunctions, as well as extremely rare words, such as the ones with typos.
I'm not sure what you mean by 'prune'. Are you looking for a stopword list to remove irrelevant / confusing words from something like a search query? (that's not hard to do with a stemmer and a set)
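If 'prune' does mean dropping stopwords and rare words, a set plus a frequency threshold is about all it takes. A minimal sketch, assuming a bag-of-words `Map` of counts (the stopword list here is a tiny illustrative sample, not a real one):

```haskell
import Data.Char (toLower)
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Tiny illustrative stopword sample; real lists have hundreds of entries.
stopwords :: Set.Set String
stopwords = Set.fromList ["the", "a", "an", "and", "or", "of", "in", "on"]

-- Drop stopwords (common function words) and words occurring fewer
-- than minCount times (often typos).
prune :: Int -> Map.Map String Int -> Map.Map String Int
prune minCount = Map.filterWithKey keep
  where
    keep w n = n >= minCount
            && not (Set.member (map toLower w) stopwords)
```

For example, `prune 2` on counts `[("the",5),("cat",3),("typo",1)]` keeps only `("cat",3)`.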
> - stemming tools
There is an implementation of the Porter stemmer on Hackage: http://hackage.haskell.org/package/porter
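For a feel of the interface such a tool exposes, here is a deliberately crude suffix stripper. This is *not* the Porter algorithm and is wrong on many words; use the porter package for real work:

```haskell
import Data.List (isSuffixOf)

-- Crude suffix stripping for illustration only: try a few common
-- suffixes in order, strip the first one that leaves a stem of
-- reasonable length. The real Porter algorithm is far more careful.
naiveStem :: String -> String
naiveStem w = go ["ing", "edly", "ed", "es", "s"]
  where
    go [] = w
    go (suf : rest)
      | suf `isSuffixOf` w && length w > length suf + 2 =
          take (length w - length suf) w
      | otherwise = go rest
```

For example, `naiveStem "walking"` gives `"walk"`, while short words like `"sing"` pass through unchanged.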
> - Naive Bayes classifier
I'm not aware of a general-purpose Bayesian classifier library for Haskell, but it *would* be great to have :) There are probably some general-purpose statistical packages that I'm unaware of that offer a larger set of capabilities...
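To illustrate what such a library might offer, a minimal multinomial Naive Bayes with add-one (Laplace) smoothing can be sketched in a few lines (all names here are invented for the example, and the vocabulary estimate is crude):

```haskell
import qualified Data.Map as Map

-- (label, word) -> count over the training corpus.
type Counts = Map.Map (String, String) Int

-- Count label frequencies and per-label word frequencies.
train :: [(String, [String])] -> (Map.Map String Int, Counts)
train docs =
  ( Map.fromListWith (+) [ (l, 1) | (l, _) <- docs ]
  , Map.fromListWith (+) [ ((l, w), 1) | (l, ws) <- docs, w <- ws ] )

-- Pick the label maximizing log P(label) + sum of log P(word|label),
-- with add-one smoothing so unseen words don't zero out a label.
classify :: (Map.Map String Int, Counts) -> [String] -> String
classify (labelCounts, wordCounts) ws =
  snd (maximum [ (score l, l) | l <- Map.keys labelCounts ])
  where
    total = sum (Map.elems labelCounts)
    vocab = Map.size wordCounts  -- crude stand-in for vocabulary size
    score :: String -> Double
    score l =
      let nl = Map.findWithDefault 0 l labelCounts
          nw = sum [ c | ((l', _), c) <- Map.toList wordCounts, l' == l ]
      in log (fromIntegral nl / fromIntegral total)
         + sum [ log ( (fromIntegral (Map.findWithDefault 0 (l, w) wordCounts) + 1)
                     / fromIntegral (nw + vocab) )
               | w <- ws ]
```

Trained on two toy documents, `classify` then picks the label whose smoothed word counts best match the query tokens.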
> - SVM classifier
There are a few of these. Take a look at the AI category on Hackage:
- http://hackage.haskell.org/packages/archive/pkg-list.html#cat:ai

--Rogan
> - k-means clustering
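k-means went unanswered above, but the core algorithm is tiny. A fixed-iteration Lloyd's sketch (illustrative only; real code should seed centres sensibly and iterate to convergence):

```haskell
import Data.List (minimumBy, transpose)
import Data.Ord (comparing)
import qualified Data.Map as Map

type Point = [Double]

-- Squared Euclidean distance (no sqrt needed for nearest-centre tests).
dist2 :: Point -> Point -> Double
dist2 xs ys = sum [ (x - y) ^ (2 :: Int) | (x, y) <- zip xs ys ]

centroid :: [Point] -> Point
centroid ps = map (/ fromIntegral (length ps)) (map sum (transpose ps))

-- One Lloyd iteration: assign each point to its nearest centre,
-- then recompute each centre as the mean of its cluster.
step :: [Point] -> [Point] -> [Point]
step centres ps = [ centroid cl | cl <- Map.elems clusters ]
  where
    nearest p = minimumBy (comparing (dist2 p)) centres
    clusters  = Map.fromListWith (++) [ (nearest p, [p]) | p <- ps ]

-- Run a fixed number of iterations from the given initial centres.
kmeans :: Int -> [Point] -> [Point] -> [Point]
kmeans n centres ps = iterate (`step` ps) centres !! n
```

For example, with points clustered around the origin and around (10,10), three iterations from those rough centres settle on the two cluster means.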

On Fri, Jul 1, 2011 at 9:34 PM, Rogan Creswick wrote:
> On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev wrote:
>> First of all I need:
>> ...
>> - tools to construct 'bag of words' (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words in the article.
> This is trivially implemented if you have a natural language tokenizer you're happy with.
> Toktok might be worth looking at: http://hackage.haskell.org/package/toktok but I *think* it takes a pretty simple view of tokens (assuming it is the tokenizer I've been using within GF).
Unfortunately 'cabal install' fails with toktok:

Building toktok-0.5...
[1 of 7] Compiling Toktok.Stack ( Toktok/Stack.hs, dist/build/Toktok/Stack.o )
[2 of 7] Compiling Toktok.Sandhi ( Toktok/Sandhi.hs, dist/build/Toktok/Sandhi.o )
[3 of 7] Compiling Toktok.Trie ( Toktok/Trie.hs, dist/build/Toktok/Trie.o )
[4 of 7] Compiling Toktok.Lattice ( Toktok/Lattice.hs, dist/build/Toktok/Lattice.o )
[5 of 7] Compiling Toktok.Transducer ( Toktok/Transducer.hs, dist/build/Toktok/Transducer.o )
[6 of 7] Compiling Toktok.Lexer ( Toktok/Lexer.hs, dist/build/Toktok/Lexer.o )
[7 of 7] Compiling Toktok ( Toktok.hs, dist/build/Toktok.o )
Registering toktok-0.5...
[1 of 1] Compiling Main ( Main.hs, dist/build/toktok/toktok-tmp/Main.o )
Linking dist/build/toktok/toktok ...
[1 of 1] Compiling Main ( tools/ExtractLexicon.hs, dist/build/gf-extract-lexicon/gf-extract-lexicon-tmp/Main.o )

tools/ExtractLexicon.hs:5:35:
    Module `PGF' does not export `getLexicon'
cabal: Error: some packages failed to install:
toktok-0.5 failed during the building phase. The exception was: ExitFailure 1

Any ideas how to solve this?

On Fri, Jul 1, 2011 at 12:38 PM, Dmitri O.Kondratiev wrote:
> Unfortunately 'cabal install' fails with toktok:
> tools/ExtractLexicon.hs:5:35:
>     Module `PGF' does not export `getLexicon'
> cabal: Error: some packages failed to install:
> toktok-0.5 failed during the building phase. The exception was: ExitFailure 1
> Any ideas how to solve this?

Oh, right - I ran into this problem too, and forgot about it (I should have reported a bug...) I think this fails because of (relatively) recent changes in GF, which isn't constrained to specific versions in the toktok cabal file...

--Rogan

On Fri, Jul 1, 2011 at 11:58 PM, Rogan Creswick wrote:
> Oh, right - I ran into this problem too, and forgot about it (I should have reported a bug...) I think this fails because of (relatively) recent changes in GF, which isn't constrained to specific versions in the toktok cabal file...
> --Rogan
Is there any Haskell word tokenizer other than 'toktok' that compiles and works? I need something like: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctT...
Thanks!

On Fri, Jul 1, 2011 at 2:52 PM, Dmitri O.Kondratiev wrote:
> Is there any Haskell word tokenizer other than 'toktok' that compiles and works? I need something like: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.regexp.WordPunctT...
I don't think this exists out of the box, but since it appears to be a basic regex tokenizer, you could use Data.List.Split to create one (or one of the regex libraries may be able to do this more simply). If you go the Data.List.Split route, I suspect you'll want to create a Splitter based on the whenElt Splitter: http://hackage.haskell.org/packages/archive/split/0.1.1/doc/html/Data-List-S... which takes a function from an element to a Bool (which you can implement however you wish, possibly with a regular expression, although it will have to be pure). If you want something like a maxent tokenizer, then you're currently out of luck :( (as far as I know).

--Rogan
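Data.List.Split is one route; the WordPunctTokenizer behaviour (roughly the regex \w+|[^\w\s]+) can also be approximated with nothing but base, by grouping adjacent characters of the same class and dropping whitespace runs. A rough sketch, not a drop-in replacement for NLTK's tokenizer:

```haskell
import Data.Char (isAlphaNum, isSpace)
import Data.Function (on)
import Data.List (groupBy)

data CharClass = WordC | PunctC | SpaceC deriving (Eq)

classify :: Char -> CharClass
classify c
  | isAlphaNum c || c == '_' = WordC   -- roughly regex \w
  | isSpace c                = SpaceC
  | otherwise                = PunctC

-- Group runs of same-class characters, then drop the whitespace runs:
-- "can't." -> ["can","'","t","."]
wordPunctTokenize :: String -> [String]
wordPunctTokenize =
  filter (not . any isSpace) . groupBy ((==) `on` classify)
```

Each output token is a maximal run of word characters or a maximal run of punctuation, which matches what the NLTK tokenizer produces on simple input.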

Hi,
Continuing my search of Haskell NLP tools and libs, I wonder if the following Haskell libraries exist (googling for them does not help):
1) End-of-Sentence (EOS) Detection. Break text into a collection of meaningful sentences.
2) Part-of-Speech (POS) Tagging. Assign part-of-speech information to each token.
3) Chunking. Analyze each tagged token within a sentence and assemble compound tokens that express logical concepts. Define a custom grammar.
4) Extraction. Analyze each chunk and further tag the chunks as named entities, such as people, organizations, locations, etc.
Any ideas where to look for similar Haskell libraries?
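For (1), a naive splitter shows both how far simple rules go and why a real library is wanted. The sketch below breaks after '.', '!' or '?' followed by whitespace, so it mishandles abbreviations ("Dr.") and anything else a serious EOS detector must treat specially:

```haskell
import Data.Char (isSpace)

-- Naive end-of-sentence detection: a sentence ends at '.', '!' or '?'
-- followed by whitespace (or end of input). Illustration only; real
-- EOS detection must handle abbreviations, decimals, quotes, etc.
sentences :: String -> [String]
sentences = go []
  where
    go acc [] = emit acc []
    go acc (c : cs)
      | c `elem` ".!?" && (null cs || isSpace (head cs)) =
          emit (c : acc) (go [] (dropWhile isSpace cs))
      | otherwise = go (c : acc) cs
    emit acc rest
      | null s    = rest
      | otherwise = s : rest
      where s = reverse acc
```

For example, `sentences "It rained. We left."` yields two sentences, but `sentences "Dr. Smith arrived."` wrongly splits after "Dr." — exactly the case a library should get right.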

Hi,
Also a library for string normalization, in the sense of stripping diacritical marks, would be handy too. Does anything in this respect exist that would be usable from Haskell?
Thanks

On Fri, Jul 01, 2011 at 02:31:34PM +0400, Dmitri O.Kondratiev wrote:
> Hi, Please advise on NLP libraries similar to the Natural Language Toolkit (www.nltk.org). First of all I need:
> - tools to construct 'bag of words' (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words in the article.
> - tools to prune common words, such as prepositions and conjunctions, as well as extremely rare words, such as the ones with typos.
> - stemming tools
> - Naive Bayes classifier
> - SVM classifier
> - k-means clustering
> Thanks!
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On Sun, Jul 10, 2011 at 12:59 PM, ivan vadovic wrote:
> Hi,
> Also a library for string normalization in the sense of stripping diacritical marks would be handy too. Does anything in this respect exist that would be usable from Haskell?

The closest thing I know of is this: http://hackage.haskell.org/package/text-icu You still have to install ICU separately; that library is just a binding for working with it from Haskell.

Jason
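With text-icu the usual recipe is to normalize to NFD (decomposed form) and then drop the combining marks. Base's Data.Char can do the second step on text that is already decomposed — a partial sketch only, since base itself cannot decompose precomposed characters like 'é':

```haskell
import Data.Char (isMark)

-- Drops combining marks (Unicode categories Mn/Mc/Me). Only works on
-- text that is already in decomposed (NFD) form; use text-icu's
-- normalization to get there first.
stripMarks :: String -> String
stripMarks = filter (not . isMark)
```

For example, on the decomposed string `"Jose\769"` (an 'e' followed by U+0301 COMBINING ACUTE ACCENT) this yields `"Jose"`, but it leaves a precomposed 'é' untouched.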
participants (4):
- Dmitri O.Kondratiev
- ivan vadovic
- Jason Dagit
- Rogan Creswick