
On Fri, Jul 1, 2011 at 3:31 AM, Dmitri O.Kondratiev
Hi, Please advise on NLP libraries similar to Natural Language Toolkit
There is a (slowly?) growing NLP community for haskell over at: http://projects.haskell.org/nlp/ The nlp mailing list may be a better place to ask for details. To the best of my knowledge, most of the NLTK / OpenNLP capabilities have yet to be implemented/ported to Haskell, but there are some packages to take a look at on Hackage.
First of all I need: - tools to construct 'bag of words' (http://en.wikipedia.org/wiki/Bag_of_words_model), which is a list of words in the article.
This is trivially implemented if you have a natural language tokenizer you're happy with. Toktok might be worth looking at: http://hackage.haskell.org/package/toktok but I *think* it takes a pretty simple view of tokens (assume it is the tokenizer I've been using within the GF). Eric Kow (?) has a tokenizer implementation, which I can't seem to find at the moment - if I recall correctly, it is also very simple, but it would be a great place to implement a more complex tokenizer :)
- tools to prune common words, such as prepositions and conjunctions, as well as extremely rare words, such as the ones with typos.
I'm not sure what you mean by 'prune'. Are you looking for a stopword list to remove irrelevant / confusing words from something like a search query? (that's not hard to do with a stemmer and a set)
- stemming tools
There is an implementation of the porter stemmer on Hackage: - http://hackage.haskell.org/package/porter
- Naive Bayes classifier
I'm not aware of a general-purpose bayesian classifier lib. for haskell, but it *would* be great to have :) There are probably some general-purpose statistical packages that I'm unaware of that offer a larger set of capabilities...
- SVM classifier
There are a few of these. Take a look at the AI category on hackage: - http://hackage.haskell.org/packages/archive/pkg-list.html#cat:ai --Rogan
- k-means clustering