Mathematics and Statistics libraries

I am a student currently interested in participating in Google Summer of Code. I have a strong interest in Haskell and a semester's worth of coding experience in the language. I am a mathematics and CS double major with only a semester left, and I am looking for information about what the community is lacking as far as mathematics and statistics libraries are concerned. If there is enough interest, I would like to put together a project around this. I understand that such libraries are probably low priority, but if anyone has any suggestions I would love to hear them. Thanks for reading, -Benjamin

I think such libraries are high priority!
My own experience with them is not deep, but I'll echo what I think is a
common observation:
- Matrix libraries are good
- Statistics libs need more work
And as far as wrappers around machine learning or computer vision libs
(openCV)... I'm not really sure about the status of those.
On Wed, Mar 21, 2012 at 1:24 PM, Ben Jones
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 3/21/12 3:00 PM, Ryan Newton wrote:
I would also be very excited about a solid statistics proposal. The ticket Aleksey links to is a good start (as is the experience report linked from there), although I think it would be possible to implement a core library with less type trickery than he supposes. Such an interface wouldn't necessarily be perfectly statically safe, but other, trickier interfaces could be built on top of it (just as we have fancier type-level interfaces with statically checked dimensions on top of lower-level matrix libs, etc.).
I envision a set of tools that lets users get up and running with loading a dump of data and calculating a set of metrics on it in only a few lines. It should be designed so that the basic framework is easily extensible with various other analyses, and so that analyses compose fairly straightforwardly. Which indeed amounts to some Frame-type structure, and a core set of functions on it :-) --g
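A rough sketch of the kind of interface I mean — assuming the vector package, and with Frame, Metric, and summarize all hypothetical names rather than an existing library:

```haskell
import qualified Data.Vector.Unboxed as U

-- Hypothetical minimal "frame": named columns of doubles.
type Frame = [(String, U.Vector Double)]

-- A metric is a named reduction over one column.
type Metric = (String, U.Vector Double -> Double)

mean :: U.Vector Double -> Double
mean v = U.sum v / fromIntegral (U.length v)

-- Compute every metric for every column.
summarize :: [Metric] -> Frame -> [(String, String, Double)]
summarize ms frame =
  [ (col, name, f v) | (col, v) <- frame, (name, f) <- ms ]

main :: IO ()
main = mapM_ print $
  summarize [("mean", mean), ("max", U.maximum)]
            [("x", U.fromList [1,2,3]), ("y", U.fromList [10,20,30])]
```

Adding an analysis is just adding another pair to the list, which is the sort of straightforward composition I have in mind.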

I'd like to see more statistics work, definitely. Bryan's statistics
library is excellent, but Ed Kmett has been talking about some very
interesting approaches to sampling from complicated distributions, which
I'd like to see implemented eventually in a library.
On Wed, Mar 21, 2012 at 1:24 PM, Ben Jones

On 21.03.2012 21:24, Ben Jones wrote:
There is an existing statistics-related GSoC project [1]. It proposes implementing an analogue of R's data frames. I think it's rather difficult, since there is no obvious design; I also think the implementation will require a lot of type trickery. [1] http://hackage.haskell.org/trac/summer-of-code/ticket/1596

If the goal is to help Haskell be a more acceptable choice for general
statistical analysis tasks, then hmatrix, statistics, and the various
gsl wrappers already provide the majority of the functionality needed.
I think the bigger problems are that there is no guidance on which
libraries are industrial strength, there's no glue layer making it
easier to use the APIs you'd want, and GHCi isn't always ideal as a
repl for this workflow.
If you're interested in UI work, ideally we'd have something similar
to RStudio as an environment: a simple set of windows encapsulating an
editor, a repl, a plotting panel, and help/history. This sounds
superficial, but it really has an impact when you're exploring a data
set and trying stuff out. However, it would be a bigger contribution
to get us to the point where we can just "import Quant.Prelude" to
bring into scope all the standard functionality assumed in an
environment like R or Matlab. In my experience most of this can come
from re-exporting existing libraries, while occasionally wrapping
functions to simplify the interfaces and make them more consistent
(e.g., a quant doesn't particularly need to know why
Statistics.Sample.KernelDensity.kde uses unboxed vectors when the rest
of that lib uses Generic, and they certainly won't want to spend their
time remembering that they need to convert before calling that function).
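The sort of wrapper I mean is tiny; here is a sketch of the pattern, with a stand-in someLibFn rather than the real kde (which lives in the statistics package):

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Generic as G
import qualified Data.Vector.Unboxed as U

-- Stand-in for a library function that, like kde, insists on
-- unboxed vectors while the rest of its library uses Generic.
someLibFn :: U.Vector Double -> Double
someLibFn = U.sum

-- The glue layer: accept any vector flavour and convert once,
-- so callers never have to remember the representation mismatch.
wrapped :: G.Vector v Double => v Double -> Double
wrapped = someLibFn . G.convert

main :: IO ()
main = print (wrapped (V.fromList [1, 2, 3]))  -- a boxed vector works fine
```

A Quant.Prelude would mostly consist of re-exports plus small adapters like this.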
As an exercise, in GHCi, try loading a few arbitrary CSV files of
tables including floating-point columns, do a linear regression of one
such column on another, and then display a scatterplot with the
regression line; maybe throw in a check for the normality of the
residuals. Assume you'll need to handle large data sets, so you need
to use bytestring, attoparsec, etc. Beware that there's a known bug
that will cause a segfault/bus error if you use some hmatrix/gsl
functions from GHCi on x86_64, which is kind of a blocker in itself.
Maybe I missed something obvious, but it took me a long time to figure
out which containers, persistence + parsing, stats and plotting
packages I should choose.
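The regression step of that exercise is itself small; a self-contained least-squares fit, on plain lists for clarity (a real version would use unboxed vectors):

```haskell
-- Ordinary least squares for y = a + b*x.
linfit :: [Double] -> [Double] -> (Double, Double)
linfit xs ys = (a, b)
  where
    n  = fromIntegral (length xs)
    mx = sum xs / n
    my = sum ys / n
    b  = sum [ (x - mx) * (y - my) | (x, y) <- zip xs ys ]
       / sum [ (x - mx) ^ (2 :: Int) | x <- xs ]
    a  = my - b * mx

main :: IO ()
main = print (linfit [1,2,3,4] [3,5,7,9])  -- data lies exactly on y = 1 + 2x
```

The hard part of the exercise is everything around this: parsing, plotting, and knowing which packages to reach for.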
I really disagree that we need a data-frame-type structure; they're an
abomination in R, where they try to accommodate both event records and
time series and do neither well. Haskell records are fine for
inhomogeneous event series, and for homogeneous time series parallel
Vectors or Matrices are better, as they can be passed to BLAS and
LAPACK with consequent performance and clarity advantages. Column-
oriented storage rocks, and Haskell is already a good fit.
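Concretely, the two layouts look like this (a sketch with made-up Trade/Trades types, assuming the vector package):

```haskell
import qualified Data.Vector.Unboxed as U

-- Row-oriented: one record per event; fine for inhomogeneous data.
data Trade = Trade { time :: !Double, price :: !Double, size :: !Int }

-- Column-oriented: parallel unboxed vectors for a homogeneous series.
-- Each column is a contiguous buffer that BLAS/LAPACK bindings can use.
data Trades = Trades
  { times  :: !(U.Vector Double)
  , prices :: !(U.Vector Double)
  , sizes  :: !(U.Vector Int)
  }

-- Whole-column operations never touch the fields they don't need.
vwap :: Trades -> Double
vwap t =
  U.sum (U.zipWith (*) (prices t) (U.map fromIntegral (sizes t)))
    / fromIntegral (U.sum (sizes t))

main :: IO ()
main = print (vwap (Trades (U.fromList [0, 1])
                           (U.fromList [10, 20])
                           (U.fromList [1, 3])))
```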
Having used C++, Matlab and R (the latter for quite a while), I now
use Haskell for all of my statistical analysis work. Despite the many
shortcomings it's definitely worth it for the code clarity and type
checking, to say nothing of the pre-optimization performance and
robustness.
Best of luck, happy to share some preliminary code with you directly
if you're interested!
Tom
On 21 March 2012 17:24, Ben Jones

Tom Doris wrote:
If you're interested in UI work, ideally we'd have something similar to RStudio as an environment, a simple set of windows encapsulating an editor, a repl, a plotting panel and help/history, this sounds superficial but it really has an impact when you're exploring a data set and trying stuff out.
Concerning UI, the following project suggestion aims to give GHCi a web GUI: http://hackage.haskell.org/trac/summer-of-code/ticket/1609
But one of your criteria is that a good UI should come with a help system, too, right?
Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Hi Heinrich,
If we compare the GHCi experience with R or IPython, leaving aside any
GUIs, the help system they have at the repl level is just a lot more
intuitive and easy to use, and you get access to the full manual
entries. For example, compare what you see if you type :info sort into
GHCi versus ?sort in R. R gives you a view of the full docs for the
function, whereas in GHCi you just get the type signature.
I usually :def a command to call out to ":!hoogle --info %", which
gives what you'd expect :info to give. So, as is usually the case,
there's a solution in Haskell that matches the features of other
systems, but it's not the default and you have to invest effort
getting it set up right. That's fine for Haskell devs who do some
stats work, but it represents an off-puttingly steep learning curve
for quants who are willing to learn a little Haskell but expect
(reasonably) that basic stuff like inline help will Just Work.
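For reference, one way to wire this up (a hypothetical :doc macro in ~/.ghci; my actual definition may differ slightly):

```
-- in ~/.ghci: make :doc shell out to hoogle for full documentation
:def doc (\s -> return (":!hoogle --info " ++ s))
-- then at the prompt:  :doc sort
```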
Tom
On 25 March 2012 08:26, Heinrich Apfelmus

On 25.03.2012 14:52, Tom Doris wrote:
Hi Heinrich,
If we compare the GHCi experience with R or IPython, leaving aside any GUIs, the help system they have at the repl level is just a lot more intuitive and easy to use, and you get access to the full manual entries. For example, compare what you see if you type :info sort into GHCi versus ?sort in R. R gives you a view of the full docs for the function, whereas in GHCi you just get the type signature.
Integrating haddock documentation into GHCi would be really helpful, but it's a GSoC project on its own.
For me the most important difference between R's repl and GHCi is that :reload wipes all local bindings. Effectively it forces you to write everything in a file and to avoid doing anything that couldn't fit into a one-liner. That may not be bad, but it's definitely a different style.
And of course data visualization. The only library I know of is Chart [1], but I don't like its API much.
I think talking about data frames is a bit pointless unless we specify what a data frame is. Basically there are two representations of a tabular data structure: an array of tuples or a tuple of arrays. If you want the first, go for Data.Vector.Vector YourData. If you want the second, you'll probably end up with some HList-like data structure to hold the arrays.
[1] http://hackage.haskell.org/package/Chart
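The two representations side by side (a sketch assuming the vector package; Obs is a made-up record type):

```haskell
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U

data Obs = Obs { xval :: !Double, label :: !String }

-- "Array of tuples": one boxed vector of whole records.
rowStore :: V.Vector Obs
rowStore = V.fromList [Obs 1.0 "a", Obs 2.0 "b"]

-- "Tuple of arrays": one column per field; numeric columns unbox.
-- With many columns this tuple grows, hence the pull toward
-- HList-like structures to hold the arrays.
colStore :: (U.Vector Double, V.Vector String)
colStore = (U.fromList [1.0, 2.0], V.fromList ["a", "b"])

main :: IO ()
main = print (U.sum (fst colStore), xval (V.head rowStore))
```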

Hey All,
There are actually a number of issues that come up with an effective data-frame-like library for Haskell, and with data vis as well (both of which I have some strong personal opinions on, and which I'm exploring / experimenting with this spring). While folks have touched on a bunch of them, I thought I'd put my own opinions into the mix.
First of all: any good data manipulation (i.e. data-frame-like) library needs support for efficiently querying subsets of the data in various ways. Not just that, it really should provide a coherent way of dealing with out-of-core data! From there you might want to ask: "do I want to iterate through chunks of the data?" or "do I want to allow more general patterns of data access, and perhaps even ways to parallelize?". Basically (as others have remarked since this draft email got underway), you do essentially want to support some SQL-like selection operations, have them be efficient, and have them play nice with columns of differing types.
What sort of abstractions you provide is somewhat crucial, because that in turn affects how you can write algorithms! If you look closely, this is tantamount to saying that any sufficiently well designed (industrial-grade) data frame lib for Haskell might wind up leading into a model for supporting mapreduce- or graphlab-style (http://graphlab.org/) algorithms in the multicore / non-distributed regime, though a first version would pragmatically just provide an interface with sequentially chunked data and use pipes-core or one of the other enumerator libraries. There's also some need for the aforementioned fancy types for managing data, but that's not even the real challenge (in my opinion). Probably the best lib to take ideas from is the Python Pandas library, or at least that's my personal opinion.
Now in the space of data vis, probably the best example of a good library in terms of ease of getting informative (and pretty) output is ggplot2 (also in R). If you look there, you'll see that it's VERY much integrated with the model fitting and data analysis functionality of R, and has a very compositional approach which could be ported pretty directly over to Haskell.
However, as with a good data-frame-like, certain obstacles come up, partly because if we insist on a type-safe way to do things while being at least as high level as R or Python, the absence of row types for frame column names makes specifying linear models that are statically well formed (as in, only referencing column names that are actually in the underlying data frame) a bit tricky. While there are approaches that work some of the time, there's not really a good general-purpose way (as far as I can tell) to solve that small problem of resolving names as early as possible. Or at the very least I don't see a simple approach that I'm happy with.
These can be summarized, I think, as follows:
Any "practical" data frame lib needs to interact well with out-of-core data, and ideally also simplify the task of writing algorithms on top in a way that gives out-of-core goodness for free. There are a lot of ways this could be done under the covers, perhaps using one of the libraries like reducers, enumerator, or pipes-core, but it really should be invisible to the client algorithm's author, or at least invisible by default. Moreover, I think any attack in that direction is essentially a precursor to sorting out map-reduce and graphlab-like tools for Haskell.
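A toy version of that "invisible by default" idea — client code sees only a fold over chunks and never learns whether they came from memory or disk (a sketch assuming the vector package; all names are hypothetical):

```haskell
import Data.List (foldl')
import qualified Data.Vector.Unboxed as U

-- The only access pattern a client algorithm gets: a strict fold
-- over chunks. A real library would stream chunks from disk via an
-- enumerator/pipes-style producer instead of a plain list.
foldChunks :: (acc -> U.Vector Double -> acc) -> acc -> [U.Vector Double] -> acc
foldChunks = foldl'

-- Client algorithm: a mean that is out-of-core "for free" because it
-- only ever holds one chunk plus a constant-size accumulator.
meanOOC :: [U.Vector Double] -> Double
meanOOC chunks = s / fromIntegral n
  where
    (s, n) = foldChunks step (0, 0 :: Int) chunks
    step (a, c) v = (a + U.sum v, c + U.length v)

main :: IO ()
main = print (meanOOC [U.fromList [1, 2], U.fromList [3, 4]])
```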
Any really nice high-level data vis tool needs some data analysis / machine learning library that it works with, and this is probably best understood by looking at things already out there, such as ggplot2 in R.
That said, I'm all ears for other folks' takes on this, especially since I'm spending some time this spring experimenting in both these directions.
cheers
-Carter
On Sun, Mar 25, 2012 at 9:54 AM, Aleksey Khudyakov

Tom Doris
If you're interested in UI work, ideally we'd have something similar to RStudio as an environment, a simple set of windows encapsulating an editor, a repl, a plotting panel and help/history, this sounds superficial but it really has an impact when you're exploring a data set and trying stuff out.
I agree, this sounds really nice.
I really disagree that we need a data frame type structure; they're an abomination in R, they try to accommodate event records and time series, and do neither well.
Just to clarify (since I think the original suggestion was mine), I don't want to copy R's data frame (which I never quite understood, anyway), but I'd like some standardized data structure, ideally with an option to label columns, and functions to slice and join. The underlying structure can just be a list of columns (Vector) or whatever. -k -- If I haven't seen further, it is by standing in the footprints of giants

On 26/03/2012, at 8:35 PM, Ketil Malde wrote:
Just to clarify (since I think the original suggestion was mine), I don't want to copy R's data frame (which I never quite understood, anyway)
A data.frame is
- a record of vectors, all the same length
- which can be sliced and diced like a 2d matrix
It's not unlike an SQL table (think of a column-oriented data base, where a table is really a collection of named columns but it _looks_ like a collection of rows).
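That description translates almost directly into Haskell (a sketch with hypothetical Frame/sliceRows/select names, assuming the vector package and Double columns for brevity):

```haskell
import qualified Data.Vector as V

-- A record of equal-length columns, addressable like a 2d table.
data Frame = Frame { names :: [String], cols :: [V.Vector Double] }

-- Slice rows [i, i+n) out of every column at once.
sliceRows :: Int -> Int -> Frame -> Frame
sliceRows i n (Frame ns cs) = Frame ns (map (V.slice i n) cs)

-- Keep only the named columns, SQL-style.
select :: [String] -> Frame -> Frame
select ks (Frame ns cs) =
  let keep = [ (nm, c) | (nm, c) <- zip ns cs, nm `elem` ks ]
  in Frame (map fst keep) (map snd keep)

main :: IO ()
main =
  let f = Frame ["x", "y"] [V.fromList [1,2,3,4], V.fromList [5,6,7,8]]
      g = sliceRows 1 2 (select ["y"] f)
  in print (names g, map V.toList (cols g))
```

A single-column-type frame dodges the hard design question (mixed column types), which is where the HList-style trickery discussed upthread comes in.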
participants (10)
- Aleksey Khudyakov
- Ben Jones
- Carter Tazio Schonwald
- Daniel Peebles
- Gershom Bazerman
- Heinrich Apfelmus
- Ketil Malde
- Richard O'Keefe
- Ryan Newton
- Tom Doris