Mining Twitter data in Haskell and Clojure

14 Jun 2010

      I'm computing a communication graph from Twitter data and then scan it
daily to allocate social capital to nodes behaving  in a good karmic
manner.  The graph is culled from 100 million tweets and has about 3
million nodes. First I wrote the simulation of the 35 days of data in
Clojure and then translated it into Haskell with great help from the
glorious #haskell folks.  I had to add -A5G -K5G to make it work.  It
does 10 days OK hovering at 57 GB of RAM; I include profiling of that
in sc10days.prof.

At first the Haskell executable goes faster than Clojure, not by an
order of magnitude, but by 2-3 times per day simulated.  (Clojure also
fits well in its 32 GB JVM with compressed references.)  However,
Haskell gets stuck after a while, and for good.  Clearly I'm not doing
Haskell optimally here, and would appreciate optimization advice.
Here's the code:

http://github.com/alexy/husky

The data and problem description is in

http://github.com/alexy/husky/blob/master/Haskell-vs-Clojure-Twitter.md

-- also referred from the main README.md.

The main is in sc.hs, and the algorithm is in SocRun.hs.  The original
Clojure is in socrun.clj.  This is a continuation of active Twitter
research and the results will be published, and I'd really like to
make Haskell work at this scale and beyond!  The seq's sprinkled
already did no good.  I ran under ghc 6.10 with -O2 with or without -
fvia-C, with no difference in stallling, and am working to bring 6.12
to bear now.

Cheers,
Alexy

John Van Enk

Nicolas Pouillard

wren ng thornton

wren ng thornton

Johan Tibell

Ron de Bruijn

tags

participants (15)