
On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
I can't reproduce the space leak here. I tried Aleksander's original code, my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy bytestring version.
Unfortunately, I can't post the actual corpus here, because it's copyrighted. But there are plenty of ways to retrieve large amounts of data from the Internet. See below.
f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
       I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty
Neat, folding the specialised function into the enumeratee is nifty! You can tell that my experience with iteratee/enumerator amounts to only a day's worth so far :-\
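In case it helps with reproducing, here is roughly the harness I've been using to run f' over a file. It's a minimal sketch, not my real program: the Wordcounts alias, the T qualifier (I just use Data.Map here) and the modules I pull enumLinesBS and fileDriver from are assumptions on my side.

+++ wordcount.hs
-- Minimal harness for f' (sketch: Wordcounts and the T qualifier are
-- stand-ins for whatever map type the real program uses; I'm assuming
-- enumLinesBS comes from Data.Iteratee.Char and fileDriver from
-- Data.Iteratee.IO).
import qualified Data.ByteString.Char8 as S
import qualified Data.Iteratee as I
import Data.Iteratee.Char (enumLinesBS)
import Data.Iteratee.IO (fileDriver)
import qualified Data.Map as T
import System.Environment (getArgs)

type Wordcounts = T.Map S.ByteString Int

f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
       I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty

main :: IO ()
main = do
  [file] <- getArgs              -- e.g. the all_text.txt produced below
  counts <- fileDriver f' file   -- run the iteratee over the whole file
  print (T.size counts)          -- report the number of distinct words
+++

Compile with ghc -O2 and point it at the all_text.txt that the script below produces.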
None of these leak space for me (all compiled with ghc-7.0.3 -O2). Performance was pretty comparable for every version, although Aleksander's original did seem to have a very small edge.
How big were your input corpora?
Here's an absolutely evil shell script that is going to make me donate money to
Project Gutenberg. It will gather a corpus that is still very small, but at
least realistic in its distributional properties.
+++ scrape.sh
#!/bin/sh
# Pull a batch of plain-text books from Project Gutenberg and tokenise
# them into one word per line, appending everything to $textfile.
textfile=all_text.txt
touch $textfile
text=0
size=0
for i in `seq 10 300`; do
    wget -q "http://www.gutenberg.org/files/$i/$i.zip"
    if [ -f $i.zip ]; then
        unzip -qq $i.zip
        tr -sc '[[:alpha:]]' '\n' < $i.txt >> $textfile
        text=`dc -e "$text 1 + p"`
        size=`du -h $textfile | cut -f 1`
        rm -f $i.zip $i.txt
    fi
    echo -n "\rFound $text of $i texts, total $size."
done
echo "\rFound $text texts, total $size"
+++
It'll take a while to run, and the resulting corpus is just a fraction of what
I'm using, but it serves well to illustrate the problem. If you want, you can
increase the amount of data gathered simply by widening the numerical range
given to seq. (If you make it gobble up more bandwidth, it might be polite
to put a sleep somewhere in the inner loop. I'd host the resulting file myself,
but I don't know whether there are any conditions attached to doing that.)
If you did not tweak the script, it should have gathered some 100MB of data.
Running my unmodified original program on that corpus, htop records 320MB of
RAM usage towards the end (*without* profiling enabled).
So it seems that I can't get rid of a factor of around 3x the input file size.
Luckily, the dependency seems to be linear. Here's some profiling:
I'd be happy to help track down a space leak in iteratee, but for now I'm not seeing one.
Thank you for your offer! Maybe I'm just seeing ghosts, and there is no space leak. But I really do think that the program is eating too much RAM.

Regards,
Aleks