
I have an awkward programming problem -- I need to take a dictionary, parse it, build a bunch of intermediate lists and then make maps and tries out of the list. A "programming problem" because it's taken me a fair amount of effort to pull together the parser and list generator -- and "awkward" because a 69000 item list, [(String, [(String, String)])], does not compile under GHC (stack overflow). (It's not likely to compile under anything else, either!)

Members of #haskell have urged me to load the offending material from a file; indeed, it only takes ten seconds to parse the dictionary and build the lists, sort them and dump them back out again -- but it offends my sensibilities. I don't want to wait ten seconds to load my dictionary every time.

I could use the FFI, then I can make the trie and lists all in C. That'd work great. My list likely uses too much RAM now, anyways.

I'm considering one other option though -- I wonder if I can build large constants in GHC Core? If anybody has tried it -- or found some other way to make big huge constants in Haskell -- I would sure like to know about it.

On another note, I am extremely curious about the difference between statically compiling a list and building it at runtime. I find it hard to wrap my head around the fact that I can build the list at runtime in a short time, but can not compile it without eating all of my machine's RAM. Is it due to laziness, maybe? Well, no -- because I subject the whole list to a sort. It's not just streaming records in from one IO handle and out another.

-- _jsn

jason.dusek:
I have an awkward programming problem -- I need to take a dictionary, parse it, build a bunch of intermediate lists and then make maps and tries out of the list. A "programming problem" because it's taken me a fair amount of effort to pull together the parser and list generator -- and "awkward" because a 69000 item list, [(String, [(String, String)])], does not compile under GHC (stack overflow). (It's not likely to compile under anything else, either!)
Members of #haskell have urged me to load the offending material from a file; indeed, it only takes ten seconds to parse the dictionary and build the lists, sort them and dump them back out again -- but it offends my sensibilities. I don't want to wait ten seconds to load my dictionary every time.
I could use the FFI, then I can make the trie and lists all in C. That'd work great. My list likely uses too much RAM now, anyways.
I'm considering one other option though -- I wonder if I can build large constants in GHC Core? If anybody has tried it -- or found some other way to make big huge constants in Haskell -- I would sure like to know about it.
You can build large constant bytestrings, fwiw. They turn into an Addr#, and GHC will leave them alone. See, e.g., this regex test suite, which has *lots* of overloaded bytestring literals (via the OverloadedStrings pragma): http://code.haskell.org/~dons/code/pcre-light/tests/Unit.hs This is approximately the same approach as Alex (the lexer generator) takes with its lexing tables, which are stored in an unboxed string literal. -- Don
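A minimal sketch of the kind of literal Don describes; the module name and the table contents here are invented for illustration:

{-# LANGUAGE OverloadedStrings #-}
module Literal where

import qualified Data.ByteString.Char8 as B

-- With OverloadedStrings, this literal is compiled down to a plain C
-- string (an Addr#) and only packed into a ByteString when demanded;
-- the simplifier leaves the literal itself alone.
table :: B.ByteString
table = "alpha 1\nbeta 2\ngamma 3\n"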

Don Stewart
You can build large constant bytestrings, fwiw. They turn into an Addr#, and GHC will leave them alone.
Well, for my particular problem -- I guess I could align all the elements of the lists, and then build the trie and maps from the ByteStrings at startup. -- _jsn

jason.dusek:
Don Stewart wrote:
You can build large constant bytestrings, fwiw. They turn into an Addr#, and GHC will leave them alone.
Well, for my particular problem -- I guess I could align all the elements of the lists, and then build the trie and maps from the ByteStrings at startup.
You could pull them out with Data.Binary too, which would make constructing them a little faster. -- Don
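A sketch of what that might look like, assuming the dictionary list type from the original post and an invented file name:

import Data.Binary (encodeFile, decodeFile)

type Dict = [(String, [(String, String)])]

-- One-off step: after parsing the text dictionary, dump the
-- intermediate list in Data.Binary's format.
dumpDict :: Dict -> IO ()
dumpDict = encodeFile "dictionary.bin"

-- At startup: read the binary dump back, skipping the text parser.
loadDict :: IO Dict
loadDict = decodeFile "dictionary.bin"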

Jason Dusek jason.dusek@gmail.com:
I have an awkward programming problem -- I need to take a dictionary, parse it, build a bunch of intermediate lists and then make maps and tries out of the list. A "programming problem" because it's taken me a fair amount of effort to pull together the parser and list generator -- and "awkward" because a 69000 item list, [(String, [(String, String)])], does not compile under GHC (stack overflow).
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse.

With data files:

- The program starts up more slowly.
- The code is more complex.
- There may be some hassles with laziness.
- Distributing the executable is not as simple.

Data.Binary eases the irritation somewhat.

Jay

jay:
Jason Dusek jason.dusek@gmail.com:
I have an awkward programming problem -- I need to take a dictionary, parse it, build a bunch of intermediate lists and then make maps and tries out of the list. A "programming problem" because it's taken me a fair amount of effort to pull together the parser and list generator -- and "awkward" because a 69000 item list, [(String, [(String, String)])], does not compile under GHC (stack overflow).
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse.
With data files:
- The program starts up more slowly.
- The code is more complex.
- There may be some hassles with laziness.
- Distributing the executable is not as simple.
Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)? -- Don

Don Stewart dons@galois.com:
jay:
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
That didn't occur to me, since neither of my large constants includes strings.... I think you're suggesting that each constant could appear in the source as a long bytestring and be deserialized into the data structure. If that works, it should improve the startup time, but it's still not as nice as simply compiling it straight up. I'll try it. Jay

Bryan O'Sullivan
The trick I usually use in cases like this is to compile the data as C code and link against it, then access it from Haskell via a Ptr.
For my particular application, I really need to ship a single static binary that has it all -- data as well as algorithms -- so I'm going with the FFI. It's too bad that I end up working in the IO monad much of the time. I hope we'll have massive static constants someday soon! -- _jsn

Jason Dusek jason.dusek@gmail.com:
Bryan O'Sullivan wrote:
The trick I usually use in cases like this is to compile the data as C code and link against it, then access it from Haskell via a Ptr.
For my particular application, I really need to ship a single static binary that has it all -- data as well as algorithms -- so I'm going with the FFI. It's too bad that I end up working in the IO monad much of the time. I hope we'll have massive static constants someday soon!
unsafePerformIO should be safe on constants, right? It has worked for me, at least. Jay
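For example (a sketch, with an invented file name), a top-level CAF built this way behaves like a constant:

import qualified Data.Map as M
import Data.Binary (decodeFile)
import System.IO.Unsafe (unsafePerformIO)

-- The file is read at most once, the first time the value is demanded.
-- NOINLINE keeps GHC from duplicating the unsafePerformIO call.
dictionary :: M.Map String [(String, String)]
dictionary = unsafePerformIO (decodeFile "dictionary.bin")
{-# NOINLINE dictionary #-}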

On Sat, Mar 1, 2008 at 12:24 PM, Jay Scott wrote:
unsafePerformIO should be safe on constants, right? It has worked for me, at least.
Unfortunately, reading in the list does not allow me to ship a single binary. It's not file reading, though, that forces me into the IO Monad -- it's pointer dereferencing. -- _jsn

jay:
Don Stewart dons@galois.com:
jay:
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
That didn't occur to me, since neither of my large constants includes strings.... I think you're suggesting that each constant could appear in the source as a long bytestring and be deserialized into the data structure. If that works, it should improve the startup time, but it's still not as nice as simply compiling it straight up.
I'll try it.
Here's an example, which stores a Data.Map in a gzip-compressed bytestring literal (a C
string literal in the compiled code). The Map is reconstructed on
startup.
{-# LANGUAGE OverloadedStrings #-}
import Data.Binary
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as S
import Data.ByteString.Lazy
import Codec.Compression.GZip
--
-- this is a gzip compressed literal bytestring, storing a binary-encoded Data.Map
--
mytable =
"\US\139\b\NUL\NUL\NUL\NUL\NUL\NUL\ETXEN\219\SO\194 \f\197\224\188\196\CAN\227\US\224\171~\NAKc\GS4ce\161`\178\191\215(\176\190\180\167\231\210\n\241\171\203\191\ti\157\217\149\249< \ENQ\214\&9>\202\162\179a\132X\233\ESC=\231\215\164\SYN\157\DC2D\226*\146\174o\t\167\DLE\209\"i_\240\193\129\199

Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay:
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
That didn't occur to me, since neither of my large constants includes strings.... I think you're suggesting that each constant could appear in the source as a long bytestring and be deserialized into the data structure. If that works, it should improve the startup time, but it's still not as nice as simply compiling it straight up.
I'll try it.
Here's an example, which stores a Data.Map in a gzip-compressed bytestring literal (a C string literal in the compiled code). The Map is reconstructed on startup.
{-# LANGUAGE OverloadedStrings #-}
import Data.Binary
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as S
import Data.ByteString.Lazy
import Codec.Compression.GZip

--
-- this is a gzip compressed literal bytestring, storing a binary-encoded Data.Map
--
mytable =
"\US\139\b\NUL\NUL\NUL\NUL\NUL\NUL\ETXEN\219\SO\194 \f\197\224\188\196\CAN\227\US\224\171~\NAKc\GS4ce\161`\178\191\215(\176\190\180\167\231\210\n\241\171\203\191\ti\157\217\149\249< \ENQ\214\&9>\202\162\179a\132X\233\ESC=\231\215\164\SYN\157\DC2D\226*\146\174o\t\167\DLE\209\"i_\240\193\129\199

main = print =<< M.lookup "ghc" m
  where
    -- build the table from the bytestring:
    m :: M.Map String (Maybe String)
    m = decode . decompress . fromChunks . return $ mytable
Running it:
$ ./A
Just "dinosaur!"
:)
Important to use a bytestring, since that gets compiled to a C string literal (and not messed with by the simplifier).
-- Don

Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay:
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
I finally squeezed enough time to try it, and it didn't work for me.

ghc Overflow.hs
[1 of 1] Compiling Overflow ( Overflow.hs, Overflow.o )
Overflow.hs:8:10:stack overflow: use +RTS -K<size> to increase it

where Overflow.hs is in the vicinity of 40M and looks like

{-# LANGUAGE OverloadedStrings #-}
module Overflow where
import qualified Data.ByteString.Lazy as S
bigData :: S.ByteString
bigData = "\0\0\0\0\0\5\67\195\0\0\0\0...

I didn't compress it, because Codec.Compression.GZip didn't compile for me. It looked like a library change since 6.6 broke it.

Is there a handy string escaping function in the libraries somewhere? It only took a minute to write one, and I spent longer than that looking, so maybe it's the wrong question.... Surely it's in there somewhere, and I'm just 2 dum 2 c.

Jay

jay:
Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay:
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
I finally squeezed enough time to try it, and it didn't work for me.
ghc Overflow.hs
[1 of 1] Compiling Overflow ( Overflow.hs, Overflow.o )
Enable optimisations! Compile with ghc -O2. You need this to avoid having a very slow pack call at runtime.
Overflow.hs:8:10:stack overflow: use +RTS -K<size> to increase it

where Overflow.hs is in the vicinity of 40M and looks like

{-# LANGUAGE OverloadedStrings #-}
module Overflow where
import qualified Data.ByteString.Lazy as S
bigData :: S.ByteString
bigData = "\0\0\0\0\0\5\67\195\0\0\0\0...
I didn't compress it, because Codec.Compression.GZip didn't compile for me. It looked like a library change since 6.6 broke it.
Probably you don't have the zlib.h header? Or make sure you have the latest version of zlib from hackage -- it does work.
Is there a handy string escaping function in the libraries somewhere? It only took a minute to write one, and I spent longer than that looking, so maybe it's the wrong question.... Surely it's in there somewhere, and I'm just 2 dum 2 c.
The show function?
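A sketch of how that fits Jay's use case (file names invented): show on a String yields a valid, fully escaped Haskell string literal, so the big module can be generated directly:

import qualified Data.ByteString.Char8 as B

-- Generate a module containing the data as one escaped string literal.
main :: IO ()
main = do
  bytes <- B.readFile "table.bin"
  writeFile "Overflow.hs" $ unlines
    [ "{-# LANGUAGE OverloadedStrings #-}"
    , "module Overflow where"
    , "import qualified Data.ByteString.Char8 as S"
    , "bigData :: S.ByteString"
    , "bigData = " ++ show (B.unpack bytes)
    ]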

Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay:
I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
I finally squeezed enough time to try it, and it didn't work for me.
ghc Overflow.hs
[1 of 1] Compiling Overflow ( Overflow.hs, Overflow.o )
Enable optimisations! Compile with ghc -O2. You need this to avoid having a very slow pack call at runtime.
Yes, I tried basic variations like that. The result is the same with -O1 or with -O2, and with Data.ByteString or Data.ByteString.Lazy.
I'm just 2 dum 2 c.
The show function?
Ha ha! Jay

jay:
Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay:
Don Stewart dons@galois.com:
jay: I also have constants that are too large to compile. I am resigned to loading them from data files--other solutions seem even worse. ... Data.Binary eases the irritation somewhat.
Did you try bytestring literals (and maybe parsing them in-memory with Data.Binary)?
I finally squeezed enough time to try it, and it didn't work for me.
ghc Overflow.hs
[1 of 1] Compiling Overflow ( Overflow.hs, Overflow.o )
Enable optimisations! Compile with ghc -O2. You need this to avoid having a very slow pack call at runtime.
Yes, I tried basic variations like that. The result is the same with -O1 or with -O2, and with Data.ByteString or Data.ByteString.Lazy.
Ok, hmm, that really shouldn't be the case. Do you have the example available somewhere? It's just a 40M inline bytestring? -- Don

Is it _possible_ to use Template Haskell to take the name of the external binary file and produce such a bytestring literal?
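It is possible; here is a rough sketch of the idea, with invented module and function names (the file-embed package on Hackage later packaged up the same technique). The file is read at compile time via runIO and spliced in as an ordinary string literal:

{-# LANGUAGE TemplateHaskell #-}
module EmbedFile (embedFile) where

import qualified Data.ByteString.Char8 as B
import Language.Haskell.TH
import Language.Haskell.TH.Syntax (runIO)

-- Read the file while the *calling* module is being compiled and
-- splice its bytes in as a string literal, repacked into a
-- ByteString at the use site.
embedFile :: FilePath -> Q Exp
embedFile path = do
  bytes <- runIO (B.readFile path)
  [| B.pack $(litE (stringL (B.unpack bytes))) |]

-- At a use site (in another module, also with TemplateHaskell enabled):
--   bigData :: B.ByteString
--   bigData = $(embedFile "dictionary.bin")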

| On another note, I am extremely curious about the difference
| between statically compiling a list and building it at
| runtime. I find it hard to wrap my head around the fact that I
| can build the list at runtime in a short time, but can not
| compile it without eating all of my machine's RAM.

It's not quite as stupid as it sounds. The run-time version is like an *interpreter*: you write an interpreter (a parser, in fact) that interprets the byte-strings you read from disk, and builds in-heap data. The compile-time version is, well, a compiler, and generates statically-allocated data. This may be a lot bigger than the interpreter, of course, and it all has to be stuffed through all the compiler phases.

That should certainly be possible, but it's just something that GHC is not well optimised for. We are actively working on the more egregious manifestations though: see http://hackage.haskell.org/trac/ghc/ticket/2002. Anyway, have another go when you see that #2002 is fixed.

Simon

jason.dusek:
I have an awkward programming problem -- I need to take a dictionary, parse it, build a bunch of intermediate lists and then make maps and tries out of the list. A "programming problem" because it's taken me a fair amount of effort to pull together the parser and list generator -- and "awkward" because a 69000 item list, [(String, [(String, String)])], does not compile under GHC (stack overflow). (It's not likely to compile under anything else, either!)
Here's an example of the approach Bryan outlined, which does seem to work for files as large as gcc can handle:

* generate your big Haskell Map
* serialise it with Data.Binary and Codec.Compression.GZip to a file
* compile the data into a C const array, and link that into Haskell
* decode it on startup, resurrecting the Haskell data.

The C source looks like:

const uint8_t beowulf[] = { 31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 124, 189, 75, 150, 46, 54, 93, 193, 96, 144, 241, 168, 172, 238, 214, 0, ...

http://code.haskell.org/~dons/code/compiled-constants/cbits/constants.c

which is the gzip, Data.Binary encoded version of a Map ByteString Int.

Then the Haskell code need only access this array as a Ptr Word8, wrap that as a ByteString, then run Data.Binary over the result to rebuild the Map. As you can see here:

http://code.haskell.org/~dons/code/compiled-constants/Constants.hs

I've put a couple of examples of how to access C-side serialised Haskell values in a package here:

http://code.haskell.org/~dons/code/compiled-constants/

Cheers,
Don
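A rough sketch of the Haskell side of that pipeline; Don's Constants.hs linked above is the authoritative version, and the beowulf_len symbol here is an assumption added so the length can be read from the C side:

{-# LANGUAGE ForeignFunctionInterface #-}
module ConstantsSketch (table) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Unsafe as BU
import qualified Data.Map as M
import Data.Binary (decode)
import Codec.Compression.GZip (decompress)
import Foreign.Ptr (Ptr)
import Foreign.C.Types (CChar, CSize)
import Foreign.Storable (peek)
import System.IO.Unsafe (unsafePerformIO)

-- Assumed C side (as in the linked cbits/constants.c, plus a length):
--   const uint8_t beowulf[]    = { 31, 139, 8, ... };
--   const size_t  beowulf_len  = sizeof beowulf;
foreign import ccall unsafe "&beowulf"     c_table    :: Ptr CChar
foreign import ccall unsafe "&beowulf_len" c_tableLen :: Ptr CSize

-- Wrap the static C data without copying, gunzip it, and run
-- Data.Binary over the result to rebuild the Map.
table :: M.Map B.ByteString Int
table = decode . decompress . L.fromChunks . return $ raw
  where
    raw = unsafePerformIO $ do
            len <- peek c_tableLen
            BU.unsafePackCStringLen (c_table, fromIntegral len)
{-# NOINLINE table #-}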
participants (6)
- Bryan O'Sullivan
- Chris Kuklewicz
- Don Stewart
- Jason Dusek
- Jay Scott
- Simon Peyton-Jones