API for reading a big binary file

Hi -- I'm still learning to use ghc effectively, and I'm trying to use it for more-or-less mundane programming tasks, hopefully in a way which is a lot more elegant than what I'd do in Java or Perl.

I've got a big [around a gigabyte] binary file, filled with identical binary structures (imagine a C process writing structs). I'd like to process/analyze them efficiently. In C or even Java, I'd memory map the file and extract the data I need.

Is there a fast way to do this using ghc? I can extract fields by using a ByteString, but I may not be using it fast enough: I've had to write my own routines to extract ints, longs and doubles.

The other option is to write the loading code largely in C, but I'm not clear how to write a data type which maps well to a C struct.

Any help / examples would really be appreciated.

Thanks,
Ranjan

On Thu, Dec 21, 2006 at 01:47:48PM -0800, Ranjan Bagchi wrote:
I've got a big [around a gigabyte] binary file, filled with identical binary structures (imagine a C process writing structs). I'd like to process/analyze them efficiently. In C or even Java, I'd memory map the file and extract the data I need.
Are you sure you want a memory map? (Disclaimer: I am only familiar with the Linux VM.)

* IO is usually (drum roll) IO bound. CPU performance isn't a big deal.
* By using memory mapping, you limit yourself to the largest contiguous chunk of free address space, which is at most 3GB on 32-bit Linux.
* Memory mapping doesn't work on pipes. With today's CPU and disk speeds, piping a compressed file through zcat is often faster than reading the uncompressed file.
* Haskell is not (yet!) powerful enough to statically check normal array access, so you'll be paying for lots of bounds checks.
* mmap's biggest performance advantage, the ability to use disk cache pages in place, is probably lost when your dataset doesn't fit into cache.

That said, if you actually need memory mapping, it shouldn't be too painful.
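To illustrate the non-mmap route suggested by the points above, here is a minimal sketch that streams fixed-size records from a lazy ByteString, so it works equally on a file or a pipe. The 16-byte record size, the little-endian layout, and the names `recordSize`, `records`, `word32le` and `sumFirstField` are all assumptions made for illustration; none of them come from the thread.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Bits (shiftL, (.|.))
import Data.Int (Int64)
import Data.List (foldl')
import Data.Word (Word32)

-- Assumed size of one on-disk record in bytes (hypothetical).
recordSize :: Int64
recordSize = 16

-- Split the input into complete fixed-size records; a trailing
-- partial record is dropped.
records :: BL.ByteString -> [BL.ByteString]
records bs
  | BL.length r < recordSize = []
  | otherwise                = r : records rest
  where
    (r, rest) = BL.splitAt recordSize bs

-- Decode a little-endian 32-bit word from the first four bytes.
word32le :: BL.ByteString -> Word32
word32le =
  foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0 . BL.unpack . BL.take 4

-- Example analysis: sum the first field of every record.
sumFirstField :: BL.ByteString -> Integer
sumFirstField = foldl' (\acc r -> acc + fromIntegral (word32le r)) 0 . records
```

Applied to `BL.getContents`, `sumFirstField` reads standard input lazily in chunks, so the program can sit at the end of a `zcat` pipeline and never needs the whole file in its address space.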
Is there a fast way to do this using ghc? I can extract fields by using a ByteString, but I may not be using it fast enough: I've had to write my own routines to extract ints, longs and doubles.
* Define an instance of Storable. If you are feeling altruistic, get a copy of DrIFT and add support for Storable.
* Use Data.Array.Storable. This provides a mutable array interface to a pointer-to-array-of-struct.
* Use the FFI to access mmap(2); AFAIK there is no standard interface to this:

    foreign import ccall unsafe "mmap" c_mmap :: Ptr a -> CSize -> CInt -> CInt -> CInt -> COff -> IO (Ptr a)
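A Storable instance for a C struct might look roughly like the sketch below. The `Sample` type, its field names, and the assumed C layout (`struct sample { int32_t id; int32_t count; double value; }`, 16 bytes, no padding, native byte order) are hypothetical, since the thread never gives the real structure. For brevity the reader uses `hGetBuf` rather than mmap, so it is a swapped-in way of getting the bytes into a buffer, not the mmap approach itself.

```haskell
import Data.Int (Int32)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Storable (Storable (..))
import System.IO (IOMode (ReadMode), hGetBuf, withBinaryFile)

-- Hypothetical record, assumed to mirror
--   struct sample { int32_t id; int32_t count; double value; };
data Sample = Sample
  { sampleId    :: !Int32
  , sampleCount :: !Int32
  , sampleValue :: !Double
  } deriving (Eq, Show)

instance Storable Sample where
  sizeOf _    = 16   -- assumed struct size
  alignment _ = 8    -- alignment of the widest field (double)
  peek p = do
    i <- peekByteOff p 0
    c <- peekByteOff p 4
    v <- peekByteOff p 8
    return (Sample i c v)
  poke p (Sample i c v) = do
    pokeByteOff p 0 i
    pokeByteOff p 4 c
    pokeByteOff p 8 v

-- Strict left fold over every complete record in a file,
-- reusing one record-sized buffer.
foldSamples :: (a -> Sample -> a) -> a -> FilePath -> IO a
foldSamples step z path =
  withBinaryFile path ReadMode $ \h ->
    allocaBytes recSize $ \buf ->
      let loop acc = acc `seq` do
            n <- hGetBuf h buf recSize
            if n < recSize
              then return acc            -- EOF or trailing partial record
              else do
                s <- peek buf
                loop (step acc s)
      in loop z
  where
    recSize = sizeOf (undefined :: Sample)
```

For example, `foldSamples (\acc s -> acc + sampleValue s) 0 path` would sum the `value` field of every record in the file at `path`. The same instance would also serve with Data.Array.Storable or with a pointer returned by mmap.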
Any help / examples would really be appreciated.
The sources for the standard libraries are generally a good source for system interfacing questions.

On Thu, Dec 21, 2006 at 01:47:48PM -0800, Ranjan Bagchi wrote:
Is there a fast way to do this using ghc? I can extract fields by using a ByteString, but I may not be using it fast enough: I've had to write my own routines to extract ints, longs and doubles.
The other option is to write the loading code largely in C, but I'm not clear how to write a data type which maps well to a C struct.
I'm not sure how fast it is, but NewBinary might fit your needs. Hopefully this is the most current link: http://j.mongers.org/pub/haskell/NewBinary/

Marc
participants (3): Marc Weber, Ranjan Bagchi, Stefan O'Rear