2011/7/23 Eric Rasmussen <ericrasmussen@gmail.com>

Hi Felipe,

Thank you for the very detailed explanation and help. Regarding the first point, for this particular use case it's fine if the user-specified file size is exceeded by the length of a partial line. It's a compact CSV file, so if the user breaks a big file into 100 MB chunks, each chunk would only ever be about 100 MB plus up to 80 bytes, which is fine for the user.

I'm intrigued by the idea of making the bulk copy function with EB.isolate and EB.iterHandle, but I couldn't find a way to fit these into the larger context of writing to multiple file handles. I'll keep working on it and see if I can address the concerns you brought up.

Thanks again!
Eric

On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa <felipe.lessa@gmail.com> wrote:

There is one problem with your algorithm. If the user asks for 4 GiB,
then the program will create files of *at least* 4 GiB. So the user
would need to ask for less, maybe 3.9 GiB. Even so there's some
danger, because there could be a 0.11 GiB line in the file.

Now, the biggest problem is that your code won't run in constant memory.
'EB.take' does not lazily return a lazy ByteString. It strictly
returns a lazy ByteString [1]. The lazy ByteString is used to avoid
copying data (as it is basically the same as a linked list of strict
bytestrings). So if the user asked for 4 GiB files, this program
would need at least 4 GiB of memory, probably more due to overheads.

If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
package doesn't really buy you anything. You should just use the
bytestring package's lazy I/O functions.
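
A minimal sketch of that lazy I/O route, using only the bytestring
package. The names splitAtLine, splitIntoChunks and splitFile, and the
"path.N" output naming, are made up for illustration and are not from
the original program. It follows the "requested size plus the rest of
the current line" behaviour, and it inherits the usual caveats of lazy
I/O (no guarantees about when handles are closed or how much is
retained).

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- | Cut off about n bytes, extended to the end of the line that
-- contains the n-th byte, so no line is split across two chunks.
splitAtLine :: Int64 -> L.ByteString -> (L.ByteString, L.ByteString)
splitAtLine n bs =
  let (prefix, rest)       = L.splitAt (n - 1) bs
      (lineEnd, afterLine) = L.break (== '\n') rest
      -- take the newline itself as well, if there is one
      (newline, rest')     = L.splitAt 1 afterLine
  in (L.concat [prefix, lineEnd, newline], rest')

-- | Repeat splitAtLine until the input is used up.
splitIntoChunks :: Int64 -> L.ByteString -> [L.ByteString]
splitIntoChunks n bs
  | L.null bs = []
  | otherwise =
      let (chunk, rest) = splitAtLine n bs
      in chunk : splitIntoChunks n rest

-- | Hypothetical driver: write chunk i of 'path' to "path.i".
splitFile :: Int64 -> FilePath -> IO ()
splitFile n path = do
  contents <- L.readFile path
  let outputs = [path ++ "." ++ show i | i <- [1 :: Int ..]]
  sequence_ (zipWith L.writeFile outputs (splitIntoChunks n contents))
```

Each chunk here is at least n bytes (except possibly the last one), so
this has the same "ask for a bit less" caveat as the original
algorithm.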

If you want the guarantee of no leaks that enumerator gives, then you
have to use another way of constructing your program. One safe way of
doing it is something like:

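    import Control.Monad.IO.Class (MonadIO, liftIO)
    import Data.Int (Int64)
    import System.IO (Handle)
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as L
    import qualified Data.Enumerator as E

    takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
    takeNextLine = ...

    go :: MonadIO m => Handle -> Int64
       -> E.Iteratee B.ByteString m (Maybe L.ByteString)
    go h n = do
      mline <- takeNextLine
      case mline of
        Nothing -> return Nothing
        Just line
          | L.length line <= n -> do
              -- the line fits: write it, continue with the reduced budget
              liftIO (L.hPut h line)
              go h (n - L.length line)
          | otherwise -> return mline  -- hand the line back to the driver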

So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h'
and returns the leftover data. The driver code needs to check its
result. On 'Nothing', the input is exhausted and the program finishes.
On 'Just line', it saves the line in a new file and calls
'go h2 (n - L.length line)'. It isn't efficient, because lines could be
small, resulting in many small hPuts (bad). But it is correct and will
never put more than 'n' bytes in a file, as long as no single line is
longer than 'n' (great). You could also have some compromise where the
user promises that no line is longer than 'x' bytes (say, 1 MiB). Then
you call a bulk copy function for 'n - x' bytes, and then call
'go h x'. I think you can make the bulk copy function with EB.isolate
and EB.iterHandle.
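
To make the driver logic easier to check, here is a pure model of it
that works on a list of lines instead of an iteratee and handles. It is
only a sketch: 'fill' plays the role of 'go', 'packLines' plays the
role of the driver, and both names are invented for this illustration.
Empty files are skipped, which is an extra assumption of the sketch.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- | Pure counterpart of 'go': take lines while they fit in the
-- remaining budget; return (lines written, lines left over).
fill :: Int64 -> [L.ByteString] -> ([L.ByteString], [L.ByteString])
fill _ [] = ([], [])
fill n (line : rest)
  | L.length line <= n =
      let (written, leftover) = fill (n - L.length line) rest
      in (line : written, leftover)
  | otherwise = ([], line : rest)

-- | Pure counterpart of the driver: each inner list is one output file.
packLines :: Int64 -> [L.ByteString] -> [[L.ByteString]]
packLines n = start []
  where
    -- 'pending' is the line (if any) that did not fit in the previous
    -- file; it opens the new file and reduces that file's budget.
    start pending ls =
      let used                = sum (map L.length pending)
          (written, leftover) = fill (n - used) ls
          file                = pending ++ written
      in case leftover of
           []            -> keep file
           (line : rest) -> keep file ++ start [line] rest
    -- do not emit empty files
    keep file = [file | not (null file)]
```

In this model a line longer than 'n' still ends up in a file of its
own, and that file is larger than 'n'. So the size bound depends on
knowing a maximum line length, which is also what the 'n - x'
compromise relies on.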

Cheers, =)

[1] http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take

--
Felipe.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe