
If you used Data.Enumerator.Text, you might benefit from the "lines"
function:
lines :: Monad m => Enumeratee Text Text m b
But there is something I don't get with that signature:
why isn't it:
lines :: Monad m => Enumeratee Text [Text] m b
??
2011/7/23 Eric Rasmussen
Hi Felipe,
Thank you for the very detailed explanation and help. Regarding the first point, for this particular use case it's fine if the user-specified file size is extended by the length of a partial line. It's a compact CSV file, so if the user breaks a big file into 100 MB chunks, each chunk would only ever be about 100 MB plus up to 80 bytes, which is fine for the user.
I'm intrigued by the idea of making the bulk copy function with EB.isolate and EB.iterHandle, but I couldn't find a way to fit these into the larger context of writing to multiple file handles. I'll keep working on it and see if I can address the concerns you brought up.
Thanks again! Eric
On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa < felipe.lessa@gmail.com> wrote:
There is one problem with your algorithm. If the user asks for 4 GiB, then the program will create files with *at least* 4 GiB. So the user would need to ask for less, maybe 3.9 GiB. Even so there's some danger, because there could be a 0.11 GiB line in the file.
Now, the biggest problem: your code won't run in constant memory. 'EB.take' does not lazily return a lazy ByteString. It strictly returns a lazy ByteString [1]. The lazy ByteString is used to avoid copying data (as it is basically the same as a linked list of strict bytestrings). So if the user asked for 4 GiB files, this program would need at least 4 GiB of memory, probably more due to overheads.
If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator package doesn't really buy you anything. You should just use the bytestring package's lazy I/O functions.
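For illustration, here is a minimal sketch of that lazy I/O route using only the bytestring package. The names 'splitChunks' and 'splitFile' and the output naming scheme (input path plus ".0", ".1", ...) are made up for this example. Each piece is about 'n' bytes and is extended to the end of the current line, so a piece can exceed 'n' by up to one line. It should stream rather than load whole pieces into memory, but that rests on the usual lazy I/O caveats.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad (zipWithM_)
import Data.Int (Int64)

-- | Split the input into pieces of roughly n bytes each, extending every
-- piece to the end of the line it stops in. Concatenating the result
-- gives back the original input.
splitChunks :: Int64 -> L.ByteString -> [L.ByteString]
splitChunks n input
  | L.null input = []
  | otherwise =
      let (bulk, rest)      = L.splitAt n input        -- first n bytes
          (lineTail, rest') = L.break (== '\n') rest   -- remainder of the current line
          (newline, rest'') = L.splitAt 1 rest'        -- the '\n' itself, if present
      in L.concat [bulk, lineTail, newline] : splitChunks n rest''

-- | Hypothetical driver: read the file lazily and write each piece to
-- "<path>.0", "<path>.1", and so on.
splitFile :: Int64 -> FilePath -> IO ()
splitFile n path = do
  contents <- L.readFile path
  zipWithM_ write [0 :: Int ..] (splitChunks n contents)
  where
    write i piece = L.writeFile (path ++ "." ++ show i) piece
```

For example, 'splitChunks 4' applied to "ab\ncdef\ngh\n" yields the two pieces "ab\ncdef\n" and "gh\n".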
If you want the guarantee of no leaks that enumerator gives, then you have to use another way of constructing your program. One safe way of doing it is something like:
    takeNextLine :: Monad m => E.Iteratee B.ByteString m (Maybe L.ByteString)
    takeNextLine = ...

    -- liftIO comes from Control.Monad.IO.Class; it is needed because
    -- L.hPut runs in IO, not in the Iteratee monad.
    go :: MonadIO m => Handle -> Int64 -> E.Iteratee B.ByteString m (Maybe L.ByteString)
    go h n = do
      mline <- takeNextLine
      case mline of
        Nothing -> return Nothing
        Just line
          | L.length line <= n -> liftIO (L.hPut h line) >> go h (n - L.length line)
          | otherwise          -> return mline
So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h' and returns the leftover data. The driver code needs to check its result. On 'Nothing', the program finishes. On 'Just line', save the line to a new file and call 'go h2 (n - L.length line)'. It isn't efficient because lines could be small, resulting in many small hPuts (bad). But it is correct and will never use more than 'n' bytes (great).
You could also have some compromise where the user says that he'll never have lines longer than 'x' bytes (say, 1 MiB). Then you call a bulk copy function for 'n - x' bytes, and then call 'go h x'. I think you can make the bulk copy function with EB.isolate and EB.iterHandle.
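To make the driver logic concrete, here is a pure model of it over a list of lines. This is only a sketch of the byte accounting, not the enumerator version: 'fill' and 'planFiles' are names invented for this example, and no I/O happens. 'fill' plays the role of 'go', and 'planFiles' plays the role of the driver that opens a new file for the leftover line.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- | Pure counterpart of 'go': accept lines while they fit in the remaining
-- budget n. Returns the accepted lines and everything left over, starting
-- with the first line that did not fit.
fill :: Int64 -> [L.ByteString] -> ([L.ByteString], [L.ByteString])
fill _ [] = ([], [])
fill n (l : ls)
  | L.length l <= n =
      let (taken, rest) = fill (n - L.length l) ls
      in (l : taken, rest)
  | otherwise = ([], l : ls)

-- | Pure counterpart of the driver: each inner list is the content of one
-- output file. The leftover line always opens the next file, and the
-- budget for that file is reduced by its length.
planFiles :: Int64 -> [L.ByteString] -> [[L.ByteString]]
planFiles _ [] = []
planFiles n (l : ls) =
  let (taken, rest) = fill (n - L.length l) ls
  in (l : taken) : planFiles n rest
```

With a budget of 6 bytes and the lines "ab\n", "cd\n", "efgh\n", "i\n", this plans three files: the first two lines together, then "efgh\n", then "i\n". Note that a single line longer than 'n' still gets a file of its own in this model, and that file is larger than 'n'.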
Cheers, =)
[1] http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src...
-- Felipe.