
Hi, I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files? Thanks, Maurício

briqueabraque:
Hi,
I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files?
I'd think maybe a lazy bytestring would be ok. Something like:
    import qualified Data.ByteString.Lazy.Char8 as B
    B.putStr . B.unlines . map edit . B.lines =<< B.getContents
in the darcs version of Data.ByteString, here, http://www.cse.unsw.edu.au/~dons/fps.html
Let me know how you go, it would make a good benchmark.
-- Don
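A compilable sketch of the lazy-ByteString filter suggested above. The edit rule (replace any line starting with "Instances") is only a placeholder borrowed from later in this thread, and the names `editLine`, `editAll` and `runFilter` are hypothetical, not library functions. Note that this streams stdin to stdout, so it produces a new copy rather than editing the file in place.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Placeholder per-line edit (an assumption for illustration):
-- replace every line that starts with "Instances" by "EDIT".
editLine :: L.ByteString -> L.ByteString
editLine s
  | L.pack "Instances" `L.isPrefixOf` s = L.pack "EDIT"
  | otherwise                           = s

-- Pure transformation over the whole input. L.lines yields an ordinary
-- list of lazy ByteStrings, so the list `map` is the right function here.
-- L.unlines terminates every line with '\n', including the last one.
editAll :: L.ByteString -> L.ByteString
editAll = L.unlines . map editLine . L.lines

-- Stream stdin to stdout; the input is consumed lazily, chunk by chunk.
runFilter :: IO ()
runFilter = L.interact editAll
```

To try it, define `main = runFilter` and run the program as a filter, for example `./filter < big.txt > big.edited.txt`.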

dons:
briqueabraque:
Hi,
I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files?
I'd think maybe a lazy bytestring would be ok.
Something like:
    import qualified Data.ByteString.Lazy.Char8 as B
    B.putStr . B.unlines . map edit . B.lines =<< B.getContents
in the darcs version of Data.ByteString, here, http://www.cse.unsw.edu.au/~dons/fps.html Let me know how you go, it would make a good benchmark.
Oh, of course, if you actually don't want to copy the file, you'll need to strictly read the input file, in order to write over it safely. - Don

dons:
briqueabraque:
Hi,
I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files?
Thinking further, since you want to avoid copying on the disk, you need to be able to keep the edited version in memory. So the strict bytestring would be best, for example:
    import System.Environment
    import qualified Data.ByteString.Char8 as B
    main = do
        [f] <- getArgs
        B.writeFile f . B.unlines . map edit . B.lines =<< B.readFile f
      where
        edit :: B.ByteString -> B.ByteString
        edit s | (B.pack "Instances") `B.isPrefixOf` s = B.pack "EDIT"
               | otherwise                             = s
Edits a 100M file in
    $ ghc -O -funbox-strict-fields A.hs -package fps
    $ time ./a.out /home/dons/data/100M
    ./a.out /home/dons/data/100M 1.54s user 0.76s system 13% cpu 17.371 total
You could probably tune this further.
-- Don

On Fri, Jun 02, 2006 at 12:34:51PM +1000, Donald Bruce Stewart wrote:
dons:
briqueabraque:
Hi,
I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files?
Thinking further, since you want to avoid copying on the disk, you need to be able to keep the edited version in memory. So the strict bytestring would be best, for example:
dons is right here, but I'd add that it's hard to safely edit a big file without creating a copy, if you want your program to leave the file in a consistent state even if it crashes (power failure, kill, file server failure). dons' suggestion could leave you with a deleted file (if power goes down at the beginning of a write). If you aren't changing the size of the file, opening it ReadWrite will allow you to modify it reasonably safely. If you *are* changing its size, then doing something explicit would probably be the way to go (and I'd probably actually use mmap and memmove to make the change, if you do need to modify the file size). But then, I'm thinking posix (as I generally do), which may not be the case for you. And perhaps you don't need to be careful. I've found that if bad things can happen, they do. But that's largely because darcs has lots of users...
-- David Roundy
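For the case where the file size does not change, the ReadWrite approach might look roughly like the sketch below. It uses only `System.IO` from base and the strict ByteString API; the function name `overwriteAt` and the bounds check are assumptions made for illustration, not something proposed in the thread. It is not crash-proof: a failure in the middle of the write can still leave the region partly updated.

```haskell
import System.IO
import qualified Data.ByteString.Char8 as B

-- Overwrite the bytes starting at `offset` with `new`, leaving the rest of
-- the file untouched. ReadWriteMode opens the file without truncating it.
-- The write is refused if it would extend past the current end of file,
-- so the file size never changes.
overwriteAt :: FilePath -> Integer -> B.ByteString -> IO ()
overwriteAt path offset new =
  withBinaryFile path ReadWriteMode $ \h -> do
    size <- hFileSize h
    if offset < 0 || offset + fromIntegral (B.length new) > size
      then ioError (userError "overwriteAt: write would fall outside the file")
      else do
        hSeek h AbsoluteSeek offset   -- jump straight to the target position
        B.hPut h new                  -- replace exactly (B.length new) bytes
```

For example, `overwriteAt "data.txt" 6 (B.pack "WORLD")` replaces five bytes starting at byte offset 6; only that region is rewritten on disk, however large the file is.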

David Roundy wrote:
On Fri, Jun 02, 2006 at 12:34:51PM +1000, Donald Bruce Stewart wrote:
dons:
briqueabraque:
Hi,
I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files?
Thinking further, since you want to avoid copying on the disk, you need to be able to keep the edited version in memory. So the strict bytestring would be best, for example:
dons is right here, but I'd add that it's hard to safely edit a big file without creating a copy, if you want your program to leave the file in a consistent state even if it crashes (power failure, kill, file server failure). dons' suggestion could leave you with a deleted file (if power goes down at the beginning of a write). If you aren't changing the size of the file, opening it ReadWrite will allow you to modify it reasonably safely. If you *are* changing its size, then doing something explicit would probably be the way to go (and I'd probably actually use mmap and memmove to make the change, if you do need to modify the file size). But then, I'm thinking posix (as I generally do), which may not be the case for you. And perhaps you don't need to be careful. I've found that if bad things can happen, they do. But that's largely because darcs has lots of users...
I like very much the idea of memory mapping the file. In some situations, random access to the file would help a lot. Can I do that on Windows? Also: safety is not a concern. I can delete those big files as much as I need, they are only temporary data. Maurício

Hello Maurício, Tuesday, June 20, 2006, 12:59:47 AM, you wrote:
I need to edit big text files (5 to 500 Mb). But I just need to change one or two small lines, and save it. What is the best way to do that in Haskell, without creating copies of the whole files? I like very much the idea of memory mapping the file. In some situations, random access to the file would help a lot. Can I do that on Windows?
I wanted to write that my Streams 0.2 library (see previous letter) also supports memory-mapped files on Windows (see Examples/wc4MMFile.hs), but I'm not sure that you really need it. If your edit doesn't involve moving data (i.e. the new lines have the same size as the overwritten ones), you don't need a memory-mapped file, because it's enough to open the file in ReadWriteMode (see the description of the "openBinaryFile" call). If you need to change the file size, then you can just read it into a large buffer, change all that you want and write it back.
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
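A minimal sketch of the same-size case described above: scan the file line by line, and when the target line is found, seek back and overwrite it in place. The function name `replaceLineInPlace` and the choice to pad the replacement with spaces up to the old line length are assumptions for illustration only. The sketch also assumes plain "\n" line endings, so that each line occupies its length plus one byte.

```haskell
import System.IO
import qualified Data.ByteString.Char8 as B

-- Replace the first line that starts with `prefix` by `new`, padded with
-- spaces to the old line's length so that no other byte has to move.
-- Returns False if no line matches or if `new` is longer than that line.
replaceLineInPlace :: FilePath -> B.ByteString -> B.ByteString -> IO Bool
replaceLineInPlace path prefix new =
  withBinaryFile path ReadWriteMode $ \h -> go h 0
  where
    -- `offset` is the byte position at which the next line starts.
    go h offset = do
      eof <- hIsEOF h
      if eof
        then return False
        else do
          line <- B.hGetLine h
          let len = B.length line
              pad = len - B.length new
          case () of
            _ | not (prefix `B.isPrefixOf` line) ->
                  go h (offset + fromIntegral len + 1)  -- +1 for the '\n'
              | pad < 0 -> return False                 -- would change the size
              | otherwise -> do
                  hSeek h AbsoluteSeek offset
                  B.hPut h (new `B.append` B.replicate pad ' ')
                  return True
```

For instance, `replaceLineInPlace "data.txt" (B.pack "Instances") (B.pack "EDIT")` rewrites only the bytes of the matching line. If the replacement has to be longer than the original line, this approach no longer applies and the read-modify-write route is needed instead.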
participants (4): Bulat Ziganshin, David Roundy, dons@cse.unsw.edu.au, Maurício