
Hello Donald,
can you test whether this implementation lines = split 0x0a is as fast as the existing (long) ones, both for Lazy and Strict ByteString?
also, isn't it faster to use the following implementation: isSpaceWord8 = (spacesFlagsArray !) ?
also, i propose to move the getLine/getContents/putStr/interact/readFile-type functions into the .Char8 modules (both for strict and lazy bytestrings), because these functions are encoding-dependent and work with text (as opposed to hGet/hPut, which work with raw binary data blocks). in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines', but it was impossible because the 'lines' function is defined only in the Lazy.Char8 module
i'm sending you a bunch of small patches that fix the I/O part of the library, providing the same set of operations for lazy and strict bytestrings, for ghc and non-ghc platforms
also, i ran into small problems using the FPS repository for development (it seems i'm the first windows developer of the lib). First, i propose to change the darcs 'prefs' file to the following:
test cd tests && make fast
- it should work both on unix and windows
second, i've changed the 'time' calls in tests/Makefile to use my own 't' utility instead of 'time'. but of course it's not a universal solution. at least, 'time' in the windows shell (cmd.exe) is a _built-in_ utility that doesn't have anything in common with the unix 'time' :)
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, 2006-07-15 at 19:16 +0400, Bulat Ziganshin wrote:
Hello Donald,
can you test whether this implementation lines = split 0x0a is as fast as the existing (long) ones, both for Lazy and Strict ByteString?
It might actually be the other way around, that the split implementation could benefit from the work that went into the optimisation of the lines function. I spent quite some time trying to optimise the lines implementation, at least for the Lazy module. To get better performance it relies on the assumption that many lines fit into a chunk. That may not be true for uses of split in general. It's worth investigating. Btw, you can run the benchmarks too, they are included in the fps repo.
also, isn't it faster to use the following implementation: isSpaceWord8 = (spacesFlagsArray !) ?
Benchmark it and tell us which is faster.
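For reference, a table-based variant along the lines suggested above might look like this (a sketch only; spacesFlagsArray is the proposed name, the exact set of bytes treated as space is an assumption here, and whether it beats the arithmetic version is exactly what the benchmark would have to show):

    import Data.Array.Unboxed (UArray, listArray, (!))
    import Data.Word (Word8)

    -- 256-entry lookup table: True for the bytes we treat as spaces
    -- (assumed set: HT, LF, VT, FF, CR, space and latin-1 NBSP).
    spacesFlagsArray :: UArray Word8 Bool
    spacesFlagsArray = listArray (0, 255)
      [ w `elem` [0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x20, 0xa0] | w <- [0 .. 255 :: Word8] ]

    -- The proposed implementation: one array index per test.
    isSpaceWord8 :: Word8 -> Bool
    isSpaceWord8 = (spacesFlagsArray !)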
also, i propose to move the getLine/getContents/putStr/interact/readFile-type functions into the .Char8 modules (both for strict and lazy bytestrings), because these functions are encoding-dependent and work with text (as opposed to hGet/hPut, which work with raw binary data blocks).
Yes, getLine and putStrLn are encoding-dependent (they know the encoding of '\n'). getContents, putStr, readFile, interact etc are encoding-independent; they're just the same as hGet/hPut, working on binary data blocks. Indeed putStr = hPut stdout.
in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines', but it was impossible because the 'lines' function is defined only in the Lazy.Char8 module
Yes, that's the way it should be. And of course there is no need for hGetLines in the Lazy module since it is just hGetContents >>= lines. In my opinion the hGetLines in the other module should be removed too, as it's just a special case of what the Lazy module does.
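Spelled out, the composition in question is a one-liner (a sketch; hGetLines is the name under discussion and L8 here means Data.ByteString.Lazy.Char8):

    import System.IO (Handle)
    import qualified Data.ByteString.Lazy.Char8 as L8

    -- hGetLines for lazy ByteStrings, written as the composition above.
    -- The '\n' interpretation lives in the Char8 module, which is why this
    -- cannot be written against the plain Lazy module alone.
    hGetLines :: Handle -> IO [L8.ByteString]
    hGetLines h = fmap L8.lines (L8.hGetContents h)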
i'm sending you a bunch of small patches that fix the I/O part of the library, providing the same set of operations for lazy and strict bytestrings, for ghc and non-ghc platforms
also, i ran into small problems using the FPS repository for development (it seems i'm the first windows developer of the lib). First, i propose to change the darcs 'prefs' file to the following:
test cd tests && make fast
- it should work both on unix and windows
Fair enough. :-)
second, i've changed the 'time' calls in tests/Makefile to use my own 't' utility instead of 'time'. but of course it's not a universal solution. at least, 'time' in the windows shell (cmd.exe) is a _built-in_ utility that doesn't have anything in common with the unix 'time' :)
Duncan

Hello Duncan,
Saturday, July 15, 2006, 8:04:26 PM, you wrote:
can you test whether this implementation lines = split 0x0a is as fast as the existing (long) ones, both for Lazy and Strict ByteString?
It might actually be the other way around, that the split implementation could benefit from the work that went into the optimisation of the lines function. I spent quite some time trying to optimise the lines implementation, at least for the Lazy module. To get better performance it relies on the assumption that many lines fit into a chunk. That may not be true for uses of split in general. It's worth investigating.
well, you know this problem much deeper than me. so i'm shutting up :)
although i can say that strict ByteString should benefit from your implementation too (both for lines and split, for obvious reasons)
imho, Lazy.split should just use (map P.split) and then join the lines that were split between adjacent blocks
Btw, you can run the benchmarks too, they are included in the fps repo.
also, isn't it faster to use the following implementation: isSpaceWord8 = (spacesFlagsArray !) ?
Benchmark it and tell us which is faster.
can my laziness be enough justification? :)
also, i propose to move the getLine/getContents/putStr/interact/readFile-type functions into the .Char8 modules (both for strict and lazy bytestrings), because these functions are encoding-dependent and work with text (as opposed to hGet/hPut, which work with raw binary data blocks).
Yes, getLine and putStrLn are encoding-dependent (they know the encoding of '\n'). getContents, putStr, readFile, interact etc are encoding-independent; they're just the same as hGet/hPut, working on binary data blocks. Indeed putStr = hPut stdout.
they all work with text files, so they are also encoding-dependent (translating CR+LF to LF on windows). putStr is the only exception, but it can be moved for company :) this will make a clear distinction between functions that use ByteString as a raw sequence of bytes (hGet/hPut) and functions that use ByteString as a packed String representing text data
in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines', but it was impossible because the 'lines' function is defined only in the Lazy.Char8 module
Yes, that's the way it should be. And of course there is no need for hGetLines in the Lazy module since it is just hGetContents >>= lines. In my opinion the hGetLines in the other module should be removed too, as it's just a special case of what the Lazy module does.
it's also possible. but the situation where one ByteString implementation supports a particular function while another doesn't imho is not very good. a user should be able to switch between implementations w/o rewriting his entire program
btw, you may be interested to know that i implemented mmapBinaryFile in the Streams lib, based on the code from ByteString. it works both on Windows and Unix, using the universal mmap API i described in a letter to David Roundy
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, 2006-07-15 at 21:57 +0400, Bulat Ziganshin wrote:
Hello Duncan,
Saturday, July 15, 2006, 8:04:26 PM, you wrote:
can you test whether this implementation lines = split 0x0a is as fast as the existing (long) ones, both for Lazy and Strict ByteString?
It might actually be the other way around, that the split implementation could benefit from the work that went into the optimisation of the lines function. I spent quite some time trying to optimise the lines implementation, at least for the Lazy module. To get better performance it relies on the assumption that many lines fit into a chunk. That may not be true for uses of split in general. It's worth investigating.
well, you know this problem much deeper than me. so i'm shutting up :)
although i can say that strict ByteString should benefit from your implementation too (both for lines and split, for obvious reasons)
imho, Lazy.split should just use (map P.split) and then join the lines that were split between adjacent blocks
That's what I did first. Keeping track of re-joining bits between adjacent blocks adds quite a bit of bookkeeping overhead.
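To make that bookkeeping concrete, a chunk-wise split with re-joining at chunk boundaries might look roughly like this (a sketch against the current Data.ByteString.Lazy API rather than the fps internals of the time; splitLazy is an illustrative name):

    import Data.Word (Word8)
    import qualified Data.ByteString as S
    import qualified Data.ByteString.Lazy as L

    -- Split every strict chunk with S.split, then glue the last piece of
    -- each chunk onto the first field produced by the chunks after it.
    splitLazy :: Word8 -> L.ByteString -> [L.ByteString]
    splitLazy w lbs
      | L.null lbs = []
      | otherwise  = foldr step [L.empty] (L.toChunks lbs)
      where
        step chunk acc@(field : fields) =
          case map L.fromStrict (S.split w chunk) of
            []     -> acc        -- defensive: empty chunk
            pieces -> init pieces ++ (last pieces `L.append` field) : fields
        step _ [] = []           -- never happens: the accumulator starts non-empty

Every chunk boundary costs an append of the two partial pieces, which is the bookkeeping overhead mentioned above; the specialised lines code can avoid most of it when many lines fit in a single chunk.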
Btw, you can run the benchmarks too, they are included in the fps repo.
also, isn't it faster to use the following implementation: isSpaceWord8 = (spacesFlagsArray !) ?
Benchmark it and tell us which is faster.
can my laziness be enough justification? :)
also, i propose to move the getLine/getContents/putStr/interact/readFile-type functions into the .Char8 modules (both for strict and lazy bytestrings), because these functions are encoding-dependent and work with text (as opposed to hGet/hPut, which work with raw binary data blocks).
Yes, getLine and putStrLn are encoding-dependent (they know the encoding of '\n'). getContents, putStr, readFile, interact etc are encoding-independent; they're just the same as hGet/hPut, working on binary data blocks. Indeed putStr = hPut stdout.
they all work with text files, so they are also encoding-dependent (translating CR+LF to LF on windows). putStr is the only exception, but it can be moved for company :)
Ok fair enough, they should be using openBinaryFile then rather than openFile.
this will make a clear distinction between functions that use ByteString as a raw sequence of bytes (hGet/hPut) and functions that use ByteString as a packed String representing text data
There really is no difference with hGet/hPut. readFile/writeFile etc are implemented using hGet/hPut.
in particular, i tried to implement Lazy.hGetLines as 'hGetContents >>= lines', but it was impossible because the 'lines' function is defined only in the Lazy.Char8 module
Yes, that's the way it should be. And of course there is no need for hGetLines in the Lazy module since it is just hGetContents >>= lines. In my opinion the hGetLines in the other module should be removed too, as it's just a special case of what the Lazy module does.
it's also possible. but the situation where one ByteString implementation supports a particular function while another doesn't imho is not very good. a user should be able to switch between implementations w/o rewriting his entire program
Yeah, I think we should eliminate hGetLines partly for that reason.
btw, you may be interested to know that i implemented mmapBinaryFile in the Streams lib, based on the code from ByteString. it works both on Windows and Unix, using the universal mmap API i described in a letter to David Roundy
Sounds good. If we can get a universal mmap API into the base lib then we can add mmapFile back into the ByteString module (it's currently got a commented-out posix version). Duncan

Hello Duncan,
Saturday, July 15, 2006, 10:20:08 PM, you wrote:
btw, you may be interested to know that i implemented mmapBinaryFile in the Streams lib, based on the code from ByteString. it works both on Windows and Unix, using the universal mmap API i described in a letter to David Roundy
Sounds good. If we can get a universal mmap API into the base lib then we can add mmapFile back into the ByteString module (it's currently got a commented-out posix version).
technically it's easy - just copy two files (System.FD and System.MMFile) from my lib. but politically i don't know what the best way is :( at least, file mapping is already implemented in the Win32 library
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Duncan,
Saturday, July 15, 2006, 8:04:26 PM, you wrote:
getContents, putStr, readFile, interact etc are encoding-independent; they're just the same as hGet/hPut, working on binary data blocks. Indeed putStr = hPut stdout.
one shortcoming i've seen in the library is that it doesn't distinguish between the Text and Binary modes of opening a file. indeed, on Unix they are the same, but not on windows. below is my conversation on this topic with Donald. finally he applied the changes i proposed (using openFile instead of openBinaryFile in these operations), and today i sent him a patch that makes the same change in the Lazy module
System.IO contains the following definitions:
readFile name = openFile name ReadMode >>= hGetContents
writeFile name str = do hdl <- openFile name WriteMode ...
appendFile name str = do hdl <- openFile name AppendMode ...
As you can see, the files are opened in text mode, while your definitions open files in binary mode:
readFile f = bracket (openBinaryFile f ReadMode) hClose (\h -> hFileSize h >>= hGet h . fromIntegral)
writeFile f ps = bracket (openBinaryFile f WriteMode) hClose (\h -> hPut h ps)
appendFile f txt = bracket (openBinaryFile f AppendMode) hClose (\hdl -> hPut hdl txt)
I don't understand your point here. Do you mean I should be opening in Text mode, since it's not portable in Binary mode? Can you clarify?
just in case you don't know - due to historical roots, different operating systems have different line end sequences - Unix uses chr(10), classic Mac OS chr(13), while DOS/Windows uses chr(13)+chr(10)
In order to allow writing universal text-processing programs that work on any OS, the standard C libraries implemented the ability to open files in "text mode", in which case OS-specific line ends are translated by the library to the standard Unix ones when reading, and vice versa when writing
The System.IO routines i mentioned also open files in text mode, which means that on Windows they will correctly translate the 13+10 line ends (standard for this OS) to chr(10). This means that any text-processing functions written with translated (aka Unix) line ends in mind will work correctly (with the contents of files read/written by the mentioned System.IO routines) even on Windows
for example, a 2-line text file on Windows may contain something like "line1\r\nline2". When read via openBinaryFile and split by 'lines', the result will be ["line1\r", "line2"], which is incorrect. When read via openFile (which opens files in text mode), the Windows-specific line end will be translated to the Unix-specific one, so the string read will be "line1\nline2" and 'lines' will return the correct result ["line1", "line2"]
So, while under Unix there is absolutely no difference which mode you use to open files, it makes a difference on Windows. If the original routines use openFile then these routines are intended to work with _text_ files, and their clones should give text translation a chance too.
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
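A small Haskell illustration of the example in the message above (a sketch; the two packed strings stand in for what binary-mode and text-mode reads of the same Windows file would deliver):

    import qualified Data.ByteString.Char8 as C

    -- The same on-disk Windows file, "line1\r\nline2", seen two ways:
    binaryRead, textRead :: [C.ByteString]
    binaryRead = C.lines (C.pack "line1\r\nline2")  -- as read via openBinaryFile: ["line1\r","line2"]
    textRead   = C.lines (C.pack "line1\nline2")    -- as read via openFile on Windows: ["line1","line2"]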

On Sat, 2006-07-15 at 22:14 +0400, Bulat Ziganshin wrote:
Hello Duncan,
Saturday, July 15, 2006, 8:04:26 PM, you wrote:
getContents, putStr, readFile, interact etc are encoding-independent; they're just the same as hGet/hPut, working on binary data blocks. Indeed putStr = hPut stdout.
one shortcoming i've seen in the library is that it doesn't distinguish between the Text and Binary modes of opening a file. indeed, on Unix they are the same, but not on windows. below is my conversation on this topic with Donald. finally he applied the changes i proposed (using openFile instead of openBinaryFile in these operations), and today i sent him a patch that makes the same change in the Lazy module
System.IO contains the following definitions:
readFile name = openFile name ReadMode >>= hGetContents
writeFile name str = do hdl <- openFile name WriteMode ...
appendFile name str = do hdl <- openFile name AppendMode ...
As you can see, the files are opened in text mode, while your definitions open files in binary mode:
readFile f = bracket (openBinaryFile f ReadMode) hClose (\h -> hFileSize h >>= hGet h . fromIntegral)
writeFile f ps = bracket (openBinaryFile f WriteMode) hClose (\h -> hPut h ps)
appendFile f txt = bracket (openBinaryFile f AppendMode) hClose (\hdl -> hPut hdl txt)
I don't understand your point here. Do you mean I should be opening in Text mode, since it's not portable in Binary mode? Can you clarify?
just in case you don't know - due to historical roots, different operating systems have different line end sequences - Unix uses chr(10), classic Mac OS chr(13), while DOS/Windows uses chr(13)+chr(10)
In order to allow writing universal text-processing programs that work on any OS, the standard C libraries implemented the ability to open files in "text mode", in which case OS-specific line ends are translated by the library to the standard Unix ones when reading, and vice versa when writing
The System.IO routines i mentioned also open files in text mode, which means that on Windows they will correctly translate the 13+10 line ends (standard for this OS) to chr(10). This means that any text-processing functions written with translated (aka Unix) line ends in mind will work correctly (with the contents of files read/written by the mentioned System.IO routines) even on Windows
for example, a 2-line text file on Windows may contain something like "line1\r\nline2". When read via openBinaryFile and split by 'lines', the result will be ["line1\r", "line2"], which is incorrect. When read via openFile (which opens files in text mode), the Windows-specific line end will be translated to the Unix-specific one, so the string read will be "line1\nline2" and 'lines' will return the correct result ["line1", "line2"]
So, while under Unix there is absolutely no difference which mode you use to open files, it makes a difference on Windows. If the original routines use openFile then these routines are intended to work with _text_ files, and their clones should give text translation a chance too.
So presumably the correct solution is to have the readFile, writeFile etc in the Data.ByteString module use openBinaryFile and the versions in Data.ByteString.Char8 use openFile. That way the versions that are interpreting strings as text will get the OS's line ending conversions. Duncan
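In code, that arrangement would look something like the following (a sketch; the primed names are illustrative only, and whether byte-level reads on a text-mode handle actually see the CR+LF translation depends on the IO library - it did via the C runtime at the time, while newer GHC versions translate newlines only for character I/O):

    import Control.Exception (bracket)
    import System.IO (IOMode (ReadMode), hClose, hFileSize, openBinaryFile, openFile)
    import qualified Data.ByteString as B

    -- Data.ByteString flavour: raw bytes, no newline translation.
    readFileBinary' :: FilePath -> IO B.ByteString
    readFileBinary' f =
      bracket (openBinaryFile f ReadMode) hClose
              (\h -> hFileSize h >>= B.hGet h . fromIntegral)

    -- Data.ByteString.Char8 flavour as proposed: openFile requests text mode.
    -- hGetContents is used instead of hFileSize + hGet because newline
    -- translation can make the data shorter than the on-disk size.
    readFileText' :: FilePath -> IO B.ByteString
    readFileText' f =
      bracket (openFile f ReadMode) hClose B.hGetContents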

Hello Duncan,
Saturday, July 15, 2006, 10:23:07 PM, you wrote:
So, while under Unix there is absolutely no difference which mode you use to open files, it makes a difference on Windows. If the original routines use openFile then these routines are intended to work with _text_ files, and their clones should give text translation a chance too.
So presumably the correct solution is to have the readFile, writeFile etc in the Data.ByteString module use openBinaryFile and the versions in Data.ByteString.Char8 use openFile. That way the versions that are interpreting strings as text will get the OS's line ending conversions.
i will vote against this, because in the Haskell I/O system there is already an informal principle that functions which open files in text mode have plain names (openFile, readFile and so on), while functions which open files in binary mode have 'Binary' in their names (openBinaryFile, at least). your proposal will add another way to distinguish between functions working with text and binary files - based on the module where they are defined. this will complicate things without need.
it would be much better to continue the existing conventions and name such functions readBinaryFile and so on. but because such functions were not requested for the standard lib, i guess they are also not much required for the new one. so, i propose to move these text-file-manipulating operations to the .Char8 modules and not implement readBinaryFile/... at this moment
btw, mmapFile should also be renamed to mmapBinaryFile because it can't translate text files on Windows. such a name will make this clear for users of the function and may save many nerves
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sun, 2006-07-16 at 12:55 +0400, Bulat Ziganshin wrote:
So presumably the correct solution is to have the readFile, writeFile etc in the Data.ByteString module use openBinaryFile and the versions in Data.ByteString.Char8 use openFile. That way the versions that are interpreting strings as text will get the OS's line ending conversions.
i will vote against this, because in the Haskell I/O system there is already an informal principle that functions which open files in text mode have plain names (openFile, readFile and so on), while functions which open files in binary mode have 'Binary' in their names (openBinaryFile, at least). your proposal will add another way to distinguish between functions working with text and binary files - based on the module where they are defined.
True. That is exactly what the Data.ByteString / Data.ByteString.Char8 module split does however. It takes two interpretations of blocks of binary data. One interprets them just as sequences of bytes. The other interprets the data as a string type of which ASCII is a subset.
this will complicate things without need. it would be much better to continue the existing conventions and name such functions readBinaryFile and so on. but because such functions were not requested for the standard lib, i guess they are also not much required for the new one. so, i propose to move these text-file-manipulating operations to the .Char8 modules and not implement readBinaryFile/... at this moment
It's crucial to be able to read ordinary ByteStrings with a binary interpretation, otherwise we cannot parse binary protocols or files, or indeed even read non-ascii text files.
I can see that there is an argument for calling them:
Data.ByteString.readBinaryFile
Data.ByteString.Char8.readFile
Let's see what other people think. Duncan

Hello Duncan,
Sunday, July 16, 2006, 5:17:29 PM, you wrote:
It's crucial to be able to read ordinary ByteStrings with a binary interpretation, otherwise we cannot parse binary protocols or files, or indeed even read non-ascii text files.
I can see that there is an argument for calling them:
Data.ByteString.readBinaryFile
Data.ByteString.Char8.readFile
Let's see what other people think.
it's ok from my viewpoint
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com