Attoparsec.ByteString.Char8 or Attoparsec.ByteString for diff output?

Dear Listers,
I am developing a program to parse diff output taken from stdin (as in diff file1 file2 | myApp) or from a file. I am reading the input as a ByteString in either case and parsing it with Attoparsec. My question is: should I use Data.Attoparsec.ByteString.Char8 or Data.Attoparsec.ByteString?
So far I've been using Data.Attoparsec.ByteString.Char8, and it works for my sample files, which are in UTF-8, Latin-1, or the default Windows encoding.
What do you suggest?
Note: I previously sent this question to the beginners list, but someone privately suggested that I send it to this list.
Regards,
Pedro Borges

On Fri, Feb 17, 2023 at 01:32:48PM -0400, Pedro B. wrote:
> I am developing a program to parse diff output taken from stdin (as in diff file1 file2 | myApp) or from a file. I am reading the input as a ByteString in either case and parsing it with Attoparsec. My question is: should I use Data.Attoparsec.ByteString.Char8 or Data.Attoparsec.ByteString?
> So far I've been using Data.Attoparsec.ByteString.Char8, and it works for my sample files, which are in UTF-8, Latin-1, or the default Windows encoding.
> What do you suggest?
Because the underlying ByteString data type is the same:
    Data.ByteString ~ Data.ByteString.Char8
you can use either or both sets of combinators as you see fit. The Char8 combinators match the parsed ByteStrings against Char predicates, while the base ByteString combinators match against Word8 predicates. The below is valid:
    import Data.Attoparsec.ByteString as A8
    import Data.Attoparsec.ByteString.Char8 as AC
    ...
    myParser :: ...
    myParser ... = do
        ...
        -- parse a Word8 byte followed by an 8-bit Char
        w <- A8.anyWord8
        c <- AC.anyChar
        ...
--
    Viktor.
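A complete version of the idea above might look like the following minimal sketch. The name byteThenChar and the sample input are hypothetical, chosen only for illustration; the combinators themselves (anyWord8, anyChar, parseOnly) are the ones exported by attoparsec.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Attoparsec.ByteString       as A8
import qualified Data.Attoparsec.ByteString.Char8 as AC
import           Data.Word                        (Word8)

-- Hypothetical parser: one raw byte, then one byte viewed as a Char.
-- Both modules share the same Parser type, so the combinators mix freely.
byteThenChar :: A8.Parser (Word8, Char)
byteThenChar = do
  w <- A8.anyWord8   -- the byte itself, as a Word8
  c <- AC.anyChar    -- the next byte, converted to a Char in '\0'..'\255'
  pure (w, c)

main :: IO ()
main = print (A8.parseOnly byteThenChar "A\n")
-- prints: Right (65,'\n')
```

Either module's parseOnly can run the parser, since both operate on the same ByteString input.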

In reply to Viktor Dukhovni (17/2/2023, 2:08 p.m.):
Thanks for your answer, Viktor. I am now using the base ByteString combinators by default, and the Char8 combinators only when needed, as when I have to use AC.char or AC.string.
I was confused when I wanted to parse lines coming from the diffed files using "AC.takeTill AC.isEndOfLine". This does not type-check because AC.takeTill expects a predicate on Char, but AC.isEndOfLine is a predicate on Word8, even though it is defined in the Char8 module. Why? Now I am using A8.takeTill AC.isEndOfLine.
I was also worried about the warning in the Char8 module about truncated bytes. The output actually generated by diff should not have any problem, but the lines coming from the diffed files could be in any encoding. I assumed that AC.takeTill should not cause problems since it does not examine the ByteString except through the argument predicate. Anyway, now I am using A8.takeTill, as I mentioned.
Regards,
Pedro Borges
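The two ways of taking a line described above can be compared in a small sketch. The names rawLine and rawLineC and the sample bytes are hypothetical. The sketch relies on isEndOfLine having type Word8 -> Bool and on takeTill returning a slice of the input unchanged, so bytes outside ASCII pass through whatever encoding the diffed file uses.

```haskell
import qualified Data.Attoparsec.ByteString       as A8
import qualified Data.Attoparsec.ByteString.Char8 as AC
import           Data.ByteString                  (ByteString)
import qualified Data.ByteString                  as B

-- Hypothetical helper: the bytes of one line, without its terminator.
-- AC.isEndOfLine is a Word8 predicate, so it fits A8.takeTill as is.
rawLine :: A8.Parser ByteString
rawLine = A8.takeTill AC.isEndOfLine <* AC.endOfLine

-- The same parser through the Char8 API, which needs a Char predicate.
rawLineC :: A8.Parser ByteString
rawLineC = AC.takeTill (\c -> c == '\n' || c == '\r') <* AC.endOfLine

main :: IO ()
main = do
  -- "caf" followed by 0xE9 ('é' in Latin-1, not valid UTF-8 on its own)
  -- and a newline.
  let input = B.pack [0x63, 0x61, 0x66, 0xE9, 0x0A]
  print (A8.parseOnly rawLine input)
  print (A8.parseOnly rawLineC input)
  -- Both parsers return the same four bytes, 0xE9 included.
  print (A8.parseOnly rawLine input == A8.parseOnly rawLineC input)
```

In this sketch neither variant decodes or re-encodes the line, which is why the choice between them is mostly a matter of which predicate type is more convenient.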

You should probably use `Data.Attoparsec.ByteString`. Both let you do the same thing, but `Char8` just uses the wrong type (Chars ['\0'..'\255'] to represent bytes, i.e. Word8).

In reply to Li-yao Xia (17/2/2023, 3:12 p.m.):
Thanks, Li-yao. As I mentioned in my answer to Viktor, I am now using the ByteString functions except when I want to parse 8-bit characters, for example to parse an 'a' with Data.Attoparsec.ByteString.Char8.char 'a'.
Regards,
Pedro Borges

On Mon, Feb 20, 2023 at 10:46:38AM -0400, Pedro B. wrote:
> Thanks, Li-yao. As I mentioned in my answer to Viktor, I am now using the ByteString functions except when I want to parse Char8's, for example to parse an 'a' with Data.Attoparsec.ByteString.Char8.char 'a'.
FWIW, you can often avoid the Char8 combinators. For example, to match a specific 8-bit (ASCII) character you can, at a modest loss of readability, just match its Word8 code point:
    0x0a <--- '\n'
    0x0d <--- '\r'
    0x20 <--- ' '
    0x30 <--- '0'
    0x41 <--- 'A'
    0x61 <--- 'a'
    ...
I am comfortable with the raw hex values of various "interesting" characters, but you can also define aliases:
    import Data.Char (ord)
    import Data.Word (Word8)
    char_nl, char_cr, char_sp, char_0, char_A, char_a :: Word8
    char_nl = fromIntegral $ ord '\n'
    char_cr = fromIntegral $ ord '\r'
    char_sp = fromIntegral $ ord ' '
    ...
--
    Viktor.
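As an illustration of such aliases in use, here is a minimal sketch that recognises the "< " and "> " prefixes of changed lines. It assumes diff's normal output format; the names byteOf, Side and sideMarker are hypothetical and not part of any library.

```haskell
import           Control.Applicative        ((<|>))
import qualified Data.Attoparsec.ByteString as A8
import qualified Data.ByteString            as B
import           Data.Char                  (ord)
import           Data.Word                  (Word8)

-- Hypothetical helper: the byte for an ASCII character.
byteOf :: Char -> Word8
byteOf = fromIntegral . ord

charLt, charGt, charSp :: Word8
charLt = byteOf '<'   -- 0x3c
charGt = byteOf '>'   -- 0x3e
charSp = byteOf ' '   -- 0x20

-- Which file a changed line comes from (assumed normal diff format).
data Side = Old | New deriving (Eq, Show)

-- "< " prefixes a line from the first file, "> " one from the second.
sideMarker :: A8.Parser Side
sideMarker =
  (Old <$ A8.word8 charLt <|> New <$ A8.word8 charGt) <* A8.word8 charSp

main :: IO ()
main = do
  print (A8.parseOnly sideMarker (B.pack [charLt, charSp, 0x68, 0x69]))
  print (A8.parseOnly sideMarker (B.pack [charGt, charSp, 0x68, 0x69]))
```

Only Word8 combinators are involved here, so no Char conversion takes place at any point.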

In reply to Viktor Dukhovni (20/2/2023, 1:43 p.m.):
I am using the Data.Word8 module provided by the word8 package, which defines _lf, _tab, _cr, and so on, and even _a.._z, _0.._9, etc. For example, I may use (==_tab) as the argument for Data.Attoparsec.ByteString.takeTill.
You made me realize that I can use "word8 _a" instead of "char 'a'" and have almost no need for the Char8 combinators. I'll probably do that and only use "decimal" from Char8 to parse integers, which I need to parse line ranges such as "2,10".
I still have a doubt though: given that I only match specific characters generated by diff, do I gain something by not using Char8? Performance, perhaps?
Regards,
Pedro
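The line-range case mentioned above could be sketched as follows. The Range type and the range parser are hypothetical, and the sketch writes the comma as a literal byte (0x2c) rather than importing it from the word8 package, only to avoid an extra dependency. It assumes a range is either a single line number or two numbers separated by a comma.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Applicative              (optional)
import qualified Data.Attoparsec.ByteString       as A8
import qualified Data.Attoparsec.ByteString.Char8 as AC
import           Data.Maybe                       (fromMaybe)

-- Hypothetical type: an inclusive line range such as "2,10" or just "7".
data Range = Range Int Int deriving (Eq, Show)

-- decimal comes from the Char8 module; the separator is matched as a byte.
range :: A8.Parser Range
range = do
  start <- AC.decimal
  end   <- optional (A8.word8 0x2c *> AC.decimal)  -- 0x2c is ','
  pure (Range start (fromMaybe start end))

main :: IO ()
main = do
  print (A8.parseOnly range "2,10")  -- prints: Right (Range 2 10)
  print (A8.parseOnly range "7")     -- prints: Right (Range 7 7)
```

Treating a single number as a one-line range is a design choice of this sketch; keeping the optional end as a Maybe would work just as well.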

On Mon, Feb 20, 2023 at 03:58:10PM -0400, Pedro B. wrote:
> You made me realize that I can use "word8 _a" instead of "char 'a'" and have almost no need for the Char8 combinators. I'll probably do that and only use "decimal" from Char8 to parse integers, which I need to parse line ranges such as "2,10".
That's the sort of thing I was hinting at.
> I still have a doubt though: given that I only match specific characters generated by diff, do I gain something by not using Char8? Performance, perhaps?
In most cases, performance gains are likely to be marginal at best. The compiler should be able to optimise away most of the possible overhead of Char<->Word8 conversions, but perhaps in some corner cases you might see better performance. Write maintainable code, and only if profiling shows opportunities for meaningful performance gains consider optimisations that might make the code less clear.
--
    Viktor.

In reply to Viktor Dukhovni (20/2/2023, 4:38 p.m.):
Thanks a lot for your help!
Regards,
Pedro Borges
participants (3)
- Li-yao Xia
- Pedro B.
- Viktor Dukhovni