Correct parsers for bounded integral values
Hello, I'd like to bring to your attention a discussion that I have started over at Haskell-cafe [1]. I was complaining about the silent overflow of parsers for bounded integers: > read "298" :: Word8 42 I find this unsatisfying, and I have demonstrated a solution [2] that seems correct and performant. To avoid cross posting, I'd rather point you to the discussion on the Haskell-cafe mailing list [1]. I would like to help get this into GHC and also other parsing libraries. Not all libraries suffer from this issue, but some do, including GHC and its `base` package. How can I attempt this? Should I file a bug report? Or try to follow the instructions at ghc.dev [3] and submit a pull request? I would start tinkering at [this point][4] as soon as I manage to compile GHC. Would you want to share any advice beforehand? I'd really appreciate to be stopped before wasting effort ;-) Also, I probably only can work on this on weekends. Cheers Stefan [1]: https://mail.haskell.org/pipermail/haskell-cafe/2025-July/137134.html [2]: https://github.com/s5k6/robust-int [3]: https://ghc.dev/ [4]: https://hackage.haskell.org/package/ghc-internal-9.1201.0/docs/src/GHC.Inter... -- Stefan Klinger, Ph.D. -- computer scientist o/X http://stefan-klinger.de /\/ https://github.com/s5k6 \ I prefer receiving plain text messages, not exceeding 32kB.
Hi Stefan, this looks like great (and well explained) work! To modify the behavior of existing libraries you need to tackle the change per library and upstream it by opening a ticket on the associated issue tracker and discussing with the developers/maintainers. For `base` itself there is a standardized process to introduce changes to the behavior of the standard library. It’s called a Core Libraries Committee (CLC) Proposal[1]. Since you have already written up very precisely the problem, existing behavior, and proposed change, you should be able to copy it relatively verbatim to a ticket on the CLC to begin an official discussion. That said, it is very good you have posted it to haskell-cafe and to ghc-devs, as that hopefully makes a wider audience interested in your proposal and gives them a chance to discuss it publicly before starting the formal process of a proposal. Regarding the implementation for `base` specifically, you can indeed open a bug report to discuss with ghc and core libraries devs the issue and implementation — that said, the final decision regarding changes to `base` lies with the CLC and has to go through the proposal process. Typically these kinds of tickets also serve to track the upstream CLC proposal once it’s open. Submitting a MR sounds like a good idea as well. You may find it interesting to get GHC compiling, your change applied, and see how CI responds to it :). The place you pointed at in the `base` module looks about right. By the way, if you find any trouble with compiling GHC or applying your change, do reach out and we can help you get it working. You can ping me (@alt-romes) on your Merge Request, or e.g. shoot an email here. I don’t have a strong opinion regarding `read` for `Word8`, so I hope other devs may comment on the contents of your proposal more specifically. Good luck! Rodrigo Mesquita [1]: https://github.com/haskell/core-libraries-committee
On 20 Jul 2025, at 20:12, Stefan Klinger
wrote: Hello,
I'd like to bring to your attention a discussion that I have started over at Haskell-cafe [1]. I was complaining about the silent overflow of parsers for bounded integers:
read "298" :: Word8 42
I find this unsatisfying, and I have demonstrated a solution [2] that seems correct and performant.
To avoid cross posting, I'd rather point you to the discussion on the Haskell-cafe mailing list [1].
I would like to help get this into GHC and also other parsing libraries. Not all libraries suffer from this issue, but some do, including GHC and its `base` package.
How can I attempt this? Should I file a bug report? Or try to follow the instructions at ghc.dev [3] and submit a pull request? I would start tinkering at [this point][4] as soon as I manage to compile GHC.
Would you want to share any advice beforehand? I'd really appreciate to be stopped before wasting effort ;-) Also, I probably only can work on this on weekends.
Cheers Stefan
[1]: https://mail.haskell.org/pipermail/haskell-cafe/2025-July/137134.html [2]: https://github.com/s5k6/robust-int [3]: https://ghc.dev/ [4]: https://hackage.haskell.org/package/ghc-internal-9.1201.0/docs/src/GHC.Inter...
-- Stefan Klinger, Ph.D. -- computer scientist o/X http://stefan-klinger.de /\/ https://github.com/s5k6 \ I prefer receiving plain text messages, not exceeding 32kB. _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
On Sun, Jul 20, 2025 at 09:12:20PM +0200, Stefan Klinger wrote:
I'd like to bring to your attention a discussion that I have started over at Haskell-cafe [1]. I was complaining about the silent overflow of parsers for bounded integers:
> read "298" :: Word8 42
FWIW, there haven't AFAIK any complaints about ByteString's readInt, readWord, readInteger, readNatural and various sized variants having overflow checks. But these have always been more like `reads` than `read`, returning `Maybe (a, ByteString)`, so perhaps somewhat more oriented towards detecting unexpected excess input, as well as for some time now range overflow. So there's some precedent for overflow checking, but... It is also fair to point out that once an Int or other bounded integral type is read, arithmetic with that type (addition, subtraction and multiplication) silently overflows. And so silent overflow in `read` is not inconsistent with the type's semantics. If converting strings to numbers is in support of string-oriented network protocols (e.g. the SIZE ESMTP extension), then one really should make an effort to avoid silent overflow, but in that context the various ByteString read methods are already available. That said, if various middleware libraries hide overflows, because under the covers thay're using `read`, that could be a problem, so we do want the ecosystem at large to make sensible choices about when silent overflow may or may not be appropriate. Perhaps that means having both wrapping and overflow-checked implementations available, and clear docs with each about its behaviour and the corresponding alternative.
I find this unsatisfying, and I have demonstrated a solution [2] that seems correct and performant.
A few of quick observations about [2]: - It disallows expliccit leading "+" (just like "read", but perhaps that should be tolerated). - It disallows multiple leading zeros, perhaps these should be tolerated. - It disallows "-0", perhaps these should be tolerated, as well as "-0000", "-000001", ... (With lazy ByteStrings, which might never terminate, there is a generous, but sensible limit on the number of leading zeros allowed). - One way to avoid difficulties with handling negative minBound is to parse signed values via the corresponding unsigned type, which can accommodate `-minBound` as a positive value, and then negate the final result. This makse possible sharing the low-level digit-by-digit code between the positive and negative cases. If parsing of Integer and Natual is also in scope, I would expect that it avoids doing multi-precision arithmetic for each digit, parsing groups of digits into ~Word sized blocks, and merge the blocks hierarchically with only a logarithmic number of MP multiplies. -- Viktor.
Thanks for the encouragement Rodrigo! I'll follow the process and hope to open a ticket soon. Viktor Dukhovni (2025-Jul-21, excerpt):
It is also fair to point out that once an Int or other bounded integral type is read, arithmetic with that type (addition, subtraction and multiplication) silently overflows. And so silent overflow in `read` is not inconsistent with the type's semantics.
I see parsing as a boundary between an outside world (throwing text at me) and an inside world, where I have programmed some algorithm. As programmer, it is my responsibility to ensure that the types are chosen so that the algorithm works correctly, ideally on any accepted input, i.e., I have to guarantee that no inadvertent overflow happens in this inside world. However, calculating away based on misinterpreted input, will lead to invalid results. Viktor Dukhovni (2025-Jul-21, excerpt):
That said, if various middleware libraries hide overflows, because under the covers thay're using `read`, that could be a problem, so we do want the ecosystem at large to make sensible choices about when silent overflow may or may not be appropriate. Perhaps that means having both wrapping and overflow-checked implementations available, and clear docs with each about its behaviour and the corresponding alternative.
I did not realise this clearly enough before, but have elaborated a bit on Haskell-cafe [1]. We do have unbounded `read :: String -> Integer` and silently overflowing `fromInteger :: Integer -> Word8`, which can be combined if overflow is desired. This follows the idea to be explicit about dangerous things. In addition, we have `read :: String -> Word8` and company, which I'd like to fix.
A few of quick observations about [2]:
Thank you =)
- It disallows expliccit leading "+" (just like "read", but perhaps that should be tolerated).
Yes, it probably should not be that strict. For my own projects I assumed it easier to make it more forgiving later, than the other way round. There really should be consensus on whether or not leading `+` or `0` should be allowed. But these are fixes to make towards the end, I guess.
- It disallows multiple leading zeros, perhaps these should be tolerated.
- It disallows "-0", perhaps these should be tolerated, as well as "-0000", "-000001", ... (With lazy ByteStrings, which might never terminate, there is a generous, but sensible limit on the number of leading zeros allowed).
I ruled this out because I wanted a simple guarantee for termination. Your idea of “generous, but sensible” sounds compelling, the leading `0`s can be cosumed in constant space, we need not keep them.
- One way to avoid difficulties with handling negative minBound is to parse signed values via the corresponding unsigned type, which can accommodate `-minBound` as a positive value, and then negate the final result. This makse possible sharing the low-level digit-by-digit code between the positive and negative cases.
How do you mean? I did not get this “accommodate `-minBound` as a positive value” right, my initial approach to use char '-' >> negate <$> parseUnsigned (negate minBound) fails, exactly because the negation of the lower bound may not be (read: is usually not) within the upper bound, and thus wraps around, e.g., incorrectly `negate (minBound :: Int8)` → `-128` due to the upper bound of `127`. Viktor Dukhovni (2025-Jul-21, excerpt):
If parsing of Integer and Natual is also in scope […]
No, not at all. I have no reservations against `read` for the unbounded types. That should be left alone. Cheers Stefan [1]: https://mail.haskell.org/pipermail/haskell-cafe/2025-July/137162.html [2]: https://github.com/s5k6/robust-int -- Stefan Klinger, Ph.D. -- computer scientist o/X http://stefan-klinger.de /\/ https://github.com/s5k6 \ I prefer receiving plain text messages, not exceeding 32kB.
On Mon, Jul 21, 2025 at 05:56:46PM +0200, Stefan Klinger wrote:
- One way to avoid difficulties with handling negative minBound is to parse signed values via the corresponding unsigned type, which can accommodate `-minBound` as a positive value, and then negate the final result. This makse possible sharing the low-level digit-by-digit code between the positive and negative cases.
How do you mean? I did not get this “accommodate `-minBound` as a positive value” right, my initial approach to use
char '-' >> negate <$> parseUnsigned (negate minBound)
fails, exactly because the negation of the lower bound may not be (read: is usually not) within the upper bound, and thus wraps around, e.g., incorrectly `negate (minBound :: Int8)` → `-128` due to the upper bound of `127`.
Timing is everything, you're trying to do the negation while the value is still signed, instead it is necessary to convert it to a Natural while also keeping track of the sign! {-# LANGUAGE RequiredTypeArguments #-} -- Ideally, compute both the sign and the absolute value in one go. minBoundAbs :: forall a -> (Bounded a, Integral a) => Natural minBoundAbs a = fromIntegral @Integer $ abs $ fromIntegral @a minBound maxBoundAbs :: forall a -> (Bounded a, Integral a) => Natural maxBoundAbs a = fromIntegral @Integer $ abs $ fromIntegral @a maxBound isNegative :: forall a -> (Bounded a, Integral a) => a -> Bool isNegative x = fromIntegral @Integer x < 0 That said, the Bytestring code does not do that, rather it simply knows that the absolute values of minBound and maxBound of the 8, 16, 32 and 64 bit signed integral types differ only in their last digit, and the last digit of minBound is always 8, while for maxBound it is always 7. For the unsigned types the last digit is always 5. Since you're writing polymorphic parser code, rather than separate logical functions for each of the fundamental integral types, you'll need to take the high road and compute the absolute value of each bound as above while keeping track of its sign. To handle arbitrary Bounded, Integral types, the logic gets a bit more complex, because the upper bound could also be negative, or the lower bound could be non-negative, requiring some care in overflow safe: safeRead :: (Bounded a, Integral a) => String -> a safeReads :: (Bounded a, Integral a) => String -> [(a, String)] -- VIktor.
For base introducing a new function `readBoundedNum :: (Bounded a, Num a) => String -> a` or similar seems very reasonable to me. Changing "read" to throw an exception or similar after decades less so. On 20/07/2025 22:08, Viktor Dukhovni wrote:
On Sun, Jul 20, 2025 at 09:12:20PM +0200, Stefan Klinger wrote:
I'd like to bring to your attention a discussion that I have started over at Haskell-cafe [1]. I was complaining about the silent overflow of parsers for bounded integers:
> read "298" :: Word8 42 FWIW, there haven't AFAIK any complaints about ByteString's readInt, readWord, readInteger, readNatural and various sized variants having overflow checks. But these have always been more like `reads` than `read`, returning `Maybe (a, ByteString)`, so perhaps somewhat more oriented towards detecting unexpected excess input, as well as for some time now range overflow. So there's some precedent for overflow checking, but...
It is also fair to point out that once an Int or other bounded integral type is read, arithmetic with that type (addition, subtraction and multiplication) silently overflows. And so silent overflow in `read` is not inconsistent with the type's semantics.
If converting strings to numbers is in support of string-oriented network protocols (e.g. the SIZE ESMTP extension), then one really should make an effort to avoid silent overflow, but in that context the various ByteString read methods are already available.
That said, if various middleware libraries hide overflows, because under the covers thay're using `read`, that could be a problem, so we do want the ecosystem at large to make sensible choices about when silent overflow may or may not be appropriate. Perhaps that means having both wrapping and overflow-checked implementations available, and clear docs with each about its behaviour and the corresponding alternative.
I find this unsatisfying, and I have demonstrated a solution [2] that seems correct and performant. A few of quick observations about [2]:
- It disallows expliccit leading "+" (just like "read", but perhaps that should be tolerated).
- It disallows multiple leading zeros, perhaps these should be tolerated.
- It disallows "-0", perhaps these should be tolerated, as well as "-0000", "-000001", ... (With lazy ByteStrings, which might never terminate, there is a generous, but sensible limit on the number of leading zeros allowed).
- One way to avoid difficulties with handling negative minBound is to parse signed values via the corresponding unsigned type, which can accommodate `-minBound` as a positive value, and then negate the final result. This makse possible sharing the low-level digit-by-digit code between the positive and negative cases.
If parsing of Integer and Natual is also in scope, I would expect that it avoids doing multi-precision arithmetic for each digit, parsing groups of digits into ~Word sized blocks, and merge the blocks hierarchically with only a logarithmic number of MP multiplies.
Hello, I also think that the instances for the bounded types are pretty unfortunate, but the change might have unintended consequences. I am not particularly opposed to it though. One thing to consider, though, is that it might be more productive to change the other parsing libraries (parsec, etc). For example, I almost never use ReadP for actual parsing that requires validation: it is slow and has no error reporting. I've only really used it for quick and dirty serialization in combination with Show, and there the problem is less likely to happen. Cheers, Iavor On Mon, Jul 21, 2025, 11:54 AM Andreas Klebinger via ghc-devs < ghc-devs@haskell.org> wrote:
For base introducing a new function `readBoundedNum :: (Bounded a, Num a) => String -> a` or similar seems very reasonable to me. Changing "read" to throw an exception or similar after decades less so.
On 20/07/2025 22:08, Viktor Dukhovni wrote:
On Sun, Jul 20, 2025 at 09:12:20PM +0200, Stefan Klinger wrote:
I'd like to bring to your attention a discussion that I have started over at Haskell-cafe [1]. I was complaining about the silent overflow of parsers for bounded integers:
> read "298" :: Word8 42 FWIW, there haven't AFAIK any complaints about ByteString's readInt, readWord, readInteger, readNatural and various sized variants having overflow checks. But these have always been more like `reads` than `read`, returning `Maybe (a, ByteString)`, so perhaps somewhat more oriented towards detecting unexpected excess input, as well as for some time now range overflow. So there's some precedent for overflow checking, but...
It is also fair to point out that once an Int or other bounded integral type is read, arithmetic with that type (addition, subtraction and multiplication) silently overflows. And so silent overflow in `read` is not inconsistent with the type's semantics.
If converting strings to numbers is in support of string-oriented network protocols (e.g. the SIZE ESMTP extension), then one really should make an effort to avoid silent overflow, but in that context the various ByteString read methods are already available.
That said, if various middleware libraries hide overflows, because under the covers thay're using `read`, that could be a problem, so we do want the ecosystem at large to make sensible choices about when silent overflow may or may not be appropriate. Perhaps that means having both wrapping and overflow-checked implementations available, and clear docs with each about its behaviour and the corresponding alternative.
I find this unsatisfying, and I have demonstrated a solution [2] that seems correct and performant. A few of quick observations about [2]:
- It disallows expliccit leading "+" (just like "read", but perhaps that should be tolerated).
- It disallows multiple leading zeros, perhaps these should be tolerated.
- It disallows "-0", perhaps these should be tolerated, as well as "-0000", "-000001", ... (With lazy ByteStrings, which might never terminate, there is a generous, but sensible limit on the number of leading zeros allowed).
- One way to avoid difficulties with handling negative minBound is to parse signed values via the corresponding unsigned type, which can accommodate `-minBound` as a positive value, and then negate the final result. This makse possible sharing the low-level digit-by-digit code between the positive and negative cases.
If parsing of Integer and Natual is also in scope, I would expect that it avoids doing multi-precision arithmetic for each digit, parsing groups of digits into ~Word sized blocks, and merge the blocks hierarchically with only a logarithmic number of MP multiplies.
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
participants (5)
-
Andreas Klebinger -
Iavor Diatchki -
Rodrigo Mesquita -
Stefan Klinger -
Viktor Dukhovni