parsing currency amounts with parsec

Hi everyone, I am relatively new to Haskell and Parsec, and I couldn't find any articles on parsing numbers in the following format: Positive: $115.33 Negative: ($1,323.42) I'm working on the parser for practical purposes (to convert a 3rd-party generated, most unhelpful format into one I can use), and I'd really appreciate any insight into a better way to do this, or if there are any built-in functions/established libraries that would be better suited to the task. My code below works, but doesn't seem terribly efficient. Thanks! Eric ------------------------------------------------- {- parses positive and negative dollar amounts -} integer :: CharParser st Integer integer = PT.integer lexer float :: CharParser st Double float = PT.float lexer currencyAmount = try negAmount <|> posAmount negAmount = do char '(' char '$' a <- currency char ')' return (negate a) posAmount = do char '$' a <- currency return a currency = do parts <- many floatOrSep let result = combine orderedParts where combine = sumWithFactor 1 orderedParts = reverse parts return result floatOrSep = try float <|> beforeSep beforeSep = do a <- integer char ',' return (fromIntegral a :: Double) sumWithFactor n [] = 0 sumWithFactor n (x:xs) = n * x + next where next = sumWithFactor (n*1000) xs

On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen
Hi everyone,
I am relatively new to Haskell and Parsec, and I couldn't find any articles on parsing numbers in the following format:
Positive: $115.33 Negative: ($1,323.42)
I'm working on the parser for practical purposes (to convert a 3rd-party generated, most unhelpful format into one I can use), and I'd really appreciate any insight into a better way to do this, or if there are any built-in functions/established libraries that would be better suited to the task. My code below works, but doesn't seem terribly efficient.
Why do you think it inefficient? Is it slow? I don't have any substantial suggestions, but from a style perspective: * I would question the use of IEEE binary-floating-point number types for currency. Haskell ships with a fixed-point decimal library, but I don't know how fast it is: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed.... * You can re-use the 'positive' parser in the 'negative' one:
negAmount = do char '(' a <- posAmount char ')' return (negate a)
* In the 'currencyAmount' declaration, I'm not sure that you need the 'try' before the 'negAmount', but I don't know what the rest of your grammar looks like.
Thanks! Eric
------------------------------------------------- {- parses positive and negative dollar amounts -}
integer :: CharParser st Integer integer = PT.integer lexer
float :: CharParser st Double float = PT.float lexer
currencyAmount = try negAmount <|> posAmount
negAmount = do char '(' char '$' a <- currency char ')' return (negate a)
posAmount = do char '$' a <- currency return a
currency = do parts <- many floatOrSep let result = combine orderedParts where combine = sumWithFactor 1 orderedParts = reverse parts return result
floatOrSep = try float <|> beforeSep
beforeSep = do a <- integer char ',' return (fromIntegral a :: Double)
sumWithFactor n [] = 0 sumWithFactor n (x:xs) = n * x + next where next = sumWithFactor (n*1000) xs
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 5/9/11 10:04 PM, Antoine Latter wrote:
On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen
wrote: Hi everyone,
I am relatively new to Haskell and Parsec, and I couldn't find any articles on parsing numbers in the following format:
Positive: $115.33 Negative: ($1,323.42)
I'm working on the parser for practical purposes (to convert a 3rd-party generated, most unhelpful format into one I can use), and I'd really appreciate any insight into a better way to do this, or if there are any built-in functions/established libraries that would be better suited to the task. My code below works, but doesn't seem terribly efficient.
Why do you think it inefficient? Is it slow?
I don't have any substantial suggestions, but from a style perspective:
* I would question the use of IEEE binary-floating-point number types for currency. Haskell ships with a fixed-point decimal library, but I don't know how fast it is: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed....
There are also a few other options (that I can't seem to find links to at the moment), and some hints on how to do it yourself: http://augustss.blogspot.com/2007/04/overloading-haskell-numbers-part-3.html I'm not sure if your choice of parsing library is fixed or not, but you could probably speed things up significantly by using Attoparsec. In particular, Attoparsec's combinators for takeWhile, takeWhile1,scan,... return bytestrings, and you can then fold over the bytestring quite nicely. If your source is ASCII or anything ASCII compatible (Latin-1, Latin-9, UTF-8,...) then take a look at Data.Attoparsec.Char8.decimal[1][2]. Unless you need to verify that commas occur exactly every third digit, I'd suggest (a) dropping commas while scanning the string, or (b) implicitly dropping commas while folding over the string. If you do need to verify this, then your best option is (a), since you can maintain a state machine about how many digits seen since the last comma. Once you're using the Attoparsec strategy of folding over the raw byte buffers, then the only room for improvement is going to be making the code as straight-line as possible. Some untested example code: import qualified Data.Attoparsec as A import qualified Data.Attoparsec.Char8 as A8 import qualified Data.ByteString as B rawCurrency :: A.Parser (ByteString,ByteString) rawCurrency = do dollars <- A.scan 0 step _ <- A.char '.' -- Assuming it's required... cents <- A.takeWhile1 isDigit_w8 -- Assuming no commas... return (dollars,cents) where step :: Int -> Word8 -> Maybe Int step 3 0x2C = Just 0 step s c | isDigit_w8 c = Just $! s+1 step _ _ = Nothing -- Note: the order of comparisons is part of why it's fast. -- | A fast digit predicate. isDigit_w8 :: Word8 -> Bool isDigit_w8 w = (w <= 0x39 && w >= 0x30) {-# INLINE isDigit_w8 #-} -- With the dots filled in by whatever representation you use. currency :: A.Parser ... currency = do (dollars,cents) <- rawCurrency let step a w = a * 10 + fromIntegral (w - 0x30) d = B.foldl' step 0 (B.filter (/= 0x2C) dollars) c = fromIntegral (B.foldl' step 0 cents) / (10 ^ length cents) return (... d ... c ...) amount :: A.Parser ... amount = pos <|> neg where pos = A8.char '$' *> currency neg = do _ <- A8.string "($" a <- currency _ <- A8.char ')' return (negate a) [1] And if you're using Attoparsec itself, you may want to take a look at Data.Attoparsec.Zepto as well. [2] If you're basing code on Attoparsec, you may want to look at some of my pending patches which improve performance on this (already extremely fast) code: https://bitbucket.org/winterkoninkje/attoparsec/changesets -- Live well, ~wren

I'll check out Attoparsec, thanks! My first attempt may work for this
particular task, but I'm warming up for a more intense parsing project and
it sounds like Attoparsec with Bytestrings may work best.
Also, just in case anyone reads this thread later and is looking for a quick
Parsec solution, I discovered that the code I posted initially was a bit
greedy in a bad way if the dollar amount was at the end of the line. I got
rid of the original currency, floatOrSep, and beforeSep functions and
replaced them with the code below (still verbose, but hopefully a better
starting point for now).
------------------------------------------------------------------------
double = do i <- integer
return (fromIntegral i :: Double)
currency = try float <|> largeAmount
largeAmount = do first <- double
rest <- many afterSep
let parts = first : rest
let result = combine orderedParts where
combine = sumWithFactor 1
orderedParts = reverse parts
return result
afterSep = do char ','
try float <|> double
------------------------------------------------------------------------
On Mon, May 9, 2011 at 8:15 PM, wren ng thornton
On 5/9/11 10:04 PM, Antoine Latter wrote:
On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen
wrote: Hi everyone,
I am relatively new to Haskell and Parsec, and I couldn't find any articles on parsing numbers in the following format:
Positive: $115.33 Negative: ($1,323.42)
I'm working on the parser for practical purposes (to convert a 3rd-party generated, most unhelpful format into one I can use), and I'd really appreciate any insight into a better way to do this, or if there are any built-in functions/established libraries that would be better suited to the task. My code below works, but doesn't seem terribly efficient.
Why do you think it inefficient? Is it slow?
I don't have any substantial suggestions, but from a style perspective:
* I would question the use of IEEE binary-floating-point number types for currency. Haskell ships with a fixed-point decimal library, but I don't know how fast it is:
http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed....
There are also a few other options (that I can't seem to find links to at the moment), and some hints on how to do it yourself:
http://augustss.blogspot.com/2007/04/overloading-haskell-numbers-part-3.html
I'm not sure if your choice of parsing library is fixed or not, but you could probably speed things up significantly by using Attoparsec. In particular, Attoparsec's combinators for takeWhile, takeWhile1,scan,... return bytestrings, and you can then fold over the bytestring quite nicely. If your source is ASCII or anything ASCII compatible (Latin-1, Latin-9, UTF-8,...) then take a look at Data.Attoparsec.Char8.decimal[1][2].
Unless you need to verify that commas occur exactly every third digit, I'd suggest (a) dropping commas while scanning the string, or (b) implicitly dropping commas while folding over the string. If you do need to verify this, then your best option is (a), since you can maintain a state machine about how many digits seen since the last comma. Once you're using the Attoparsec strategy of folding over the raw byte buffers, then the only room for improvement is going to be making the code as straight-line as possible.
Some untested example code:
import qualified Data.Attoparsec as A import qualified Data.Attoparsec.Char8 as A8 import qualified Data.ByteString as B
rawCurrency :: A.Parser (ByteString,ByteString) rawCurrency = do dollars <- A.scan 0 step _ <- A.char '.' -- Assuming it's required... cents <- A.takeWhile1 isDigit_w8 -- Assuming no commas... return (dollars,cents) where step :: Int -> Word8 -> Maybe Int step 3 0x2C = Just 0 step s c | isDigit_w8 c = Just $! s+1 step _ _ = Nothing
-- Note: the order of comparisons is part of why it's fast. -- | A fast digit predicate. isDigit_w8 :: Word8 -> Bool isDigit_w8 w = (w <= 0x39 && w >= 0x30) {-# INLINE isDigit_w8 #-}
-- With the dots filled in by whatever representation you use. currency :: A.Parser ... currency = do (dollars,cents) <- rawCurrency let step a w = a * 10 + fromIntegral (w - 0x30) d = B.foldl' step 0 (B.filter (/= 0x2C) dollars) c = fromIntegral (B.foldl' step 0 cents) / (10 ^ length cents) return (... d ... c ...)
amount :: A.Parser ... amount = pos <|> neg where pos = A8.char '$' *> currency neg = do _ <- A8.string "($" a <- currency _ <- A8.char ')' return (negate a)
[1] And if you're using Attoparsec itself, you may want to take a look at Data.Attoparsec.Zepto as well.
[2] If you're basing code on Attoparsec, you may want to look at some of my pending patches which improve performance on this (already extremely fast) code:
https://bitbucket.org/winterkoninkje/attoparsec/changesets
-- Live well, ~wren
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

* Eric Rasmussen
Hi everyone,
I am relatively new to Haskell and Parsec, and I couldn't find any articles on parsing numbers in the following format:
You could read hledger[1] sources for inspiration: it's written in Haskell and contains some (quite generic) currency parsing. [1]: http://hledger.org/ -- Roman I. Cheplyaka :: http://ro-che.info/ Don't worry what people think, they don't do it very often.

On 5/10/11 2:52 PM, Roman Cheplyaka wrote:
You could read hledger[1] sources for inspiration: it's written in Haskell and contains some (quite generic) currency parsing.
Hi Eric.. here's the code in question: http://hackage.haskell.org/packages/archive/hledger-lib/0.14/doc/html/src/Hl... and some related docs: http://hledger.org/MANUAL.html#amounts It's probably more complicated and less efficient than you need, but a source of ideas.

Very helpful -- thanks everyone! The handling of currency amounts in hledger
is what I was looking for in terms of alternate ways to parse and represent
dollar amounts in Haskell.
On Wed, May 11, 2011 at 6:05 PM, Simon Michael
On 5/10/11 2:52 PM, Roman Cheplyaka wrote:
You could read hledger[1] sources for inspiration: it's written in Haskell and contains some (quite generic) currency parsing.
Hi Eric.. here's the code in question:
http://hackage.haskell.org/packages/archive/hledger-lib/0.14/doc/html/src/Hl...
and some related docs:
http://hledger.org/MANUAL.html#amounts
It's probably more complicated and less efficient than you need, but a source of ideas.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (5)
-
Antoine Latter
-
Eric Rasmussen
-
Roman Cheplyaka
-
Simon Michael
-
wren ng thornton