
On 5/9/11 10:04 PM, Antoine Latter wrote:
On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen
wrote: Hi everyone,
I am relatively new to Haskell and Parsec, and I couldn't find any articles on parsing numbers in the following format:
Positive: $115.33 Negative: ($1,323.42)
I'm working on the parser for practical purposes (to convert a 3rd-party generated, most unhelpful format into one I can use), and I'd really appreciate any insight into a better way to do this, or if there are any built-in functions/established libraries that would be better suited to the task. My code below works, but doesn't seem terribly efficient.
Why do you think it inefficient? Is it slow?
I don't have any substantial suggestions, but from a style perspective:
* I would question the use of IEEE binary-floating-point number types for currency. Haskell ships with a fixed-point decimal library, but I don't know how fast it is: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed....
There are also a few other options (that I can't seem to find links to at the moment), and some hints on how to do it yourself: http://augustss.blogspot.com/2007/04/overloading-haskell-numbers-part-3.html I'm not sure if your choice of parsing library is fixed or not, but you could probably speed things up significantly by using Attoparsec. In particular, Attoparsec's combinators for takeWhile, takeWhile1,scan,... return bytestrings, and you can then fold over the bytestring quite nicely. If your source is ASCII or anything ASCII compatible (Latin-1, Latin-9, UTF-8,...) then take a look at Data.Attoparsec.Char8.decimal[1][2]. Unless you need to verify that commas occur exactly every third digit, I'd suggest (a) dropping commas while scanning the string, or (b) implicitly dropping commas while folding over the string. If you do need to verify this, then your best option is (a), since you can maintain a state machine about how many digits seen since the last comma. Once you're using the Attoparsec strategy of folding over the raw byte buffers, then the only room for improvement is going to be making the code as straight-line as possible. Some untested example code: import qualified Data.Attoparsec as A import qualified Data.Attoparsec.Char8 as A8 import qualified Data.ByteString as B rawCurrency :: A.Parser (ByteString,ByteString) rawCurrency = do dollars <- A.scan 0 step _ <- A.char '.' -- Assuming it's required... cents <- A.takeWhile1 isDigit_w8 -- Assuming no commas... return (dollars,cents) where step :: Int -> Word8 -> Maybe Int step 3 0x2C = Just 0 step s c | isDigit_w8 c = Just $! s+1 step _ _ = Nothing -- Note: the order of comparisons is part of why it's fast. -- | A fast digit predicate. isDigit_w8 :: Word8 -> Bool isDigit_w8 w = (w <= 0x39 && w >= 0x30) {-# INLINE isDigit_w8 #-} -- With the dots filled in by whatever representation you use. currency :: A.Parser ... currency = do (dollars,cents) <- rawCurrency let step a w = a * 10 + fromIntegral (w - 0x30) d = B.foldl' step 0 (B.filter (/= 0x2C) dollars) c = fromIntegral (B.foldl' step 0 cents) / (10 ^ length cents) return (... d ... c ...) amount :: A.Parser ... amount = pos <|> neg where pos = A8.char '$' *> currency neg = do _ <- A8.string "($" a <- currency _ <- A8.char ')' return (negate a) [1] And if you're using Attoparsec itself, you may want to take a look at Data.Attoparsec.Zepto as well. [2] If you're basing code on Attoparsec, you may want to look at some of my pending patches which improve performance on this (already extremely fast) code: https://bitbucket.org/winterkoninkje/attoparsec/changesets -- Live well, ~wren