On Mon, May 9, 2011 at 8:15 PM, wren ng thornton <wren@freegeek.org> wrote:

On 5/9/11 10:04 PM, Antoine Latter wrote:

On Mon, May 9, 2011 at 5:07 PM, Eric Rasmussen<ericrasmussen@gmail.com> wrote:

Hi everyone,

I am relatively new to Haskell and Parsec, and I couldn't find any articles
on parsing numbers in the following format:

Positive: $115.33
Negative: ($1,323.42)

I'm working on the parser for practical purposes (to convert a 3rd-party
generated, most unhelpful format into one I can use), and I'd really
appreciate any insight into a better way to do this, or if there are any
built-in functions/established libraries that would be better suited to the
task. My code below works, but doesn't seem terribly efficient.

Why do you think it inefficient? Is it slow?

I don't have any substantial suggestions, but from a style perspective:

* I would question the use of IEEE binary-floating-point number types
for currency. Haskell ships with a fixed-point decimal library, but I
don't know how fast it is:
http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-Fixed.html

There are also a few other options (that I can't seem to find links to at the moment), and some hints on how to do it yourself:

http://augustss.blogspot.com/2007/04/overloading-haskell-numbers-part-3.html

I'm not sure if your choice of parsing library is fixed or not, but you could probably speed things up significantly by using Attoparsec. In particular, Attoparsec's combinators for takeWhile, takeWhile1,scan,... return bytestrings, and you can then fold over the bytestring quite nicely. If your source is ASCII or anything ASCII compatible (Latin-1, Latin-9, UTF-8,...) then take a look at Data.Attoparsec.Char8.decimal[1][2].

Unless you need to verify that commas occur exactly every third digit, I'd suggest (a) dropping commas while scanning the string, or (b) implicitly dropping commas while folding over the string. If you do need to verify this, then your best option is (a), since you can maintain a state machine about how many digits seen since the last comma. Once you're using the Attoparsec strategy of folding over the raw byte buffers, then the only room for improvement is going to be making the code as straight-line as possible.

Some untested example code:

import qualified Data.Attoparsec as A
import qualified Data.Attoparsec.Char8 as A8
import qualified Data.ByteString as B

rawCurrency :: A.Parser (ByteString,ByteString)
rawCurrency = do
dollars <- A.scan 0 step
_ <- A.char '.' -- Assuming it's required...
cents <- A.takeWhile1 isDigit_w8 -- Assuming no commas...
return (dollars,cents)
where
step :: Int -> Word8 -> Maybe Int
step 3 0x2C = Just 0
step s c | isDigit_w8 c = Just $! s+1
step _ _ = Nothing

-- Note: the order of comparisons is part of why it's fast.
-- | A fast digit predicate.
isDigit_w8 :: Word8 -> Bool
isDigit_w8 w = (w <= 0x39 && w >= 0x30)
{-# INLINE isDigit_w8 #-}

-- With the dots filled in by whatever representation you use.
currency :: A.Parser ...
currency = do
(dollars,cents) <- rawCurrency
let step a w = a * 10 + fromIntegral (w - 0x30)
d = B.foldl' step 0 (B.filter (/= 0x2C) dollars)
c = fromIntegral (B.foldl' step 0 cents)
/ (10 ^ length cents)
return (... d ... c ...)

amount :: A.Parser ...
amount = pos <|> neg
where
pos = A8.char '$' *> currency
neg = do
_ <- A8.string "($"
a <- currency
_ <- A8.char ')'
return (negate a)

[1] And if you're using Attoparsec itself, you may want to take a look at Data.Attoparsec.Zepto as well.

[2] If you're basing code on Attoparsec, you may want to look at some of my pending patches which improve performance on this (already extremely fast) code:

https://bitbucket.org/winterkoninkje/attoparsec/changesets

--
Live well,
~wren

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe