Difference between Lazy ByteStrings and Strings in alex

It was suggested that I might derive some performance benefit from using lazy bytestrings in my tokenizer instead of regular strings. Here's the code that I've tried. Note that I've hacked the "basic" wrapper code in the Lazy version, so the code should be all but the same. The only thing I had to do out of the ordinary was write my own 'take' function instead of using the substring function provided by Data.ByteString.Lazy.Char8. The take function I used was derived from the one GHC uses in GHC.List and produces about the same code.

The non-lazy version runs in 38 seconds on a 211MB file versus the lazy version's 41 seconds. That doesn't seem like much, and in the non-lazy case I have to break the input up into multiple files (which takes no extra time), whereas the lazy version reads the input whole. Still, the seconds add up to a couple of hours by the time I'm done, so I'd like to understand why, when the consensus was that Data.ByteString.Lazy might give me better performance, it doesn't do so here.

I am running GHC 2.6 now, and am using -O3 as my optimization parameter. I'm profiling the code now, but was wondering if there was any insight...

-- Jeff

Non-lazy version

{
module Main where
import qualified FileReader
}

%wrapper "basic"

$letter   = [a-zA-Z]
$digit    = 0-9
$alphanum = [a-zA-Z0-9]
$punct    = [\! \@ \# \$ \% \^ \& \* \( \) \_ \- \+ \= \{ \[ \} \] \\ \| \; \: \' \" \, \. \? \/ \` \~]
$dec      = \.
$posneg   = [\- \+]

@date1 = jan($punct|uary)?\ $digit{1,2}(\,\ $digit{2,4})?
       | feb($punct|ruary)?\ $digit{1,2}(\,\ $digit{2,4})?
       | mar($punct|ch)?\ $digit{1,2}(\,\ $digit{2,4})?
       | apr($punct|il)?\ $digit{1,2}(\,\ $digit{2,4})?
       | may?\ $digit{1,2}(\,\ $digit{2,4})?
       | jun($punct|e)?\ $digit{1,2}(\,\ $digit{2,4})?
       | jul($punct|y)?\ $digit{1,2}(\,\ $digit{2,4})?
       | aug($punct|ust)?\ $digit{1,2}(\,\ $digit{2,4})?
       | sep($punct|tember)?\ $digit{1,2}(\,\ $digit{2,4})?
       | sept($punct)?\ $digit{1,2}(\,\ $digit{2,4})?
       | oct($punct|ober)?\ $digit{1,2}(\,\ $digit{2,4})?
       | nov($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
       | dec($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?

@date2  = $digit{1,2} $punct $digit{1,2} $punct $digit{2,4}
@time   = $digit{1,2} \: $digit{2} (am|pm)?
@word   = $alphanum+
@number = $posneg? $digit+
        | $posneg? $digit+ $dec $digit+
        | $posneg? $digit+ (\,$digit{3})+
        | $posneg? $digit? (\,$digit{3})+ $dec $digit+

$white = [\t\r\n\v\f\ ]

@doc = \< DOC \>
@tag = \< $alphanum+ \>
     | \<\/ $alphanum+ \>

tokens :-
  @doc    { \s -> "" }
  @tag    ;
  $white+ ;
  @time   { \s -> s }
  @number { \s -> s }
  @word   { \s -> s }
  $punct  ;
  .       ;

{
printCount c []     = print c
printCount c (l:ls) = if l == "" then printCount (c+1) ls
                                 else printCount c ls

main = do
  file <- readFile "trecfile1"
  printCount 0 (alexScanTokens file)
}

------------------------------------------------------------
Version depending on ByteString.Lazy
(the grammar is the same, so it has been omitted)
------------------------------------------------------------

... grammar ...

{
type AlexInput = (Char,         -- previous char
                  B.ByteString) -- current input string

takebytes :: Int -> B.ByteString -> String
takebytes 0 _ = ""
takebytes n s = c : takebytes (n-1) cs
  where c  = B.index s 0
        cs = B.drop 1 s

alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar (_, bytestring)
  | bytestring == B.empty = Nothing
  | otherwise             = Just (c, (c, cs))
  where c  = B.index bytestring 0
        cs = B.drop 1 bytestring

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (c, _) = c

alexScanTokens :: B.ByteString -> [String]
alexScanTokens str = go ('\n', str)
  where go inp@(_, str) =
          case alexScan inp 0 of
            AlexToken inp' len act -> act (takebytes len str) : go inp'
            AlexSkip  inp' len     -> go inp'
            AlexEOF                -> []
            AlexError _            -> error "lexical error"

printCount :: Int -> [String] -> IO ()
printCount c []     = print c
printCount c (l:ls) = if l == "" then printCount (c+1) ls
                                 else printCount c ls

main = do
  file <- B.readFile "trecfile1"
  printCount 0 (alexScanTokens file)
}
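(An aside on the hand-rolled takebytes above: the same thing can be had from the library's own substring functions. A minimal sketch, assuming B is Data.ByteString.Lazy.Char8, whose take counts in Int64 -- hence the fromIntegral:)

takebytes :: Int -> B.ByteString -> String
takebytes n = B.unpack . B.take (fromIntegral n)
  -- B.take slices the first n bytes off the front and B.unpack
  -- converts that slice to a String, rather than re-entering the
  -- chunk structure with B.index/B.drop once per character, which
  -- may avoid some per-character allocation.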

jeff:
It was suggested that I might derive some performance benefit from using lazy bytestrings in my tokenizer instead of regular strings. Here's the code that I've tried. Note that I've hacked the "basic" wrapper code in the Lazy version, so the code should be all but the same. The only thing I had to do out of the ordinary was write my own 'take' function instead of using the substring function provided by Data.ByteString.Lazy.Char8. The take function I used was derived from the one GHC uses in GHC.List and produces about the same code.
The non-lazy version runs in 38 seconds on a 211MB file versus the lazy version's 41 seconds. That doesn't seem like much, and in the non-lazy case I have to break the input up into multiple files (which takes no extra time), whereas the lazy version reads the input whole. Still, the seconds add up to a couple of hours by the time I'm done, so I'd like to understand why, when the consensus was that Data.ByteString.Lazy might give me better performance, it doesn't do so here.
I am running GHC 2.6 now, and am using -O3 as my optimization parameter. I'm profiling the code now, but was wondering if there was any insight...
GHC 6.6 you mean? Can you post a complete example, including FileReader, with some example input and output, so that I can compile the code and work out what's going on? By the way, if you're able to break the file into chunks already, we should be able to do even better with a strict ByteString.

Cheers,
Don
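(To make Don's strict-ByteString remark concrete, here is a minimal sketch. It assumes the wrapper functions from the post are rewritten against the strict API; the S alias and the reuse of alexScanTokens and printCount are assumptions, not code from the thread:)

import qualified Data.ByteString.Char8 as S

main :: IO ()
main = do
  -- S.readFile pulls the whole (pre-chunked) file into one
  -- contiguous buffer, so alexGetChar reduces to a head and an
  -- O(1) tail, with no lazy-chunk bookkeeping.
  file <- S.readFile "trecfile1"
  printCount 0 (alexScanTokens file)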

On Tue, Feb 13, 2007 at 10:43:11PM -0500, Jefferson Heard wrote:
I am running GHC 2.6 now, and am using -O3 as my optimization parameter.
I think you will get much better performance with GHC 6.6. The optimizer has been improved a *lot* in the last 10 years. (I hope that was a typo!!)
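(On the -O3 point: GHC's effective ceiling is -O2; -O3 is accepted on the command line but, at least in GHC of this vintage, enables nothing beyond -O2. The usual invocation would be something like the following, where Tokenizer.hs is a stand-in file name:)

ghc -O2 --make Tokenizer.hs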
[non-lazy version snipped; see the original post above]
FTR, regular strings are lazy - too lazy, which is where the performance problems come from.
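(A rough illustration of what "too lazy" costs. The representations below are indicative only -- exact field layouts and sizes vary by GHC and bytestring version:)

-- type String = [Char] is a lazy linked list: one (:) cell plus a
-- boxed Char per character, on the order of 20 bytes each, forced
-- one element at a time. A strict ByteString is a packed buffer
-- with a constant-size header, roughly:
--   data ByteString = PS !(ForeignPtr Word8) !Int !Int
-- The difference shows up even in trivial operations:

import qualified Data.ByteString.Char8 as S

lenString :: String -> Int
lenString = length    -- walks the whole list, forcing every cell

lenBytes :: S.ByteString -> Int
lenBytes = S.length   -- O(1): reads the stored length field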
[lazy version; grammar snipped]

type AlexInput = (Char,         -- previous char
                  B.ByteString) -- current input string

takebytes :: Int -> B.ByteString -> String
takebytes 0 _ = ""
takebytes n s = c : takebytes (n-1) cs
  where c  = B.index s 0
        cs = B.drop 1 s

alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar (_, bytestring)
  | bytestring == B.empty = Nothing
  | otherwise             = Just (c, (c, cs))
  where c  = B.index bytestring 0
        cs = B.drop 1 bytestring
Hm, you might do better with more specialized functions.
alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar (_, bytestring)
  | B.null bytestring = Nothing
  | otherwise         = Just (c, (c, cs))
  where c  = B.head bytestring
        cs = B.tail bytestring
or even:
alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar (_, bytestring)
  | B.null bytestring = Nothing
  | otherwise         = Just (c, (c, cs))
  where c  = B.unsafeHead bytestring
        cs = B.unsafeTail bytestring
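(A caveat of my own, not from Stefan's message: unsafeHead and unsafeTail are defined for strict ByteStrings -- in later bytestring releases they live in Data.ByteString.Unsafe, and to my knowledge there is no lazy counterpart -- so this variant implies the strict flavour and an import along these lines:)

import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
-- Both imports share the alias, so B.unsafeHead and B.unsafeTail
-- resolve; they skip the bounds checks that B.head and B.tail make.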
alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (c, _) = c
If you are certain this isn't the first character, you might do better using B.unsafeIndex (-1).
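(My reading of that last suggestion, as a hypothetical sketch rather than thread code: keep the whole strict input plus a current offset in AlexInput, and look one byte backwards instead of threading the previous character through the tuple. alexScanTokens would have to be adapted to the new input shape:)

import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Unsafe as S
import Data.Char (chr)

type AlexInput = (S.ByteString, Int)   -- whole input, current offset

alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar (s, i)
  | i >= S.length s = Nothing
  | otherwise       = Just (S.index s i, (s, i + 1))

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (s, i) = chr (fromIntegral (S.unsafeIndex s (i - 1)))
  -- unsafe: assumes i > 0, i.e. alex never asks for the character
  -- before the start of the input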

Yes, that was a typo :-)

On Tuesday 13 February 2007 22:54, Stefan O'Rear wrote:
On Tue, Feb 13, 2007 at 10:43:11PM -0500, Jefferson Heard wrote:
I am running GHC 2.6 now, and am using -O3 as my optimization parameter.
I think you will get much better performance with GHC 6.6. The optimizer has been improved a *lot* in the last 10 years.
(I hope that was a typo!!)

On a related topic, I think Duncan Coutts and Lennart Kolmodin have worked on adding ByteString support to Alex. It seems to be available in the current darcs version of Alex. You may want to check with them for more details.

/Björn

On Tue, 2007-02-13 at 22:43 -0500, Jefferson Heard wrote:
Note that I've hacked the "basic" wrapper code in the Lazy version, so the code should be all but the same. The only thing I had to do out of the ordinary was write my own 'take' function instead of using the substring function provided by Data.ByteString.Lazy.Char8. The take function I used was derived from the one GHC uses in GHC.List and produces about the same code.
If you use the latest darcs version of alex and the "basic-bytestring" wrapper, then you don't have to write any of your own take functions.

Actually, lazy ByteStrings are still not as optimised as I would like; there are still too many indirections. That's something I'm working on at the moment.

Duncan
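(Sketching what Duncan describes: the wrapper name is from his message, but the details below are assumptions based on how the released alex bytestring wrappers later behaved, with actions receiving the matched lazy ByteString instead of a String:)

%wrapper "basic-bytestring"

-- ...same macros and rules as before...
-- An action such as   @word { \s -> s }   now binds s to a lazy
-- ByteString, alexScanTokens consumes a lazy ByteString directly,
-- and printCount's empty-token test becomes B.null l rather than
-- l == "". No hand-written takebytes, alexGetChar, or
-- alexInputPrevChar is needed.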
participants (5)
- Björn Bringert
- dons@cse.unsw.edu.au
- Duncan Coutts
- Jefferson Heard
- Stefan O'Rear