newbie questions (read, etc., with Data.ByteString.Lazy.Char8)

Hi, I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method. There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off. So: 1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way. 2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs? 3. Why doesn't ByteString implement 'readDouble', etc.? That is, why are Int and Integer treated specially? Do I not need readDouble? Thanks, Mike -- lazy version (doesn't compile) -- file: ch08/SumFile.hs import qualified Data.ByteString.Lazy.Char8 as L main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map read . L.words

On Mon, Oct 6, 2008 at 7:06 PM, Mike Coleman
Hi,
I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method.
There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.
So:
1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way.
I'm not a ByteString expert, but there should be an easy way to solve this issue of Maybe. If you go to hoogle (http://www.haskell.org/hoogle/) and type this: [Maybe a] -> [a] it says: Data.Maybehttp://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html .catMaybeshttp://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html#v:cat...:: [Maybe a] -> [a]http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html#v:cat... As the top search result. This means that you can convert any list of maybes into a list of what you want. It just tosses out the Nothings.
2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?
I think probably for things more complex than parsing ints it's best to make your own parser? I seem to recall that someone was working on a library of parsing functions based on bytestring? Maybe someone else can comment? 3. Why doesn't ByteString implement 'readDouble', etc.? That is, why
are Int and Integer treated specially? Do I not need readDouble?
I think readInt was mostly implemented because integer reading was needed a lot for benchmarks and programming challenge sites and people noticed it was way slower than needed so someone put in the effort it optimize it. Once it was optimized, that must have satisfied the need for fast number reading. I would agree that at least for Prelude types it would be nice to have efficient bytestring based parsers. Do we have Read/Show classes specifically for working in bytestrings? Maybe that's what the world needs in the bytestring api. Then again, I'm not really qualified to comment :) For all I know it already exists. Jason

dagit:
As the top search result.
This means that you can convert any list of maybes into a list of what you want. It just tosses out the Nothings.
2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?
I think probably for things more complex than parsing ints it's best to make your own parser? I seem to recall that someone was working on a library of parsing functions based on bytestring? Maybe someone else can comment?
3. Why doesn't ByteString implement 'readDouble', etc.? That is, why are Int and Integer treated specially? Do I not need readDouble?
I think readInt was mostly implemented because integer reading was needed a lot for benchmarks and programming challenge sites and people noticed it was way slower than needed so someone put in the effort it optimize it. Once it was optimized, that must have satisfied the need for fast number reading.
There's readInt, readInteger and readDouble now, as primitives, because people ask for them.
I would agree that at least for Prelude types it would be nice to have efficient bytestring based parsers. Do we have Read/Show classes specifically for working in bytestrings? Maybe that's what the world needs in the bytestring api. Then again, I'm not really qualified to comment :) For all I know it already exists.
Data.Binary is the generic parser for bytestrings. A higher level one for parsing text (e.g. atto-parsec), might be worthwhile. -- Don

Am Dienstag, 7. Oktober 2008 04:21 schrieb Jason Dagit:
On Mon, Oct 6, 2008 at 7:06 PM, Mike Coleman
wrote: Hi,
I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method.
There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.
So:
1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way.
I'm not a ByteString expert, but there should be an easy way to solve this issue of Maybe.
If you go to hoogle (http://www.haskell.org/hoogle/) and type this: [Maybe a] -> [a] it says: Data.Maybe<http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Mayb e.html> .catMaybes<http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Mayb e.html#v:catMaybes>:: [Maybe a] -> [a]<http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html# v:catMaybes>
As the top search result.
This means that you can convert any list of maybes into a list of what you want. It just tosses out the Nothings.
Since readInt returns a Maybe (Int,ByteString), Data.List.unfoldr would be a better fit for his needs. The bytestring-lexing package (http://hackage.haskell.org/packages/archive/bytestring-lexing/0.1.2/doc/html...) provides readDouble, which is also pretty fast, I think.
2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?
I think probably for things more complex than parsing ints it's best to make your own parser? I seem to recall that someone was working on a library of parsing functions based on bytestring? Maybe someone else can comment?
At least parsec 3.0.0 has ByteString parsing modules (I've never used it, so I don't know how well they work). IIRC, there's a plan to expand the bytestring-lexing package, too.
3. Why doesn't ByteString implement 'readDouble', etc.? That is, why
are Int and Integer treated specially? Do I not need readDouble?
I think readInt was mostly implemented because integer reading was needed a lot for benchmarks and programming challenge sites and people noticed it was way slower than needed so someone put in the effort it optimize it. Once it was optimized, that must have satisfied the need for fast number reading.
More's underway, readDouble already delivered.
I would agree that at least for Prelude types it would be nice to have efficient bytestring based parsers. Do we have Read/Show classes specifically for working in bytestrings? Maybe that's what the world needs in the bytestring api. Then again, I'm not really qualified to comment :) For all I know it already exists.
partially.
Jason

Thanks for your replies. Hoogle is pretty cool--I didn't know about that. I ended up with this, which is pretty close to the original (it will also bomb if given non-integer input): import qualified Data.ByteString.Lazy.Char8 as L import qualified Data.Maybe as M main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map (fst . M.fromJust . L.readInt) . L.words One further problem I've encountered: My Haskell program runs under 'runghc', but I get a link error when compiling it (GHC 6.8.2, latest Ubuntu). Any suggestions? $ ghc -o LazySumFile LazySumFile.hs LazySumFile.o: In function `sOz_info': (.text+0x200): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_readInt_closure' LazySumFile.o: In function `sOF_info': (.text+0x35d): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_words_closure' LazySumFile.o: In function `sRF_info': (.text+0x4d7): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzy_getContents_closure' LazySumFile.o: In function `sRF_info': (.text+0x64f): undefined reference to `__stginit_bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_' LazySumFile.o: In function `sOF_srt': (.data+0xa0): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_words_closure' LazySumFile.o: In function `sOF_srt': (.data+0xa8): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_readInt_closure' LazySumFile.o: In function `rOh_closure': (.data+0x110): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzy_getContents_closure' collect2: ld returned 1 exit status

tutufan:
Thanks for your replies. Hoogle is pretty cool--I didn't know about that.
I ended up with this, which is pretty close to the original (it will also bomb if given non-integer input):
import qualified Data.ByteString.Lazy.Char8 as L import qualified Data.Maybe as M
main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map (fst . M.fromJust . L.readInt) . L.words
One further problem I've encountered: My Haskell program runs under 'runghc', but I get a link error when compiling it (GHC 6.8.2, latest Ubuntu). Any suggestions?
Oh wow, runghc is an interpreter. It is on average about 30x slower than compiled code. To compile, try something like: ghc -O2 --make A.hs And enjoy the native codes. :) -- Don

On Mon, Oct 6, 2008 at 10:19 PM, Don Stewart
To compile, try something like:
ghc -O2 --make A.hs
That did the trick--thanks. For input large enough to be a good test, this is I/O-bound on my laptop (the speed of the compiled and interpreted versions are almost identical), so I'll have to try it on a bigger machine tomorrow. Mike

a slight modification to compile it :
change:
where sumFile = sum . map read . L.words
to :
where sumFile = sum . map (read . L.unpack) . L.words
but it's actually _slower_ than the non-bytestring version.
i did a little test, three versions of the same script and manufactured
meself a ~50 MB file containing 1M of ints 0-65535. and replaced the sum
with length for obvious reasons.
module Main where
import qualified Data.ByteString.Lazy.Char8 as L
main1 = do
contents <- L.getContents
print (sumFile contents)
where sumFile = length . map L.readInt . L.words
main2 = do
contents <- getContents
print (sumFile contents)
where sumFile = length . map (read :: String -> Int) . words
main3 = do
contents <- L.getContents
print (sumFile contents)
where sumFile = length . map ((read :: String -> Int) .
L.unpack) . L.words
time main3 < nums
real 0m22.421s
user 0m0.031s
sys 0m0.000s
time main2 < nums
real 0m14.296s
user 0m0.015s
sys 0m0.016s
time main1 < nums
real 0m22.078s
user 0m0.015s
sys 0m0.015s
i expected the conversions (L.unpack in main3) to kill the performance a
little, but not to make it nearly two times as slow.
and i certainly did not expect that even the version using the bytestring
readInt to be as slow ...
did I do something wrong ?
On Tue, Oct 7, 2008 at 4:06 AM, Mike Coleman
Hi,
I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method.
There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.
So:
1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way.
2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?
3. Why doesn't ByteString implement 'readDouble', etc.? That is, why are Int and Integer treated specially? Do I not need readDouble?
Thanks, Mike
-- lazy version (doesn't compile)
-- file: ch08/SumFile.hs
import qualified Data.ByteString.Lazy.Char8 as L
main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map read . L.words _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

666wman:
a slight modification to compile it :
change: where sumFile = sum . map read . L.words to : where sumFile = sum . map (read . L.unpack) . L.words
but it's actually _slower_ than the non-bytestring version.
Never unpack a bytestring. import qualified Data.ByteString.Lazy.Char8 as S main = print . go 0 =<< S.getContents where go n s = case S.readInt s of Nothing -> n Just (k,t) -> go (n+k) (S.tail t) Assuming you're reading int, integers or doubles.

the problem is that using readInt is actually _as slow_, at least using my
test script :-((
On Tue, Oct 7, 2008 at 5:12 AM, Don Stewart
666wman:
a slight modification to compile it :
change: where sumFile = sum . map read . L.words to : where sumFile = sum . map (read . L.unpack) . L.words
but it's actually _slower_ than the non-bytestring version.
Never unpack a bytestring.
import qualified Data.ByteString.Lazy.Char8 as S
main = print . go 0 =<< S.getContents where go n s = case S.readInt s of Nothing -> n Just (k,t) -> go (n+k) (S.tail t)
Assuming you're reading int, integers or doubles.

Hmm. How are you compiling it? Using bytestring 0.9.1.x ? Should be fast, http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all Assuming you're turning on optimisations ( ghc -O2 ) -- Don 666wman:
the problem is that using readInt is actually _as slow_, at least using my test script :-((
On Tue, Oct 7, 2008 at 5:12 AM, Don Stewart <[1]dons@galois.com> wrote:
666wman: > a slight modification to compile it : > > change: > where sumFile = sum . map read . L.words > to : > where sumFile = sum . map (read . L.unpack) . L.words > > but it's actually _slower_ than the non-bytestring version.
Never unpack a bytestring.
import qualified Data.ByteString.Lazy.Char8 as S
main = print . go 0 =<< S.getContents where go n s = case S.readInt s of Nothing -> n Just (k,t) -> go (n+k) (S.tail t)
Assuming you're reading int, integers or doubles.
References
Visible links 1. mailto:dons@galois.com

new figures, after updating bytestring (0.9.0.1.1 -> 0.9.1.2) && using -O2
time Main < nums
real 0m2.531s
user 0m0.015s
sys 0m0.015s
time Main2 < nums
real 0m13.999s
user 0m0.015s
sys 0m0.015s
time Main3 < nums
real 0m2.796s
user 0m0.015s
sys 0m0.015s
thats more like it, even the unpacking didn't hurt so much.
the morals: "Thou shalt update your libraries" & "Thou shalt not forget to
turn on optimizations" before bitching it's too slow ;-)))
thx
On Tue, Oct 7, 2008 at 5:19 AM, Don Stewart
Hmm. How are you compiling it? Using bytestring 0.9.1.x ?
Should be fast,
http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all
Assuming you're turning on optimisations ( ghc -O2 )
-- Don

just for the kicks i tried the new version of bytestring without -O2 and the results were even worse: time Main1 < nums real 0m40.092s user 0m0.015s sys 0m0.015s time Main3 < nums real 0m41.405s user 0m0.015s sys 0m0.015s it got pwned even by this very naive ruby scipt (which, btw, chewed through some ~600 MB of memory ;-)) File.open("nums","r") do |f| puts((f.read.split.each {|x| x.to_i }).length) end time ruby nums.rb real 0m21.609s user 0m0.015s sys 0m0.015s so it probably can't be stressed enough: repeat "-O2" and now seriously: is there a reason why -O2 shouldn't be made the default (and allowing to turn off optimizations by -O0 perhaps) ? On Tue, Oct 7, 2008 at 5:36 AM, wman <666wman@gmail.com> wrote:
new figures, after updating bytestring (0.9.0.1.1 -> 0.9.1.2) && using -O2
time Main < nums real 0m2.531s user 0m0.015s sys 0m0.015s
time Main2 < nums real 0m13.999s user 0m0.015s sys 0m0.015s
time Main3 < nums real 0m2.796s user 0m0.015s sys 0m0.015s
thats more like it, even the unpacking didn't hurt so much.
the morals: "Thou shalt update your libraries" & "Thou shalt not forget to turn on optimizations" before bitching it's too slow ;-)))
thx
On Tue, Oct 7, 2008 at 5:19 AM, Don Stewart
wrote: Hmm. How are you compiling it? Using bytestring 0.9.1.x ?
Should be fast,
http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all
Assuming you're turning on optimisations ( ghc -O2 )
-- Don

On 2008 Oct 6, at 23:54, wman wrote:
just for the kicks i tried the new version of bytestring without -O2 and the results were even worse:
Yep, ByteString does a lot of stuff that is really bad when GHC isn't performing stream fusion --- but when it is, the code compiles down to tight machine code that can rival C.
is there a reason why -O2 shouldn't be made the default (and allowing to turn off optimizations by -O0 perhaps) ?
-O2 breaks on some platforms, I think. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Hello Brandon, Tuesday, October 7, 2008, 7:59:06 AM, you wrote:
is there a reason why -O2 shouldn't be made the default (and allowing to turn off optimizations by -O0 perhaps) ?
it compiles ~2x slower and firces more recompilation (because it does inter-module inlining). so it's not perfect for edit-compile-debug runs -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

666wman:
just for the kicks i tried the new version of bytestring without -O2 and the results were even worse:
Note that without -O or -O2 no strictness analysis is performed. So that tail recursive loop ... won't be. You could try -Onot -fstrictness just for kicks, to see why strictness analysis is important when writing in a tail recursive style. -- Don

ghc -Onot -fstrictness --make Main1.hs && ghc -Onot -fstrictness --make
Main2.hs && ghc -Onot -fstrictness --make Main3.hs
time Main1 < nums
real 0m39.530s
user 0m0.015s
sys 0m0.030s
time Main2 < nums
real 0m14.078s
user 0m0.015s
sys 0m0.015s
time Main3.exe < nums
real 0m41.342s
user 0m0.015s
sys 0m0.015s
still, i'm going to google up strictness analysis to at least know what made
no difference in this case ;-)
btw, why is the example #2 (
http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=2)
(which kicks collective asses of all other participants) not considered in
the shootout ? Too much optimizations ?
On Tue, Oct 7, 2008 at 6:27 AM, Don Stewart
666wman:
just for the kicks i tried the new version of bytestring without -O2 and the results were even worse:
Note that without -O or -O2 no strictness analysis is performed. So that tail recursive loop ... won't be. You could try -Onot -fstrictness just for kicks, to see why strictness analysis is important when writing in a tail recursive style.
-- Don

666wman:
ghc -Onot -fstrictness --make Main1.hs && ghc -Onot -fstrictness --make Main2.hs && ghc -Onot -fstrictness --make Main3.hs
time Main1 < nums real 0m39.530s user 0m0.015s sys 0m0.030s
time Main2 < nums real 0m14.078s user 0m0.015s sys 0m0.015s
time Main3.exe < nums real 0m41.342s user 0m0.015s sys 0m0.015s
still, i'm going to google up strictness analysis to at least know what made no difference in this case ;-)
Oh, that's interesting. Did they actually recompile? (you used -fforce-recomp). Otherwise, might be an idea to use hpaste.org to paste teh Main1,2,3 source code, since I think people reading don't know what is represented in each program at this point. -- Don

Hello wman, Tuesday, October 7, 2008, 8:44:48 AM, you wrote:
btw, why is the example #2 (http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=2) (which kicks collective asses of all other participants) not considered in the shootout ? Too much optimizations ?
it's cheating - application code written in low-level style instead of using library functions (if you will write in the same style in C it will probably even several times faster) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Mon, 2008-10-06 at 21:06 -0500, Mike Coleman wrote:
There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.
You can use Data.Maybe's 'mapMaybe' function "The mapMaybe function is a version of map which can throw out elements. In particular, the functional argument returns something of type Maybe b. If this is Nothing, no element is added on to the result list. If it just Just b, then b is included in the result list." So we change this:
-- lazy version (doesn't compile)
-- file: ch08/SumFile.hs
import qualified Data.ByteString.Lazy.Char8 as L
main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map read . L.words
To this: import qualified Data.ByteString.Lazy.Char8 as B import Data.Maybe (mapMaybe) main = do contents <- B.getContents print (sumFile contents) where sumFile = sum . map fst . mapMaybe B.readInt . B.words Now, there is a problem with this; the original version will not read integers that are followed by garbage, whereas this one will. The difference is between readInt and read: (read "2a") will fail, (readInt "2a") will return Just (2,"a") readInt is really designed with parsers in mind. So, we need to filter out anything readInt returns with a non-empty string in the right part of the tuple. Ending up with this: sumFile = sum . map fst . filter (B.null . snd) . mapMaybe B.readInt . B.words Actually, after saying this, the original version probably just dies when given garbage :P
participants (8)
-
Brandon S. Allbery KF8NH
-
Bulat Ziganshin
-
Daniel Fischer
-
Don Stewart
-
George Pollard
-
Jason Dagit
-
Mike Coleman
-
wman