newbie questions (read, etc., with Data.ByteString.Lazy.Char8)

Mike Coleman

6 Oct 2008 6 Oct '08

10:06 p.m.

Hi, I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method. There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off. So: 1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way. 2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs? 3. Why doesn't ByteString implement 'readDouble', etc.? That is, why are Int and Integer treated specially? Do I not need readDouble? Thanks, Mike -- lazy version (doesn't compile) -- file: ch08/SumFile.hs import qualified Data.ByteString.Lazy.Char8 as L main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map read . L.words

Show replies by date

Jason Dagit

6 Oct 6 Oct

10:21 p.m.

On Mon, Oct 6, 2008 at 7:06 PM, Mike Coleman wrote:

...

Hi,

I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method.

There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.

So:

1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way.

I'm not a ByteString expert, but there should be an easy way to solve this issue of Maybe. If you go to hoogle (http://www.haskell.org/hoogle/) and type this: [Maybe a] -> [a] it says: Data.Maybehttp://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html .catMaybeshttp://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html#v:cat...:: [Maybe a] -> [a]http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html#v:cat... As the top search result. This means that you can convert any list of maybes into a list of what you want. It just tosses out the Nothings.

...

2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?

I think probably for things more complex than parsing ints it's best to make your own parser? I seem to recall that someone was working on a library of parsing functions based on bytestring? Maybe someone else can comment? 3. Why doesn't ByteString implement 'readDouble', etc.? That is, why

...

are Int and Integer treated specially? Do I not need readDouble?

I think readInt was mostly implemented because integer reading was needed a lot for benchmarks and programming challenge sites and people noticed it was way slower than needed so someone put in the effort it optimize it. Once it was optimized, that must have satisfied the need for fast number reading. I would agree that at least for Prelude types it would be nice to have efficient bytestring based parsers. Do we have Read/Show classes specifically for working in bytestrings? Maybe that's what the world needs in the bytestring api. Then again, I'm not really qualified to comment :) For all I know it already exists. Jason

Don Stewart

10:31 p.m.

dagit:

...

As the top search result.

This means that you can convert any list of maybes into a list of what you want. It just tosses out the Nothings.

2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?

I think probably for things more complex than parsing ints it's best to make your own parser? I seem to recall that someone was working on a library of parsing functions based on bytestring? Maybe someone else can comment?

3. Why doesn't ByteString implement 'readDouble', etc.? That is, why are Int and Integer treated specially? Do I not need readDouble?

I think readInt was mostly implemented because integer reading was needed a lot for benchmarks and programming challenge sites and people noticed it was way slower than needed so someone put in the effort it optimize it. Once it was optimized, that must have satisfied the need for fast number reading.

There's readInt, readInteger and readDouble now, as primitives, because people ask for them.

...

I would agree that at least for Prelude types it would be nice to have efficient bytestring based parsers. Do we have Read/Show classes specifically for working in bytestrings? Maybe that's what the world needs in the bytestring api. Then again, I'm not really qualified to comment :) For all I know it already exists.

Data.Binary is the generic parser for bytestrings. A higher level one for parsing text (e.g. atto-parsec), might be worthwhile. -- Don

Daniel Fischer

10:35 p.m.

Am Dienstag, 7. Oktober 2008 04:21 schrieb Jason Dagit:

...

On Mon, Oct 6, 2008 at 7:06 PM, Mike Coleman wrote:

...
Hi,

I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method.

There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.

So:

1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way.

I'm not a ByteString expert, but there should be an easy way to solve this issue of Maybe.

If you go to hoogle (http://www.haskell.org/hoogle/) and type this: [Maybe a] -> [a] it says: Data.Maybe<http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Mayb e.html> .catMaybes<http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Mayb e.html#v:catMaybes>:: [Maybe a] -> [a]<http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Maybe.html# v:catMaybes>

As the top search result.

This means that you can convert any list of maybes into a list of what you want. It just tosses out the Nothings.

Since readInt returns a Maybe (Int,ByteString), Data.List.unfoldr would be a better fit for his needs. The bytestring-lexing package (http://hackage.haskell.org/packages/archive/bytestring-lexing/0.1.2/doc/html...) provides readDouble, which is also pretty fast, I think.

...

...
2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?

I think probably for things more complex than parsing ints it's best to make your own parser? I seem to recall that someone was working on a library of parsing functions based on bytestring? Maybe someone else can comment?

At least parsec 3.0.0 has ByteString parsing modules (I've never used it, so I don't know how well they work). IIRC, there's a plan to expand the bytestring-lexing package, too.

...

3. Why doesn't ByteString implement 'readDouble', etc.? That is, why

...
are Int and Integer treated specially? Do I not need readDouble?

I think readInt was mostly implemented because integer reading was needed a lot for benchmarks and programming challenge sites and people noticed it was way slower than needed so someone put in the effort it optimize it. Once it was optimized, that must have satisfied the need for fast number reading.

More's underway, readDouble already delivered.

...

I would agree that at least for Prelude types it would be nice to have efficient bytestring based parsers. Do we have Read/Show classes specifically for working in bytestrings? Maybe that's what the world needs in the bytestring api. Then again, I'm not really qualified to comment :) For all I know it already exists.

partially.

...

Jason

Mike Coleman

11:14 p.m.

Thanks for your replies. Hoogle is pretty cool--I didn't know about that. I ended up with this, which is pretty close to the original (it will also bomb if given non-integer input): import qualified Data.ByteString.Lazy.Char8 as L import qualified Data.Maybe as M main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map (fst . M.fromJust . L.readInt) . L.words One further problem I've encountered: My Haskell program runs under 'runghc', but I get a link error when compiling it (GHC 6.8.2, latest Ubuntu). Any suggestions? $ ghc -o LazySumFile LazySumFile.hs LazySumFile.o: In function `sOz_info': (.text+0x200): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_readInt_closure' LazySumFile.o: In function `sOF_info': (.text+0x35d): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_words_closure' LazySumFile.o: In function `sRF_info': (.text+0x4d7): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzy_getContents_closure' LazySumFile.o: In function `sRF_info': (.text+0x64f): undefined reference to `__stginit_bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_' LazySumFile.o: In function `sOF_srt': (.data+0xa0): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_words_closure' LazySumFile.o: In function `sOF_srt': (.data+0xa8): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzyziChar8_readInt_closure' LazySumFile.o: In function `rOh_closure': (.data+0x110): undefined reference to `bytestringzm0zi9zi0zi1_DataziByteStringziLazzy_getContents_closure' collect2: ld returned 1 exit status

Don Stewart

11:19 p.m.

tutufan:

...

Thanks for your replies. Hoogle is pretty cool--I didn't know about that.

I ended up with this, which is pretty close to the original (it will also bomb if given non-integer input):

import qualified Data.ByteString.Lazy.Char8 as L import qualified Data.Maybe as M

main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map (fst . M.fromJust . L.readInt) . L.words

One further problem I've encountered: My Haskell program runs under 'runghc', but I get a link error when compiling it (GHC 6.8.2, latest Ubuntu). Any suggestions?

Oh wow, runghc is an interpreter. It is on average about 30x slower than compiled code. To compile, try something like: ghc -O2 --make A.hs And enjoy the native codes. :) -- Don

Mike Coleman

7 Oct 7 Oct

12:17 a.m.

On Mon, Oct 6, 2008 at 10:19 PM, Don Stewart wrote:

...

To compile, try something like:

ghc -O2 --make A.hs

That did the trick--thanks. For input large enough to be a good test, this is I/O-bound on my laptop (the speed of the compiled and interpreted versions are almost identical), so I'll have to try it on a bigger machine tomorrow. Mike

wman

6 Oct 6 Oct

11:07 p.m.

a slight modification to compile it : change: where sumFile = sum . map read . L.words to : where sumFile = sum . map (read . L.unpack) . L.words but it's actually _slower_ than the non-bytestring version. i did a little test, three versions of the same script and manufactured meself a ~50 MB file containing 1M of ints 0-65535. and replaced the sum with length for obvious reasons. module Main where import qualified Data.ByteString.Lazy.Char8 as L main1 = do contents <- L.getContents print (sumFile contents) where sumFile = length . map L.readInt . L.words main2 = do contents <- getContents print (sumFile contents) where sumFile = length . map (read :: String -> Int) . words main3 = do contents <- L.getContents print (sumFile contents) where sumFile = length . map ((read :: String -> Int) . L.unpack) . L.words time main3 < nums real 0m22.421s user 0m0.031s sys 0m0.000s time main2 < nums real 0m14.296s user 0m0.015s sys 0m0.016s time main1 < nums real 0m22.078s user 0m0.015s sys 0m0.015s i expected the conversions (L.unpack in main3) to kill the performance a little, but not to make it nearly two times as slow. and i certainly did not expect that even the version using the bytestring readInt to be as slow ... did I do something wrong ? On Tue, Oct 7, 2008 at 4:06 AM, Mike Coleman wrote:

...

Hi,

I could use a little help. I was looking through the Real World Haskell book and came across a trivial program for summing numbers in a file. They mentioned that that implementation was very slow, as it's based on String's, so I thought I'd try my hand at converting it to use lazy ByteString's. I've made some progress, but now I'm a little stuck because that module doesn't seem to have a 'read' method.

There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.

So:

1. Is there an easy way to strip off the Maybe that would allow an equivalently concise definition for sumFile? I can probably figure out how to do it with pattern matching and a separate function--I'm just wondering if there's a more concise way.

2. Why doesn't ByteString implement 'read'? Is it just that this function (like 'input' in Python) isn't really very useful for real programs?

3. Why doesn't ByteString implement 'readDouble', etc.? That is, why are Int and Integer treated specially? Do I not need readDouble?

Thanks, Mike

-- lazy version (doesn't compile)

-- file: ch08/SumFile.hs

import qualified Data.ByteString.Lazy.Char8 as L

main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map read . L.words _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Don Stewart

11:12 p.m.

666wman:

...

a slight modification to compile it :

change: where sumFile = sum . map read . L.words to : where sumFile = sum . map (read . L.unpack) . L.words

but it's actually _slower_ than the non-bytestring version.

Never unpack a bytestring. import qualified Data.ByteString.Lazy.Char8 as S main = print . go 0 =<< S.getContents where go n s = case S.readInt s of Nothing -> n Just (k,t) -> go (n+k) (S.tail t) Assuming you're reading int, integers or doubles.

wman

11:17 p.m.

the problem is that using readInt is actually _as slow_, at least using my test script :-(( On Tue, Oct 7, 2008 at 5:12 AM, Don Stewart wrote:

...

666wman:

...
a slight modification to compile it :

change: where sumFile = sum . map read . L.words to : where sumFile = sum . map (read . L.unpack) . L.words

but it's actually _slower_ than the non-bytestring version.

Never unpack a bytestring.

import qualified Data.ByteString.Lazy.Char8 as S

main = print . go 0 =<< S.getContents where go n s = case S.readInt s of Nothing -> n Just (k,t) -> go (n+k) (S.tail t)

Assuming you're reading int, integers or doubles.

Don Stewart

11:19 p.m.

Hmm. How are you compiling it? Using bytestring 0.9.1.x ? Should be fast, http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all Assuming you're turning on optimisations ( ghc -O2 ) -- Don 666wman:

...

the problem is that using readInt is actually _as slow_, at least using my test script :-((

On Tue, Oct 7, 2008 at 5:12 AM, Don Stewart <[1]dons@galois.com> wrote:

666wman: > a slight modification to compile it : > > change: > where sumFile = sum . map read . L.words > to : > where sumFile = sum . map (read . L.unpack) . L.words > > but it's actually _slower_ than the non-bytestring version.

Never unpack a bytestring.

import qualified Data.ByteString.Lazy.Char8 as S

main = print . go 0 =<< S.getContents where go n s = case S.readInt s of Nothing -> n Just (k,t) -> go (n+k) (S.tail t)

Assuming you're reading int, integers or doubles.

References

Visible links 1. mailto:dons@galois.com

wman

11:36 p.m.

new figures, after updating bytestring (0.9.0.1.1 -> 0.9.1.2) && using -O2 time Main < nums real 0m2.531s user 0m0.015s sys 0m0.015s time Main2 < nums real 0m13.999s user 0m0.015s sys 0m0.015s time Main3 < nums real 0m2.796s user 0m0.015s sys 0m0.015s thats more like it, even the unpacking didn't hurt so much. the morals: "Thou shalt update your libraries" & "Thou shalt not forget to turn on optimizations" before bitching it's too slow ;-))) thx On Tue, Oct 7, 2008 at 5:19 AM, Don Stewart wrote:

...

Hmm. How are you compiling it? Using bytestring 0.9.1.x ?

Should be fast,

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

Assuming you're turning on optimisations ( ghc -O2 )

-- Don

wman

11:54 p.m.

just for the kicks i tried the new version of bytestring without -O2 and the results were even worse: time Main1 < nums real 0m40.092s user 0m0.015s sys 0m0.015s time Main3 < nums real 0m41.405s user 0m0.015s sys 0m0.015s it got pwned even by this very naive ruby scipt (which, btw, chewed through some ~600 MB of memory ;-)) File.open("nums","r") do |f| puts((f.read.split.each {|x| x.to_i }).length) end time ruby nums.rb real 0m21.609s user 0m0.015s sys 0m0.015s so it probably can't be stressed enough: repeat "-O2" and now seriously: is there a reason why -O2 shouldn't be made the default (and allowing to turn off optimizations by -O0 perhaps) ? On Tue, Oct 7, 2008 at 5:36 AM, wman <666wman@gmail.com> wrote:

...

new figures, after updating bytestring (0.9.0.1.1 -> 0.9.1.2) && using -O2

time Main < nums real 0m2.531s user 0m0.015s sys 0m0.015s

time Main2 < nums real 0m13.999s user 0m0.015s sys 0m0.015s

time Main3 < nums real 0m2.796s user 0m0.015s sys 0m0.015s

thats more like it, even the unpacking didn't hurt so much.

the morals: "Thou shalt update your libraries" & "Thou shalt not forget to turn on optimizations" before bitching it's too slow ;-)))

thx

On Tue, Oct 7, 2008 at 5:19 AM, Don Stewart wrote:

...
Hmm. How are you compiling it? Using bytestring 0.9.1.x ?

Should be fast,

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

Assuming you're turning on optimisations ( ghc -O2 )

-- Don

Brandon S. Allbery KF8NH

11:59 p.m.

On 2008 Oct 6, at 23:54, wman wrote:

...

just for the kicks i tried the new version of bytestring without -O2 and the results were even worse:

Yep, ByteString does a lot of stuff that is really bad when GHC isn't performing stream fusion --- but when it is, the code compiles down to tight machine code that can rival C.

...

is there a reason why -O2 shouldn't be made the default (and allowing to turn off optimizations by -O0 perhaps) ?

-O2 breaks on some platforms, I think. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Bulat Ziganshin

7 Oct 7 Oct

6:25 a.m.

New subject: Re[2]: newbie questions (read, etc., with Data.ByteString.Lazy.Char8)

Hello Brandon, Tuesday, October 7, 2008, 7:59:06 AM, you wrote:

...

is there a reason why -O2 shouldn't be made the default (and allowing to turn off optimizations by -O0 perhaps) ?

it compiles ~2x slower and firces more recompilation (because it does inter-module inlining). so it's not perfect for edit-compile-debug runs -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Don Stewart

12:27 a.m.

666wman:

...

just for the kicks i tried the new version of bytestring without -O2 and the results were even worse:

Note that without -O or -O2 no strictness analysis is performed. So that tail recursive loop ... won't be. You could try -Onot -fstrictness just for kicks, to see why strictness analysis is important when writing in a tail recursive style. -- Don

wman

12:44 a.m.

ghc -Onot -fstrictness --make Main1.hs && ghc -Onot -fstrictness --make Main2.hs && ghc -Onot -fstrictness --make Main3.hs time Main1 < nums real 0m39.530s user 0m0.015s sys 0m0.030s time Main2 < nums real 0m14.078s user 0m0.015s sys 0m0.015s time Main3.exe < nums real 0m41.342s user 0m0.015s sys 0m0.015s still, i'm going to google up strictness analysis to at least know what made no difference in this case ;-) btw, why is the example #2 ( http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=2) (which kicks collective asses of all other participants) not considered in the shootout ? Too much optimizations ? On Tue, Oct 7, 2008 at 6:27 AM, Don Stewart wrote:

...

666wman:

...
just for the kicks i tried the new version of bytestring without -O2 and the results were even worse:

Note that without -O or -O2 no strictness analysis is performed. So that tail recursive loop ... won't be. You could try -Onot -fstrictness just for kicks, to see why strictness analysis is important when writing in a tail recursive style.

-- Don

Don Stewart

12:50 a.m.

666wman:

...

ghc -Onot -fstrictness --make Main1.hs && ghc -Onot -fstrictness --make Main2.hs && ghc -Onot -fstrictness --make Main3.hs

time Main1 < nums real 0m39.530s user 0m0.015s sys 0m0.030s

time Main2 < nums real 0m14.078s user 0m0.015s sys 0m0.015s

time Main3.exe < nums real 0m41.342s user 0m0.015s sys 0m0.015s

still, i'm going to google up strictness analysis to at least know what made no difference in this case ;-)

Oh, that's interesting. Did they actually recompile? (you used -fforce-recomp). Otherwise, might be an idea to use hpaste.org to paste teh Main1,2,3 source code, since I think people reading don't know what is represented in each program at this point. -- Don

Bulat Ziganshin

6:28 a.m.

New subject: Re[2]: newbie questions (read, etc., with Data.ByteString.Lazy.Char8)

Hello wman, Tuesday, October 7, 2008, 8:44:48 AM, you wrote:

...

btw, why is the example #2 (http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=2) (which kicks collective asses of all other participants) not considered in the shootout ? Too much optimizations ?

it's cheating - application code written in low-level style instead of using library functions (if you will write in the same style in C it will probably even several times faster) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

George Pollard

5:19 p.m.

New subject: newbie questions (read, etc., with Data.ByteString.Lazy.Char8)

On Mon, 2008-10-06 at 21:06 -0500, Mike Coleman wrote:

...

There's a readInt method, which I guess I could use, but it returns a Maybe, and I don't see how I can easily strip that off.

You can use Data.Maybe's 'mapMaybe' function "The mapMaybe function is a version of map which can throw out elements. In particular, the functional argument returns something of type Maybe b. If this is Nothing, no element is added on to the result list. If it just Just b, then b is included in the result list." So we change this:

...

-- lazy version (doesn't compile)

-- file: ch08/SumFile.hs

import qualified Data.ByteString.Lazy.Char8 as L

main = do contents <- L.getContents print (sumFile contents) where sumFile = sum . map read . L.words

To this: import qualified Data.ByteString.Lazy.Char8 as B import Data.Maybe (mapMaybe) main = do contents <- B.getContents print (sumFile contents) where sumFile = sum . map fst . mapMaybe B.readInt . B.words Now, there is a problem with this; the original version will not read integers that are followed by garbage, whereas this one will. The difference is between readInt and read: (read "2a") will fail, (readInt "2a") will return Just (2,"a") readInt is really designed with parsers in mind. So, we need to filter out anything readInt returns with a non-empty string in the right part of the tuple. Ending up with this: sumFile = sum . map fst . filter (B.null . snd) . mapMaybe B.readInt . B.words Actually, after saying this, the original version probably just dies when given garbage :P

6072

Age (days ago)

6072

Last active (days ago)

List overview

Download

19 comments

8 participants

participants (8)

Brandon S. Allbery KF8NH
Bulat Ziganshin
Daniel Fischer
Don Stewart
George Pollard
Jason Dagit
Mike Coleman
wman