Re: [Haskell-cafe] Speed of character reading in Haskell

I started with the obvious main = getContents >>= print . tokenise where tokenise maps its way down a list of characters. This is very simple, very pleasant, and worked like a charm. However, the language has an INCLUDE directive, so I'm going to have to call readFile or something in the middle of tokenising, so the main tokeniser loop can't be a pure String -> [Token] function any more.
What about
tokenise :: [String] -> ([Token],[FilePath]) main = print . fst =<< mfix process where process (tokens,paths) = do mainContents <- getContents includes <- mapM readFile paths return $ tokenise $ mainContents : includes
I guess, it would be useful to replace ([Token],[FilePath]) with Writer [FilePath] [Token]
Method 1A (pure list processing) main = getContents >>= print . doit 0 doit n ('\n':cs) = doit (n+1) cs doit n ( _ :cs) = doit n cs doit n [] = n
I think, you should use something like (doit $! n+1) cs here.
In *retrospect*, it is really obvious why this was necessary, but I must say that in *prospect* I wasn't expecting it.
In fact, I was expecting this to be an issue even for 1A. I suppose, GHC is smart enough to suppress lazyness in the first method.

From what I can see of your program, it would greatly benefit from using Data.ByteString, is there an obvious reason not to use it ? (Some list operations are too expensive with ByteString but for most string processing it's perfectly fine and much faster than String).
-- Jedaï

On 7 Sep 2007, at 11:22 pm, Chaddaï Fouché wrote:
From what I can see of your program, it would greatly benefit from using Data.ByteString, is there an obvious reason not to use it ?
I am writing a a set of tools to process a legacy programming language, as I said. Speed is not, in fact, a central issue. It's just something that came up while I was exploring ways to express it. What *is* a central issue is getting something *right*, *readable*, *rapidly*. Using getContents and walking down a list I get something which is close to Lex in compactness but is much easier to work with; Lex has a nasty habit of either failing to cope entirely or requiring seriously nasty state hacking when you do anything even a little out of the way; Haskell list processing doesn't. Also, I greatly prefer writing to standards. While I turn to GHC for speed (when I can; trying to install GHC 6.6 under Solaris 2.9 was an amazingly painful and ultimately unsuccessful experience), I try to use other Haskell compilers as well, and that means sticking to standard libraries. Data.ByteString is many things, but defined in the Haskell 98 report is not one of them.
(Some list operations are too expensive with ByteString but for most string processing it's perfectly fine and much faster than String).
I'm sure it's true, but it's quite irrelevant to my question, which is "why is using getChar so much slower than using getContents"?

Hi
(Some list operations are too expensive with ByteString but for most string processing it's perfectly fine and much faster than String).
I'm sure it's true, but it's quite irrelevant to my question, which is "why is using getChar so much slower than using getContents"?
Buffering, blocks and locks. Buffering: getChar demands to get a character now, which pretty much means you can't buffer. Blocks: getContents reads blocks at a time from the underlying library, whereas getChar has to do one character at a time. Locks: getChar has to acquire locks, as does getContents. However, because getContents can operate on blocks, this requires many fewer locks. Thanks Neil

On Mon, 2007-09-10 at 00:49 +0100, Neil Mitchell wrote:
Hi
(Some list operations are too expensive with ByteString but for most string processing it's perfectly fine and much faster than String).
I'm sure it's true, but it's quite irrelevant to my question, which is "why is using getChar so much slower than using getContents"?
Buffering, blocks and locks.
Buffering: getChar demands to get a character now, which pretty much means you can't buffer.
Blocks: getContents reads blocks at a time from the underlying library, whereas getChar has to do one character at a time.
Locks: getChar has to acquire locks, as does getContents. However, because getContents can operate on blocks, this requires many fewer locks.
A little more and that would have Dr. Seuss-esque.

On 10 Sep 2007, at 11:49 am, Neil Mitchell wrote:
Buffering, blocks and locks.
Buffering: getChar demands to get a character now, which pretty much means you can't buffer.
Eh what? getchar() in C "demands to get a character now", which is fully compatible with as much buffering as you want. In fact, looking at the GHC 6.6 sources, I see that getChar -> hGetChar which *does* do buffering.
Blocks: getContents reads blocks at a time from the underlying library, whereas getChar has to do one character at a time.
Again, getchar() in C *returns* one character at a time, but gets oodles of character from the underlying library (read(), as it happens). As noted above, GHC 6.6 *does* read blocks from the underlying library.
Locks: getChar has to acquire locks, as does getContents. However, because getContents can operate on blocks, this requires many fewer locks.
What's to lock against? I'm writing single-threaded code. As for getContents, "each time these lazy read functions are pulled on, they have to check whether the handle has indeed been closed", which is not entirely unlike locking. While getContents / hGetContents is not defined directly in terms of getChar / hGetChar, the implementation *does* do much the same things. (Same blocking and buffering; for locks see previous paragraph.) So it was natural to wonder whether the difference was in *my* code and if so, what I was doing wrong.

Hello ok, Monday, September 10, 2007, 6:09:27 AM, you wrote:
Locks: getChar has to acquire locks, as does getContents. However, because getContents can operate on blocks, this requires many fewer locks.
What's to lock against? I'm writing single-threaded code.
unfortunately, there is one common code which used for any i/o. you may look at http://haskell.org/haskellwiki/Library/Streams which allows to disable locking and thus provides very fast getChar implementation -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Neil, Monday, September 10, 2007, 3:49:06 AM, you wrote:
I'm sure it's true, but it's quite irrelevant to my question, which is "why is using getChar so much slower than using getContents"?
Buffering, blocks and locks.
in ghc, is entirely due to locking, which is the slowest operation in whole hGetChar algorithm
Buffering: getChar demands to get a character now, which pretty much means you can't buffer.
both uses the same buffering (512 bytes)
Blocks: getContents reads blocks at a time from the underlying library, whereas getChar has to do one character at a time.
i don't see difference between buffering and blocking
Locks: getChar has to acquire locks, as does getContents. However, because getContents can operate on blocks, this requires many fewer locks.
getContents doesn't lock Handle because it is supposed to use it in exclusive manner -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (6)
-
Bulat Ziganshin
-
Chaddaï Fouché
-
Derek Elkins
-
Miguel Mitrofanov
-
Neil Mitchell
-
ok