performance question

Hi list,

I wrote a script that reads matrix elements from standard input, parses the
input using a regular expression, and then bins the matrix elements by
magnitude. I wrote the same script in python (just to be sure :) ) and find
that the python version vastly outperforms the Haskell script. To be concrete:

$ time ./createMatrixDump.py -N 128 | ./printMatrixDecay
real    0m2.655s
user    0m2.677s
sys     0m0.095s

$ time ./createMatrixDump.py -N 128 | ./printMatrixDecay.py -
real    0m0.445s
user    0m0.615s
sys     0m0.032s

The Haskell script was compiled with "ghc --make printMatrixDecay.hs".

Could you have a look at the script and give me some pointers as to where I
could improve it, both in terms of performance and also generally, as I am
very new to Haskell?

Thanks already,

nick

On 08.02.2013 23:26, Nicolas Bock wrote:
> Hi list,
>
> I wrote a script that reads matrix elements from standard input, parses
> the input using a regular expression, and then bins the matrix elements by
> magnitude. I wrote the same script in python (just to be sure :) ) and
> find that the python version vastly outperforms the Haskell script.
General performance hints:

1) Strings are slow. Fast alternatives are text[1] for textual data and
bytestrings[2] for binary data. I can't say anything about the performance
of Text.Regex.Posix.

2) Appending to a list is the wrong operation to use in performance-sensitive
code: (++) traverses its first argument, so it's O(n) in its length.

What exactly are you trying to do? Create a histogram?
The Haskell script was compiled with "ghc --make printMatrixDecay.hs".
If you want performance you absolutely should use -O2.

[1] http://hackage.haskell.org/package/text
[2] http://hackage.haskell.org/package/bytestring
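As a minimal sketch of the bytestring suggestion (counting non-empty input
lines stands in for the real per-line work; countNonEmpty is an illustrative
name, not from the thread):

```haskell
import qualified Data.ByteString.Char8 as B

-- Strict ByteString avoids String's one-cons-cell-per-character cost.
-- Counting non-empty lines stands in for the real per-line processing.
countNonEmpty :: B.ByteString -> Int
countNonEmpty = length . filter (not . B.null) . B.lines

main :: IO ()
main = do
  contents <- B.getContents   -- reads all of stdin strictly
  print (countNonEmpty contents)
```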

On Fri, Feb 8, 2013 at 1:23 PM, Aleksey Khudyakov wrote:

> 1) Strings are slow. Fast alternatives are text[1] for textual data and
> bytestrings[2] for binary data. I can't say anything about the performance
> of Text.Regex.Posix.

Thanks for the suggestion, I will try that.

> 2) Appending to a list is the wrong operation to use in
> performance-sensitive code: (++) traverses its first argument, so it's
> O(n) in its length.
>
> What exactly are you trying to do? Create a histogram?

Yes, a histogram. The binning code is really a little awkward. I haven't
gotten used to thinking in terms of immutable objects yet, and this list
appending is really a pretty bad hack to kind of allow me to increment the
bin counts. How would one do this more Haskellishly?

> If you want performance you absolutely should use -O2.

I'll try that.

[1] http://hackage.haskell.org/package/text
[2] http://hackage.haskell.org/package/bytestring

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

On 09.02.2013 01:02, Nicolas Bock wrote:
> Yes, a histogram. The binning code is really a little awkward. I haven't
> gotten used to thinking in terms of immutable objects yet, and this list
> appending is really a pretty bad hack to kind of allow me to increment the
> bin counts. How would one do this more Haskellishly?
Histogramming is a bit awkward in Haskell. If we want to stick to immutable
data types, the best choice is a map from bin to number of entries: for every
new entry, select its bin and add 1 to that bin's count. But if we want to
store the data in an array, we have to deal with mutable state, since it's
not realistic to copy the array on every update. For that case I wrote a
library, histogram-fill[1]. Below is a program which does approximately the
same thing.
import Data.Histogram.Fill
import Data.Histogram (Histogram)

hb :: HBuilder String (Histogram LogBinD Int)
hb = forceInt -<< mkSimple (logBinDN 1e-8 10 10) <<- read

main :: IO ()
main = do
  l <- getContents
  print $ fillBuilder hb $ lines l
I cheated and used sed to strip the unused data. It uses String, so it's
still slower than python.

$ time (python gen.py -N 300 | sed 's/.*=//' | ./printMatrixDecay)
real    0m0.958s
user    0m2.096s
sys     0m0.052s

$ time (python gen.py -N 300 | python printMatrixDecay.py -)
real    0m0.590s
user    0m0.952s
sys     0m0.016s

[1] http://hackage.haskell.org/package/histogram-fill
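The immutable map-from-bin-to-count approach described above can be sketched
with Data.Map (bin edges copied from the decade bins in the thread's output;
binOf and histogram are illustrative names, not from the thread):

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Decade bin edges, matching the output shown in the thread.
binEdges :: [Double]
binEdges = [0, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 2]

-- Index of the half-open interval [edge_i, edge_{i+1}) containing x.
binOf :: Double -> Int
binOf x = length (takeWhile (<= x) (tail binEdges))

-- Immutable histogram: bin index -> count. One insertWith per element
-- replaces the list-append hack for incrementing bin counts.
histogram :: [Double] -> M.Map Int Int
histogram = foldl' (\m x -> M.insertWith (+) (binOf x) 1 m) M.empty

main :: IO ()
main = print (histogram [0.5, 0.5, 1e-5])
```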

On Fri, Feb 8, 2013 at 1:23 PM, Aleksey Khudyakov wrote:

> 1) Strings are slow. Fast alternatives are text[1] for textual data and
> bytestrings[2] for binary data. I can't say anything about the performance
> of Text.Regex.Posix.

Hi Aleksey,

could you show me how I would use ByteString? I can't get the script to
compile. It's complaining that:

    No instance for (RegexContext
                       Regex Data.ByteString.ByteString (AllTextSubmatches [] a0))

which is too cryptic for me. Is it not able to form a regular expression
with a ByteString argument? From the documentation of Text.Regex.Posix it
seems that it should be. Maybe it's because I am trying to "read (r!!1) ::
Double", which I am having issues with also. Is (r!!1) a ByteString? And if
so, how would I convert that to a Double?

Thanks,

nick

On 10.02.2013 02:30, Nicolas Bock wrote:
> Hi Aleksey,
>
> could you show me how I would use ByteString? I can't get the script to
> compile. It's complaining that:
>
>     No instance for (RegexContext
>                        Regex Data.ByteString.ByteString (AllTextSubmatches [] a0))
>
> which is too cryptic for me. Is it not able to form a regular expression
> with a ByteString argument? Maybe it's because I am trying to
> "read (r!!1) :: Double". Is (r!!1) a ByteString? And if so, how would I
> convert that to a Double?
It's an error message from the regex library you use. I can't say what
exactly it means, I've never used it, but most likely it cannot work with
bytestrings.

Most other languages rely on regexps as the go-to tool for parsing. In
Haskell, the main parsing tools are parser combinators such as parsec[1] or
attoparsec[2]. Parsec is more generic; attoparsec is much faster. Attached
is a program which uses attoparsec for parsing; it's about 2 times slower
than the C++ example posted in the thread.

[1] http://hackage.haskell.org/package/parsec
[2] http://hackage.haskell.org/package/attoparsec
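Attoparsec is the recommendation above; purely to illustrate the combinator
style without extra dependencies, here is the same idea with base's
Text.ParserCombinators.ReadP. The input line format is an assumption
(createMatrixDump.py isn't shown in the thread), namely "anything, then '=',
then a number":

```haskell
import Text.ParserCombinators.ReadP

-- Skip everything up to '=', then read the trailing number.
-- The line format is assumed; adjust to the real dump format.
value :: ReadP Double
value = do
  _ <- munch (/= '=')           -- greedily skip the label part
  _ <- char '='
  skipSpaces
  s <- munch1 (`elem` "0123456789eE.+-")
  case reads s of               -- reads handles scientific notation
    [(d, "")] -> return d
    _         -> pfail

parseValue :: String -> Maybe Double
parseValue str = case readP_to_S (value <* eof) str of
  [(d, _)] -> Just d
  _        -> Nothing

main :: IO ()
main = print (parseValue "matrix(1,2) = 1.5e-3")
```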

On Fri, Feb 8, 2013 at 1:23 PM, Aleksey Khudyakov wrote:

> If you want performance you absolutely should use -O2.

Another question: when I compile the code with --make and -O2 and then run
it on a larger matrix, I get this error message:

$ ./createMatrixDump.py -N 512 | ./printMatrixDecay
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

When I use "runghc" instead, I don't get an error. What does this error
mean, and how do I fix it?

Thanks,

nick
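A common cause of this error is lazy accumulation: foldl (or a non-strict
counter) builds a long chain of thunks that blows the stack when it is
finally forced, while foldl' forces the accumulator at each step. Whether
that is the cause here depends on the script, which isn't shown; a minimal
sketch of the pattern and the usual fix:

```haskell
import Data.List (foldl')

-- foldl defers every (+) as a thunk; forcing the final result can
-- overflow the stack on large inputs. foldl' evaluates the accumulator
-- at each step and runs in constant stack space.
sumLazy, sumStrict :: [Double] -> Double
sumLazy   = foldl  (+) 0   -- builds ((((0+1)+2)+3)+...) unevaluated
sumStrict = foldl' (+) 0   -- strict accumulator

main :: IO ()
main = print (sumStrict [1 .. 1000000])
```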

I've been playing with your example to optimize it a bit, I have to run but
here's what I have so far. It's about as fast as the Python code, I'll make
it faster when I have more time over the next few days.
See https://gist.github.com/etrepum/4747507 and
https://gist.github.com/etrepum/4747507/revisions

Do you mind posting createMatrixDump.py and printMatrixDecay.py? That would
certainly make it easier to help you.

Sorry, should have done this right away. Here are the other two scripts.

Heh, I wrote a C++ version and that is much faster than python ;)

bmaxa@maxa:~/haskell$ time ./createMatrixDump.py -N 128 > output.txt

real    0m0.041s
user    0m0.040s
sys     0m0.000s

bmaxa@maxa:~/haskell$ time ./printMatrixDecay.py - < output.txt
(-) read 16384 matrix elements (128x128 = 16384)
[0.00e+00, 1.00e-08) = 0 (0.00%) 0
[1.00e-08, 1.00e-07) = 0 (0.00%) 0
[1.00e-07, 1.00e-06) = 0 (0.00%) 0
[1.00e-06, 1.00e-05) = 0 (0.00%) 0
[1.00e-05, 1.00e-04) = 1 (0.00%) 1
[1.00e-04, 1.00e-03) = 15 (0.00%) 16
[1.00e-03, 1.00e-02) = 149 (0.00%) 165
[1.00e-02, 1.00e-01) = 1425 (0.00%) 1590
[1.00e-01, 1.00e+00) = 14794 (0.00%) 16384
[1.00e+00, 2.00e+00) = 0 (0.00%) 16384

real    0m0.081s
user    0m0.072s
sys     0m0.008s

bmaxa@maxa:~/haskell$ time ./printMatrixDecay < output.txt
read 16384 matrix elements (128x128 = 16384)
[0.00e+00, 1.00e-08) = 0 (0.00%) 0
[1.00e-08, 1.00e-07) = 0 (0.00%) 0
[1.00e-07, 1.00e-06) = 0 (0.00%) 0
[1.00e-06, 1.00e-05) = 0 (0.00%) 0
[1.00e-05, 1.00e-04) = 1 (0.01%) 1
[1.00e-04, 1.00e-03) = 15 (0.09%) 16
[1.00e-03, 1.00e-02) = 149 (0.91%) 165
[1.00e-02, 1.00e-01) = 1425 (8.70%) 1590
[1.00e-01, 1.00e+00) = 14794 (90.30%) 16384
[1.00e+00, 2.00e+00) = 0 (0.00%) 16384

real    0m0.018s
user    0m0.012s
sys     0m0.004s

Unfortunately g++ does not have std::regex implemented yet, so I used
libpcre ...

#include

Here is a Haskell version that is faster than python, almost as fast as C++.
You need to install the bytestring-lexing package for readDouble.

bmaxa@maxa:~/haskell$ time ./printMatrixDecay - < output.txt
read 16384 matrix elements (128x128 = 16384)
[0.00e0, 1.00e-8) = 0 (0.00%) 0
[1.00e-8, 1.00e-7) = 0 (0.00%) 0
[1.00e-7, 1.00e-6) = 0 (0.00%) 0
[1.00e-6, 1.00e-5) = 0 (0.00%) 0
[1.00e-5, 1.00e-4) = 1 (0.01%) 1
[1.00e-4, 1.00e-3) = 17 (0.10%) 18
[1.00e-3, 1.00e-2) = 155 (0.95%) 173
[1.00e-2, 1.00e-1) = 1434 (8.75%) 1607
[1.00e-1, 1.00e0) = 14777 (90.19%) 16384
[1.00e0, 2.00e0) = 0 (0.00%) 16384

real    0m0.031s
user    0m0.028s
sys     0m0.000s

bmaxa@maxa:~/haskell$ time ./printMatrixDecay.py - < output.txt
(-) read 16384 matrix elements (128x128 = 16384)
[0.00e+00, 1.00e-08) = 0 (0.00%) 0
[1.00e-08, 1.00e-07) = 0 (0.00%) 0
[1.00e-07, 1.00e-06) = 0 (0.00%) 0
[1.00e-06, 1.00e-05) = 0 (0.00%) 0
[1.00e-05, 1.00e-04) = 1 (0.00%) 1
[1.00e-04, 1.00e-03) = 17 (0.00%) 18
[1.00e-03, 1.00e-02) = 155 (0.00%) 173
[1.00e-02, 1.00e-01) = 1434 (0.00%) 1607
[1.00e-01, 1.00e+00) = 14777 (0.00%) 16384
[1.00e+00, 2.00e+00) = 0 (0.00%) 16384

real    0m0.081s
user    0m0.080s
sys     0m0.000s

Program follows...

import System.Environment
import Text.Printf
import Text.Regex.PCRE
import Data.Maybe
import Data.Array.IO
import Data.Array.Unboxed
import qualified Data.ByteString.Char8 as B
import Data.ByteString.Lex.Double (readDouble)

strataBounds :: UArray Int Double
strataBounds = listArray (0,10) [ 0.0, 1.0e-8, 1.0e-7, 1.0e-6, 1.0e-5,
                                  1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1, 1.0, 2.0 ]

newStrataCounts :: IO (IOUArray Int Int)
newStrataCounts = newArray (bounds strataBounds) 0

main = do
  l <- B.getContents
  let a = B.lines l
  strataCounts <- newStrataCounts
  n <- calculate strataCounts a 0
  let printStrataCounts :: IO ()
      printStrataCounts = do
        let s = round $ sqrt (fromIntegral n :: Double) :: Int
        printf "read %d matrix elements (%dx%d = %d)\n" n s s n
        printStrataCounts' 0 0
      printStrataCounts' :: Int -> Int -> IO ()
      printStrataCounts' i total
        | i < (snd $ bounds strataBounds) = do
            count <- readArray strataCounts i
            let p :: Double
                p = (100.0 * (fromIntegral count) :: Double) / (fromIntegral n :: Double)
            printf "[%1.2e, %1.2e) = %i (%1.2f%%) %i\n"
                   (strataBounds ! i) (strataBounds ! (i+1)) count p (total + count)
            printStrataCounts' (i+1) (total+count)
        | otherwise = return ()
  printStrataCounts

calculate :: IOUArray Int Int -> [B.ByteString] -> Int -> IO Int
calculate _ [] n = return n
calculate counts (l:ls) n = do
  let a = case getAllTextSubmatches $ l =~ B.pack "matrix.*= ([0-9eE.+-]+)$" :: [B.ByteString] of
            [_,v] -> Just (readDouble v) :: Maybe (Maybe (Double, B.ByteString))
            _     -> Nothing
      b = (fst . fromJust . fromJust) a
      loop :: Int -> IO ()
      loop i
        | i < (snd $ bounds strataBounds) =
            if (b >= (strataBounds ! i)) && (b < (strataBounds ! (i+1)))
              then do c <- readArray counts i
                      writeArray counts i (c+1)
              else loop (i+1)
        | otherwise = return ()
  if isNothing a
    then calculate counts ls n
    else do loop 0
            calculate counts ls (n+1)

Thanks so much for your efforts, this really helped!
Thanks again,
nick
On Sat, Feb 9, 2013 at 11:54 PM, Branimir Maksimovic wrote:

> Here is a Haskell version that is faster than python, almost as fast as
> C++. You need to install the bytestring-lexing package for readDouble.

On Tue, 12 Feb 2013 15:57:37 -0700, Nicolas Bock wrote:

> Here is a Haskell version that is faster than python, almost as fast as
> C++. You need to install the bytestring-lexing package for readDouble.

I was hoping Branimir could comment on how the improvements were allocated.

How much is due to Text.Regex.PCRE (which looks to be a wrapper for libpcre)?

How much can be attributed to using Data.ByteString?

You have to admit, it's amazing how well a byte-compiled, _dynamically
typed_ interpreter can do against an actual native-code compiler. Can't
regex be done effectively in Haskell? Is it something that can't be done,
or is it just such minimal effort to link to pcre that it's not worth the
trouble?

Brian

On Tuesday, February 12, 2013, Brian wrote:

> You have to admit, it's amazing how well a byte-compiled, _dynamically
> typed_ interpreter can do against an actual native-code compiler. Can't
> regex be done effectively in Haskell? Is it something that can't be done,
> or is it just such minimal effort to link to pcre that it's not worth the
> trouble?
I think that there are two bottlenecks: the regex engine, and converting a bytestring to a double. There doesn't appear to be a fast and accurate strtod implementation for Haskell, and the faster regex implementations that I could find appear to be unmaintained.
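A base-only way to turn a captured ByteString into a Double is to go through
String, which allocates and is exactly the kind of cost readDouble from
bytestring-lexing avoids (toDouble is an illustrative name, not from the
thread):

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Read (readMaybe)

-- Unpack to String and use readMaybe: simple and total, but the
-- round-trip through String is part of the bottleneck discussed above.
toDouble :: B.ByteString -> Maybe Double
toDouble = readMaybe . B.unpack

main :: IO ()
main = print (toDouble (B.pack "1.5e-3"))
```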

ByteString gains most of the improvement: with String, the input must be
converted to a CString internally by the regex library (this is a wrapper
for libpcre), while ByteString need not be. libpcre is much faster than the
POSIX one (I guess the POSIX one is also a wrapper). The interface for
libpcre is the same as for POSIX, so there is no real effort in replacing it.

On Tue, Feb 12, 2013 at 11:32 PM,
actual native-code compiler. Can't regex be done effectively in Haskell? Is it something that can't be done, or is it just such minimal effort to link to pcre that it's not worth the trouble?
PCRE is pretty heavily optimized. POSIX regex engines generally rely on vendor regex libraries which may not be well optimized; there is a native Haskell implementation as well, but that one runs into a different issue, namely a lack of interest (regexes are often seen as "foreign" to Haskell-think, so there's little interest in making them work well; people who *do* need them for some reason usually punt to pcre). -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

Since I have very little experience with Haskell and am not used to
Haskell-think yet, I don't quite understand your statement that regexes are
seen as foreign to Haskell-think. Could you elaborate? What would a more
"native" solution look like? From what I have learned so far, it seems to
me that Haskell is a lot about clear, concise, and well structured code. I
find regexes extremely compact and powerful, allowing for very concise
code, which should fit the bill perfectly, or shouldn't it?
Thanks,
nick
On Wed, Feb 13, 2013 at 8:12 AM, Brandon Allbery
On Tue, Feb 12, 2013 at 11:32 PM,
wrote: actual native-code compiler. Can't regex be done effectively in Haskell? Is it something that can't be done, or is it just such minimal effort to link to pcre that it's not worth the trouble?
PCRE is pretty heavily optimized. POSIX regex engines generally rely on vendor regex libraries which may not be well optimized; there is a native Haskell implementation as well, but that one runs into a different issue, namely a lack of interest (regexes are often seen as "foreign" to Haskell-think, so there's little interest in making them work well; people who *do* need them for some reason usually punt to pcre).
-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

One way in which regexps are "foreign to Haskell-think" is that, if they
break, they generally break at run-time. This could be ameliorated with
template haskell, but a substantial portion of Haskell coders find that a
smell itself.
On Wed, Feb 13, 2013 at 8:32 AM, Nicolas Bock
Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear, concise, and well structured code. I find regexes extremely compact and powerful, allowing for very concise code, which should fit the bill perfectly, or shouldn't it?
Thanks,
nick
On Wed, Feb 13, 2013 at 8:12 AM, Brandon Allbery
wrote: On Tue, Feb 12, 2013 at 11:32 PM,
wrote: actual native-code compiler. Can't regex be done effectively in Haskell? Is it something that can't be done, or is it just such minimal effort to link to pcre that it's not worth the trouble?
PCRE is pretty heavily optimized. POSIX regex engines generally rely on vendor regex libraries which may not be well optimized; there is a native Haskell implementation as well, but that one runs into a different issue, namely a lack of interest (regexes are often seen as "foreign" to Haskell-think, so there's little interest in making them work well; people who *do* need them for some reason usually punt to pcre).
-- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

On Wed, 2013-02-13 at 08:39 -0800, David Thomas wrote:
One way in which regexps are "foreign to Haskell-think" is that, if they break, they generally break at run-time. This could be ameliorated with template haskell
Care to elaborate on the "ameliorate using TH" part? I figure regexes would be mostly used to parse some runtime-provided string, so how could compile-TH provide any help? Nicolas

Well, these runtime errors are actually type errors. Regexps are really a DSL, one which is not embedded in Haskell. But it could be. Strings won't work for that, but something like this would:
filter (match $ "a" <> many anyChar <> ".txt") filenames
and this could certainly be produced by TH like this:
filter (match $(regexp "a.*\\.txt")) filenames
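For the curious, the combinator sketch above can be made runnable with the regex-applicative package (a sketch only: `anySym` stands in for anyChar, and `match` returns a Maybe, so filtering goes through isJust):

```haskell
import Control.Applicative (many)
import Data.Maybe (isJust)
import Text.Regex.Applicative (RE, anySym, match, string, sym)

-- "a" <> many anyChar <> ".txt", spelled out as typed combinators
pat :: RE Char String
pat = (\a mid ext -> a : mid ++ ext)
        <$> sym 'a' <*> many anySym <*> string ".txt"

main :: IO ()
main = print (filter (isJust . match pat) ["a1.txt", "b.txt", "axx.txt"])
-- prints ["a1.txt","axx.txt"]
```

A malformed pattern here is a type error at compile time, which is exactly the point being made.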
On Feb 13, 2013, at 8:43 PM, Nicolas Trangez
On Wed, 2013-02-13 at 08:39 -0800, David Thomas wrote:
One way in which regexps are "foreign to Haskell-think" is that, if they break, they generally break at run-time. This could be ameliorated with template haskell
Care to elaborate on the "ameliorate using TH" part? I figure regexes would be mostly used to parse some runtime-provided string, so how could compile-TH provide any help?
Nicolas

I don't think you can do much about "fails to match the input string" -
indeed, that's often desired behavior... and "matches the wrong thing" you
can only catch with testing.
The simplest place template haskell could help with is when the expression
isn't a valid expression in the first place, and will fail to compile. If
you're just validating, I don't think you can do better; in order to
improve your confidence of correctness, your only option is testing against
a set of positives and negatives.
If you're capturing, you might be able to do a little better, if you are
able to get some of that info into the types (number of capture groups
expected, for instance) - then, if your code expects to deal with a
different number of captured pieces than your pattern represents, it can be
caught at compile time.
If you're capturing strings that you intend to convert to other types, and
can decorate regexp components with the type they're going to capture
(which can then be quickchecked - certainly a pattern should never match
and then fail to read, &c), and if you are able to propagate this info
during composition, you might actually be able to catch a good chunk of
errors.
Note that much of this works quite a bit differently than most existing
regexp library APIs, where you pass a bare string and captures wind up in
some kind of list, which I expect is much of the reason no one's done it
yet (so far as I'm aware).
On Wed, Feb 13, 2013 at 8:43 AM, Nicolas Trangez
On Wed, 2013-02-13 at 08:39 -0800, David Thomas wrote:
One way in which regexps are "foreign to Haskell-think" is that, if they break, they generally break at run-time. This could be ameliorated with template haskell
Care to elaborate on the "ameliorate using TH" part? I figure regexes would be mostly used to parse some runtime-provided string, so how could compile-TH provide any help?
Nicolas

The fact that parsec and attoparsec exist and can be pressed into service
with reasonable performance (I think?) on tasks for which regexps are
suitable is probably another big part of the reason no one's done it yet.
I expect much of the plumbing would wind up looking a lot like those,
actually.
On Wed, Feb 13, 2013 at 9:43 AM, David Thomas
I don't think you can do much about "fails to match the input string" - indeed, that's often desired behavior... and "matches the wrong thing" you can only catch with testing.
The simplest place template haskell could help with is when the expression isn't a valid expression in the first place, and will fail to compile. If you're just validating, I don't think you can do better; in order to improve your confidence of correctness, your only option is testing against a set of positives and negatives.
If you're capturing, you might be able to do a little better, if you are able to get some of that info into the types (number of capture groups expected, for instance) - then, if your code expects to deal with a different number of captured pieces than your pattern represents, it can be caught at compile time.
If you're capturing strings that you intend to convert to other types, and can decorate regexp components with the type they're going to capture (which can then be quickchecked - certainly a pattern should never match and then fail to read, &c), and if you are able to propagate this info during composition, you might actually be able to catch a good chunk of errors.
Note that much of this works quite a bit different than most existing regexp library APIs, where you pass a bare string and captures wind up in some kind of list, which I expect is much of the reason no one's done it yet (so far as I'm aware).
On Wed, Feb 13, 2013 at 8:43 AM, Nicolas Trangez
wrote: On Wed, 2013-02-13 at 08:39 -0800, David Thomas wrote:
One way in which regexps are "foreign to Haskell-think" is that, if they break, they generally break at run-time. This could be ameliorated with template haskell
Care to elaborate on the "ameliorate using TH" part? I figure regexes would be mostly used to parse some runtime-provided string, so how could compile-TH provide any help?
Nicolas

On Wed, Feb 13, 2013 at 12:46 PM, David Thomas
The fact that parsec and attoparsec exist and can be pressed into service with reasonable performance (I think?) on tasks for which regexps are suitable is probably another big part of the reason no one's done it yet. I expect much of the plumbing would wind up looking a lot like those, actually.
When I started out with Haskell, one of my early thoughts was about designing a DSL for Icon-style pattern matching; I dropped it when I realized I was reinventing (almost identically, at least for its lower level combinators) Parsec. Nothing really to be gained except from a tutelary standpoint. And the mapping from Icon patterns to regex patterns is pretty much mechanical if you phrase it so you aren't executing code in the middle. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

On Wed, Feb 13, 2013 at 11:32 AM, Nicolas Bock
Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear,
The native solution is a parser like parsec/attoparsec. The problem with regexes is that you can't at compile time verify that, for example, you have as many matching groups in the regex as the code using it expects, nor does an optional matching group behave as a Maybe like it should; nor are there nice ways to recover. A parser gives you full control and better compile time checking, and is generally recommended. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

On 13.02.2013 21:41, Brandon Allbery wrote:
On Wed, Feb 13, 2013 at 11:32 AM, Nicolas Bock
mailto:nicolasbock@gmail.com> wrote: Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear,
The native solution is a parser like parsec/attoparsec. The problem with regexes is that you can't at compile time verify that, for example, you have as many matching groups in the regex as the code using it expects, nor does an optional matching group behave as a Maybe like it should; nor are there nice ways to recover. A parser gives you full control and better compile time checking, and is generally recommended.
Regexps only have this problem if they are compiled from a string. Nothing prevents building them from combinators instead. regex-applicative[1] uses this approach and is quite nice to use. [1] http://hackage.haskell.org/package/regex-applicative
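A small illustration of that point, assuming the regex-applicative package (the Entry type and the "row[:col]" format are invented for the example): captures carry real types, and an optional group really does come back as a Maybe:

```haskell
import Control.Applicative (optional, some)
import Data.Char (isDigit)
import Text.Regex.Applicative (RE, match, psym, sym)

-- Hypothetical "row[:col]" index pattern; the optional capture is a Maybe.
data Entry = Entry Int (Maybe Int) deriving (Eq, Show)

int :: RE Char Int
int = read <$> some (psym isDigit)

entry :: RE Char Entry
entry = Entry <$> int <*> optional (sym ':' *> int)

main :: IO ()
main = do
  print (match entry "42")    -- Just (Entry 42 Nothing)
  print (match entry "42:7")  -- Just (Entry 42 (Just 7))
```

If the code consuming the capture expects the wrong shape, it is a compile-time type error rather than a runtime surprise.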

On 13.02.2013 21:41, Brandon Allbery wrote:
The native solution is a parser like parsec/attoparsec.
"Aleksey Khudyakov"
Regexps only have this problem if they are compiled from string. Nothing prevents from building them using combinators. regex-applicative[1] uses this approach and quite nice to use.
That _is_ a nice package, but it _is_ 'a parser like parsec/attoparsec'.

On Wed, Feb 13, 2013 at 5:45 PM,
On 13.02.2013 21:41, Brandon Allbery wrote:
The native solution is a parser like parsec/attoparsec.
"Aleksey Khudyakov"
replied Regexps only have this problem if they are compiled from string. Nothing prevents from building them using combinators. regex-applicative[1] uses this approach and quite nice to use.
That _is_ a nice package, but it _is_ 'a parser like parsec/attoparsec'.
Well, yes; it's a case in point. -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net

On 2/13/13 11:32 AM, Nicolas Bock wrote:
Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear, concise, and well structured code. I find regexes extremely compact and powerful, allowing for very concise code, which should fit the bill perfectly, or shouldn't it?
Regexes are powerful and concise for recognizing regular languages. They are not, however, very good for *parsing* regular languages; nor can they handle non-regular languages (unless you're relying on the badness of pcre). In other languages people press regexes into service for parsing because the alternative is using an external DSL like lex/yacc, javaCC, etc. Whereas, in Haskell, we have powerful and concise tools for parsing context-free languages and beyond (e.g., parsec, attoparsec). -- Live well, ~wren

wren ng thornton wrote:
Regexes are powerful and concise for recognizing regular languages. They are not, however, very good for *parsing* regular languages; nor can they handle non-regular languages (unless you're relying on the badness of pcre). In other languages people press regexes into service for parsing because the alternative is using an external DSL like lex/yacc, javaCC, etc. Whereas, in Haskell, we have powerful and concise tools for parsing context-free languages and beyond (e.g., parsec, attoparsec).
This cannot be emphasized heavily enough. Once you have learnt how to use one or more of these parsec libraries they will become your main tool for parsing everything from complex input languages like haskell itself, all the way down to relatively simple config files. Parsec style parsers are built up out of small composable (and more importantly reusable) combinators, that are easier to read and easier to maintain than anything other than the most trivial regex. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/
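As a concrete illustration of that combinator style on the thread's original task, here is a sketch assuming the attoparsec package; the "matrix(i,j) = x" line format is an invention for the example:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
  (Parser, char, decimal, double, parseOnly, skipSpace, string)

data Element = Element Int Int Double deriving (Eq, Show)

-- Parses lines like "matrix(1,2) = 0.25" straight into a typed value,
-- with no capture groups or post-hoc string conversion.
element :: Parser Element
element = Element
  <$> (string "matrix(" *> decimal)
  <*> (char ',' *> decimal <* char ')')
  <*> (skipSpace *> char '=' *> skipSpace *> double)

main :: IO ()
main = print (parseOnly element "matrix(1,2) = 0.25")
-- prints Right (Element 1 2 0.25)
```

Each piece (the index parser, the value parser) is a reusable combinator, which is the maintainability win being described.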

I have to agree that reading and maintaining regular expressions can be
challenging :)
On Wed, Feb 13, 2013 at 9:50 PM, Erik de Castro Lopo
wren ng thornton wrote:
Regexes are powerful and concise for recognizing regular languages. They are not, however, very good for *parsing* regular languages; nor can they handle non-regular languages (unless you're relying on the badness of pcre). In other languages people press regexes into service for parsing because the alternative is using an external DSL like lex/yacc, javaCC, etc. Whereas, in Haskell, we have powerful and concise tools for parsing context-free languages and beyond (e.g., parsec, attoparsec).
This cannot be emphasized heavily enough.
Once you have learnt how to use one or more of these parsec libraries they will become your main tool for parsing everything from complex input languages like haskell itself, all the way down to relatively simple config files.
Parsec style parsers are built up out of small composable (and more importantly reusable) combinators, that are easier to read and easier to maintain than anything other than the most trivial regex.
Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

Just to play devil's advocate:
100% agreed that there are better things to do in Haskell _source code_ than regexps.
The thing about regexps is that they can be accepted at run time as _data_.
This means, for example, that they can be put in whatever you use for localisation.
See for example YESEXPR/NOEXPR in

On 2/13/13 11:18 PM, wren ng thornton wrote:
On 2/13/13 11:32 AM, Nicolas Bock wrote:
Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear, concise, and well structured code. I find regexes extremely compact and powerful, allowing for very concise code, which should fit the bill perfectly, or shouldn't it?
Regexes are powerful and concise for recognizing regular languages. They are not, however, very good for *parsing* regular languages; nor can they handle non-regular languages (unless you're relying on the badness of pcre). In other languages people press regexes into service for parsing because the alternative is using an external DSL like lex/yacc, javaCC, etc. Whereas, in Haskell, we have powerful and concise tools for parsing context-free languages and beyond (e.g., parsec, attoparsec).
Just to be clear, the problem isn't that proper regexes are only good for regular languages (many files have regular syntax after all). The problem is that regexes are only good for recognition. They're an excellent tool for deciding whether a given string is "good" or "bad"; but they're completely unsuitable for the task of parsing/interpreting a string into some structure or semantic response. If you've ever used tools like yacc or javaCC, one of the crucial things they offer is the ability to add these semantic responses. Parser combinator libraries in Haskell are similar, since the string processing is integrated into a programming language, so we can say things like:

    myParser = do
        x <- blah
        guard (p x)
        y <- blargh
        return (f x y)

where p and f can be arbitrary Haskell functions. Perl extends regular expressions to try to do things like this, but it's extremely baroque, hard to get right, and impossible to maintain. (N.B., I was raised on Perl and still love it.) And at some point we have to call into question the idea of regexes as an embedded DSL when we then turn around and try to have Perl be a DSL embedded into the regex language.
One of the big things that makes regexes so nice is that they identify crucial combinators like choice and repetition. However, once those combinators have been identified, we can just offer them directly as functions in the host language. No need for a special DSL or special syntax. The big trick is doing this efficiently. Parser combinators were an academic curiosity for a long time until Parsec came around and made them efficient. And we've come a long way since then: with things like attoparsec, PEG parsing, and non-monadic applicative parsers (which can perform more optimizations because they can identify the structure of the grammar).
The theory of regular expressions is indeed beautiful and elegant. However, it's a theory of recognition, not a theory of parsing; and that's a crucial distinction.
Haskell is about clear, concise, and well-structured code; but to be clear, concise, and well-structured we have to choose the right tool for the job. -- Live well, ~wren

(I'll be brief because my head is hurting, but please don't interpret that
as an intent to offend)
A few points:
1) Capture groups are all you need to do some meaningful interpretation of
data; these were around long before perl.
2) Yacc is typically used in conjunction with lex, partly for (a)
efficiency and partly for (b) ease of use (compared to writing out [a-z] as
production rules).
3) I've actually used lex without yacc (well, flex without bison) when
faced with dealing with a language that's regular (and easy enough to
express that way - cf. an enormous finite subset of a context-free
language).
2b is mostly irrelevant in Haskell, as Parsec already provides functions
that can easily match the same things a regexp would.
2a, if it stands up to testing, is the best argument for ripping things
apart in Haskell using a DFA. Parsec and cousins are efficient, but it's
hard to beat a single table lookup per character. The questions are 1) is
the difference enough to matter in many cases, and 2) is there a way to get
this out of parsec without touching regexps? (It's not impossible that
parsec already recognizes when a language is regular, although I'd be
weakly surprised).
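To make the "one table lookup per character" point concrete, here is a tiny hand-rolled DFA in plain Haskell (states and transitions written out by hand for [0-9]+(\.[0-9]+)?, not generated from a regex):

```haskell
import Data.Char (isDigit)
import Data.List (foldl')

-- States: 0 start, 1 integer digits (accepting), 2 just saw '.',
-- 3 fraction digits (accepting); -1 is the dead state.
step :: Int -> Char -> Int
step 0 c | isDigit c = 1
step 1 c | isDigit c = 1
step 1 '.'           = 2
step 2 c | isDigit c = 3
step 3 c | isDigit c = 3
step _ _             = -1

-- One transition per input character, then a final acceptance check.
accepts :: String -> Bool
accepts s = foldl' step 0 s `elem` [1, 3]

main :: IO ()
main = print (map accepts ["42", "3.14", "3.", ".5", "x"])
-- prints [True,True,False,False,False]
```

For a real array-indexed table one would use an unboxed Vector keyed by (state, byte), but the per-character cost structure is the same.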
On Thu, Feb 14, 2013 at 3:07 PM, wren ng thornton
On 2/13/13 11:18 PM, wren ng thornton wrote:
On 2/13/13 11:32 AM, Nicolas Bock wrote:
Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear, concise, and well structured code. I find regexes extremely compact and powerful, allowing for very concise code, which should fit the bill perfectly, or shouldn't it?
Regexes are powerful and concise for recognizing regular languages. They are not, however, very good for *parsing* regular languages; nor can they handle non-regular languages (unless you're relying on the badness of pcre). In other languages people press regexes into service for parsing because the alternative is using an external DSL like lex/yacc, javaCC, etc. Whereas, in Haskell, we have powerful and concise tools for parsing context-free languages and beyond (e.g., parsec, attoparsec).
Just to be clear, the problem isn't that proper regexes are only good for regular languages (many files have regular syntax afterall). The problem is that regexes are only good for recognition. They're an excellent tool for deciding whether a given string is "good" or "bad"; but they're completely unsuitable for the task of parsing/interpreting a string into some structure or semantic response. If you've ever used tools like yacc or javaCC, one of the crucial things they offer is the ability to add these semantic responses. Parser combinator libraries in Haskell are similar, since the string processing is integrated into a programming language so we can say things like:
myParser = do
    x <- blah
    guard (p x)
    y <- blargh
    return (f x y)
where p and f can be arbitrary Haskell functions. Perl extends regular expressions to try to do things like this, but it's extremely baroque, hard to get right, and impossible to maintain. (N.B., I was raised on Perl and still love it.) And at some point we have to call into question the idea of regexes as an embedded DSL when we then turn around and try to have Perl be a DSL embedded into the regex language.
One of the big things that makes regexes so nice is that they identify crucial combinators like choice and repetition. However, once those combinators have been identified, we can just offer them directly as functions in the host language. No need for a special DSL or special syntax. The big trick is doing this efficiently. Parser combinators were an academic curiosity for a long time until Parsec came around and made them efficient. And we've come a long way since then: with things like attoparsec, PEG parsing, and non-monadic applicative parsers (which can perform more optimizations because they can identify the structure of the grammar).
The theory of regular expressions is indeed beautiful and elegant. However, it's a theory of recognition, not a theory of parsing; and that's a crucial distinction. Haskell is about clear, concise, and well-structured code; but to be clear, concise, and well-structured we have to choose the right tool for the job.
-- Live well, ~wren

It's worth remembering that the main gain from lex/yacc originally had more to do with making the generated programs fit into the 64K address space of a PDP-11 than with any direct performance efficiency. -- brandon s allbery kf8nh On Thursday, February 14, 2013 at 6:27 PM, David Thomas wrote:
(I'll be brief because my head is hurting, but please don't interpret that as an intent to offend)
A few points:
1) Capture groups are all you need to do some meaningful interpretation of data; these were around long before perl.
2) Yacc is typically used in conjunction with lex, partly for (a) efficiency and partly for (b) ease of use (compared to writing out [a-z] as production rules).
3) I've actually used lex without yacc (well, flex without bison) when faced with dealing with a language that's regular (and easy enough to express that way - cf. an enormous finite subset of a context-free language).
2b is mostly irrelevant in Haskell, as Parsec already provides functions that can easily match the same things a regexp would.
2a, if it stands up to testing, is the best argument for ripping things apart in Haskell using a DFA. Parsec and cousins are efficient, but it's hard to beat a single table lookup per character. The questions are 1) is the difference enough to matter in many cases, and 2) is there a way to get this out of parsec without touching regexps? (It's not impossible that parsec already recognizes when a language is regular, although I'd be weakly surprised).
On Thu, Feb 14, 2013 at 3:07 PM, wren ng thornton
wrote: On 2/13/13 11:18 PM, wren ng thornton wrote:
On 2/13/13 11:32 AM, Nicolas Bock wrote:
Since I have very little experience with Haskell and am not used to Haskell-think yet, I don't quite understand your statement that regexes are seen as foreign to Haskell-think. Could you elaborate? What would a more "native" solution look like? From what I have learned so far, it seems to me that Haskell is a lot about clear, concise, and well structured code. I find regexes extremely compact and powerful, allowing for very concise code, which should fit the bill perfectly, or shouldn't it?
Regexes are powerful and concise for recognizing regular languages. They are not, however, very good for *parsing* regular languages; nor can they handle non-regular languages (unless you're relying on the badness of pcre). In other languages people press regexes into service for parsing because the alternative is using an external DSL like lex/yacc, javaCC, etc. Whereas, in Haskell, we have powerful and concise tools for parsing context-free languages and beyond (e.g., parsec, attoparsec).
Just to be clear, the problem isn't that proper regexes are only good for regular languages (many files have regular syntax afterall). The problem is that regexes are only good for recognition. They're an excellent tool for deciding whether a given string is "good" or "bad"; but they're completely unsuitable for the task of parsing/interpreting a string into some structure or semantic response. If you've ever used tools like yacc or javaCC, one of the crucial things they offer is the ability to add these semantic responses. Parser combinator libraries in Haskell are similar, since the string processing is integrated into a programming language so we can say things like:
myParser = do
    x <- blah
    guard (p x)
    y <- blargh
    return (f x y)
where p and f can be arbitrary Haskell functions. Perl extends regular expressions to try to do things like this, but it's extremely baroque, hard to get right, and impossible to maintain. (N.B., I was raised on Perl and still love it.) And at some point we have to call into question the idea of regexes as an embedded DSL when we then turn around and try to have Perl be a DSL embedded into the regex language.
One of the big things that makes regexes so nice is that they identify crucial combinators like choice and repetition. However, once those combinators have been identified, we can just offer them directly as functions in the host language. No need for a special DSL or special syntax. The big trick is doing this efficiently. Parser combinators were an academic curiosity for a long time until Parsec came around and made them efficient. And we've come a long way since then: with things like attoparsec, PEG parsing, and non-monadic applicative parsers (which can perform more optimizations because they can identify the structure of the grammar).
The theory of regular expressions is indeed beautiful and elegant. However, it's a theory of recognition, not a theory of parsing; and that's a crucial distinction. Haskell is about clear, concise, and well-structured code; but to be clear, concise, and well-structured we have to choose the right tool for the job.
-- Live well, ~wren
participants (14)
- Aleksey Khudyakov
- Bob Ippolito
- Brandon Allbery
- brandon s allbery kf8nh
- Branimir Maksimovic
- briand@aracnet.com
- David Thomas
- Erik de Castro Lopo
- MigMit
- Nicolas Bock
- Nicolas Trangez
- ok@cs.otago.ac.nz
- Richard A. O'Keefe
- wren ng thornton