
tphyahoo wrote:
So the core question (speaking as a perler) is how do you write
my $s= 'abcdefg'; $s =~ s/a/z/g; $s =~ s/b/y/g; print "$s\n";
in haskell? There are various haskell regex libraries out there, including ones that advertise they are PCRE (Perl Compatible Reg Ex).
I updated the regex libraries for GHC 6.6. ( All the regex-* packages. ) The old API is still supported in Text.Regex. The old API has a replacement function, while the new API does not have one (yet). For simple regular expressions, where Posix and Perl agree, you can just use Text.Regex.subRegex which comes with GHC. In 6.6 this comes in the regex-compat package, and which calls the regex-posix backend via the interfaces defined in regex-base. All of these come with GHC, since GHC needs regex support to compile itself. So if you do not need more syntax than POSIX regex (with back references) then http://www.haskell.org/ghc/docs/latest/html/libraries/regex-compat/Text-Rege... works, but depends on the low performance posix-regex backend. This will run your example above, for instance. Better regex searching performance can be had by using the new interface via Text.Regex.Base with better backends and/or with Data.ByteString. In the future there will be Data.Sequence (of Char and perhaps Word8) support added to the backends. There is no updated API for performing replacements using a pluggable backend. The design space is too large with conflicting needs to be lazy or strict, time or space efficient, etc. The best thing is to write the replacement function that your application needs. You can use the new searching API (see micro-tutorial below) to write a replacement routine in less than a screen of code. For instance, the regex-compat version of Text.Regex.subRegex is
{- | Replaces every occurance of the given regexp with the replacement string.
In the replacement string, @\"\\1\"@ refers to the first substring; @\"\\2\"@ to the second, etc; and @\"\\0\"@ to the entire match. @\"\\\\\\\\\"@ will insert a literal backslash.
This is unsafe if the regex matches an empty string. -} subRegex :: Regex -- ^ Search pattern -> String -- ^ Input string -> String -- ^ Replacement text -> String -- ^ Output string subRegex _ "" _ = "" subRegex regexp inp repl = let bre = mkRegex "\\\\(\\\\|[0-9]+)" lookup _ [] _ = [] lookup [] _ _ = [] lookup match repl groups = case matchRegexAll bre repl of Nothing -> repl Just (lead, _, trail, bgroups) -> let newval = if (head bgroups) == "\\" then "\\" else let index = (read (head bgroups)) - 1 in if index == -1 then match else groups !! index in lead ++ newval ++ lookup match trail groups in case matchRegexAll regexp inp of Nothing -> inp Just (lead, match, trail, groups) -> lead ++ lookup match repl groups ++ (subRegex regexp trail repl)
You could just paste that code into a file that imports a different backend and it should work since it uses just the type class API. You might also improve on the above routine or specialize it. The above handle \0 \1 \2 substitutions (and \\ escaping) in the replacement string, including multi-digit references such as \15 for very large regular expressions. It operation only on [Char] and is somewhat lazy.
But which one to use? How hard to install? With the libs mentioned above, the "PCRE"-ness seems only to be for matching, not for substitutions.
I think if you paste the subRegex code above underneath an "import Text.Posix.PCRE" declaration then you get what you are looking for. To install: The regex-* package hosting is via darcs and has been copied/moved to http://darcs.haskell.org/packages/ (The stable regex-* versions) http://darcs.haskell.org/packages/regex-unstable/ (The unstable regex-* versions) so "darcs get --partial http://darcs.haskell.org/packages/regex-pcre" might be useful. They have (hopefully working) cabal files to make compiling and installing easy. Note that regex-pcre and regex-tre need libpcre and libtre to be installed separately. regex-posix needs a posix library, but GHC already provides this package with a working libary. These 3 come with GHC: regex-base defines the type classes and APIs and most RegexContext instances regex-compat imitates the old Text.Regex API using regex-posix regex-posix backend has awful performance. Not for heavy use. These 4 backends can be downloaded using darcs: regex-pcre uses libpcre and this imitates the PERL search syntaxs and semantics. regex-tre used libtre, a very fast posix-compatible (but LGPL) library. regex-parsec which is not very speedy, but is pure Haskell regex-dfa is pure haskell and fast, but cannot do subexpression capture yet, and has some problems (repeating an empty-matching pattern like "(a*)*" will create an infinite loop in compiling the regex). regex-devel is for hosting test suites and benchmarks. You won't need this.
http://www.cs.chalmers.se/~d00nibro/harp/ http://repetae.net/john/computer/haskell/JRegex/
JRegex is mostly superseded by the more flexible interface of regex-base. This flexibility allowed Data.ByteString to be used for efficiency as well as String. The new API for searching with regex works like this (a micro-tutorial): import Text.Regex.Base import Text.Regex.Posix -- or PCRE or Parsec or TRE or DFA The above should provide (=~) and (=~~) matching operators, which do ("text to be searched" =~ "regex to use") operations. The value returned is any of the instances of RegexContext (See documentation for Text.Regex.Base.Context). So in a numerical context (i.e. you demand an Int type) the search returns the number of matches, and in a more complicated context it returns all matches including all their captures (\0 \1 \2 \3 ...). =~ and =~~ differ in that the second operates in a Monadic context and this has access to a "fail" method when the search does not succeed. But =~ and =~~ are convenience methods for 1st : converting the regex string to a compiled form with 'makeRegex' 2nd : performing the search with 'match' (for =~) or 'matchM' (for =~~) The error handling (such as with a malformed regex) is primitive. The RegexMaker class defines 'makeRegex' as well as a more complicated 'makeRegexOpts' which lets you specify backend specific options (e.g. case-sensitivity). The error handling is primitive. The RegexContext class defines match and matchM which are still at a high level of the API and the return type is polymorphic (i.e. depends on the context of the call much like in PERL). The error handling is primitive. The RegexContext class does its magic by using the RegexLike class methods. The RegexLike methods are a medium level API and are implemented by instances in such modules as Text.Regex.Posix.String and Text.Regex.Posix.ByteString. The RegexMaker 'makeRegex' instances are also defined in these modules. The error handling is primitive. If you import a specific module such as Text.Regex.Posix.String then you can access the three lower-level API functions: 'compile' : used to build makeRegexOpts/makeRegex for RegexMaker 'execute' : used to perform a search and get back offsets and lengths of matching \0 \1 \2, etc, where \0 is the whole match. 'regexec' : used to perform a search and get back the actual substrings: (before match,the whole match,after the match,[list of captured sequences]) These three lower-level API functions are available in all the backends, and use "Either String a" to report errors in a sensible way. If for some very bizarre reason you don't like my API's then you can import whatever modules happen to have the raw API for that backend. For PCRE this is Text.Regex.PCRE.Wrap (which is Wrap.hsc file in the source, not Wrap.hs). This has the FFI interface to the libpcre library and a bunch of haskell enumerations and types defined, and exports functions with names starting with "wrap" that handle correctly calling the c library from Haskell. These are very specific to the PCRE backend, and the errors are always returned in a very detailed way via Either types.
So, I would like to know a good answer to this as well.
thomas.
Feel free to ask for more help and better answer.