
TextRegexLazy: The Text.Regex.Lazy replacement and enhancement for Text.Regex

New Version: 0.44
Where: http://sourceforge.net/projects/lazy-regex

Changes from 0.33 to 0.44
* Cabal
* Compile with -Wall -Werror
* Change DFAEngineFPS from Data.FastPackedString to Data.ByteString (still fairly untested)

This was my first time packaging with cabal, and I am hoping it works for you. The tests have been cleaned up and now Cabal can run them. The change from FPS to ByteString was trivial.

Question 1: What more would people want from a Regex engine for Data.ByteString?

Question 2: Is there interest in getting this into an official release of the base libraries? The Compat module could at least replace or sit alongside the performance sink of the current Text.Regex code.

--
Chris

"I define UNIX as 30 definitions of regular expressions living under one roof."
* Donald Knuth: Chapter 33 of the book Digital Typography, p. 649.

Hello Chris, Thursday, July 13, 2006, 1:03:19 AM, you wrote:
This was my first time packaging with cabal, and I am hoping it works for you.
are you included Makefile? this makes building & installation somewhat simpler for a user
Question 2: Is there interest in getting this into an official release of the base libraries? The Compat module could at least replace or sit alongside the performance sink of the current Text.Regex code.
i'm 120% want to see ByteString, regular expressions matching for String and ByteString, and JRegex (=~ operator implementation) to be included in GHC 6.6 -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Chris,
Thursday, July 13, 2006, 1:03:19 AM, you wrote:
This was my first time packaging with cabal, and I am hoping it works for you.
are you included Makefile? this makes building & installation somewhat simpler for a user
Yes, but the makefile is used just to compile Setup.hs to ./setup and tell the user to run that instead.
Question 2: Is there interest in getting this into an official release of the base libraries? The Compat module could at least replace or sit alongside the performance sink of the current Text.Regex code.
i'm 120% want to see ByteString, regular expressions matching for String and ByteString, and JRegex (=~ operator implementation) to be included in GHC 6.6
That typeclass interface is very handy, BUT it expects the thing being matched against is a list of something. This prevents making ByteString an instance of RegexLike.

The answer will be to alter the type class to not make such an assumption. Luckily John Meacham put JRegex under the 3 clause BSD, so I will
* Make a modified version of the type classes
* Make Text.Regex.Lazy an instance of these type classes
* Port JRegex to be instances of these type classes (links to PCRE!)

Then I or someone else can
* Implement an efficient instance of Bytestring being handled by PCRE.

I expect step zero will be "Make a darcs repository" and step (-1) will be "Learn how to make a remotely accessible darcs repository"

--
Chris

Hello Chris, Thursday, July 13, 2006, 12:17:30 PM, you wrote:
are you included Makefile? this makes building & installation somewhat simpler for a user
Yes, but the makefile is used just to compile Setup.hs to ./setup and tell the user to run that instead.
i've attached my Makefile. it somewhat simplifies building and installation of any lib by requiring only "make install" to do all 3 steps. i think that compiling setup.hs into an executable doesn't make much sense - runghc is fast enough (compared to the real work performed by setup.hs), and creating an executable file is not compatible with systems such as Hugs
Question 2: Is there interest in getting this into an official release of the base libraries? The Compat module could at least replace or sit alongside the performance sink of the current Text.Regex code.
i'm 120% want to see ByteString, regular expressions matching for String and ByteString, and JRegex (=~ operator implementation) to be included in GHC 6.6
That typeclass interface is very handy, BUT it expects the thing being matched against is a list of something. This prevents making ByteString an instance of RegexLike.
The answer will be to alter the type class to not make such an assumption. Luckily John Meacham put JRegex under the 3 clause BSD, so I will * Make a modified version of the type classes * Make Text.Regex.Lazy an instance of these type classes * Port JRegex to be instances of these type classes (links to PCRE!) Then I or someone else can * Implement an efficient instance of Bytestring being handled by PCRE.
regexps support for ByteStrings already exists:
========================================================================
btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here:
http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc
http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
========================================================================
well, i'm just dumb user telling what i want to see in GHC 6.6:
* regexp matching for Strings and ByteStrings
* perl-like syntax for doing it
* ability to select regexp engine for each matching operation and using of most efficient ones (Lazy for String, posix or pcre (?) for ByteString) by default
i also know that Simon Marlow want to see JRegex(-like) engine included in 6.6 (see http://hackage.haskell.org/trac/ghc/ticket/710 )
what you mentioned is just implementation details for me, the dumb user :)
I expect step zero will be "Make a darcs repository" and step (-1) will be "Learn how to make a remotely accessible darcs repository"
http://www.abridgegame.org/darcs/manual/bigpage.html -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Chris,
Thursday, July 13, 2006, 12:17:30 PM, you wrote:
Question 2: Is there interest in getting this into an official release of the base libraries? The Compat module could at least replace or sit alongside the performance sink of the current Text.Regex code. i'm 120% want to see ByteString, regular expressions matching for String and ByteString, and JRegex (=~ operator implementation) to be included in GHC 6.6
That typeclass interface is very handy, BUT it expects the thing being matched against is a list of something. This prevents making ByteString an instance of RegexLike.
The answer will be to alter the type class to not make such an assumption. Luckily John Meacham put JRegex under the 3 clause BSD, so I will * Make a modified version of the type classes * Make Text.Regex.Lazy an instance of these type classes * Port JRegex to be instances of these type classes (links to PCRE!) Then I or someone else can * Implement an efficient instance of Bytestring being handled by PCRE.
regexps support for ByteStrings already exists:
========================================================================
btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here: http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc ========================================================================
Thanks, I'll go take a look at that. I have pcre + JRegex installed now. And I have a remote darcs repository with my current version imported. (URL coming after I am sure it won't get re-organized).
well, i'm just dumb user telling what i want to see in GHC 6.6:
* regexp matching for Strings and ByteStrings * perl-like syntax for doing it * ability to select regexp engine for each matching operation and using of most efficient ones (Lazy for String, posix or pcre (?) for ByteString) by default
i also know that Simon Marlow want to see JRegex(-like) engine included in 6.6 (see http://hackage.haskell.org/trac/ghc/ticket/710 )
what you mentioned is just implementation details for me, the dumb user :)
As a user, the JRegex API can also only support a single Regex type and a single backend. But it would be really handy to be able to use different types of regular expressions. Mainly there are going to be different regex syntax possibilities:
* Old Text.Regex syntax, also emulated by Text.Regex.Lazy.Compat
* The "Full" syntax of Text.Regex.Lazy (close to Extended regex)
* regex.h syntax (perhaps Basic as well as Extended)
* pcre.h syntax

All of these might conceivably come in [Word8] and [Char] sources.

The backend will vary: at least because we will want both a Lazy version and a hand-off to pcre library version (if installed) or regex library (more likely to be installed). And the plan is to generalize the target to be either [Char] or ByteString.

New Question: What do people think is the best way to use data/newtype/class to allow for
1) Different regex syntax as different types
2) Different target [Char] or ByteString
3) Different engine in the back end.

My first thought is that the type of the regex encodes both which syntax is in use and which back-end will be used. Something like

"Hello" =~ (pcre "el+")

would use PCRE syntax and pcre library backend against the [Char]. And

(pack "Hello") =~ (compatRE "el+")

would use the old Text.Regex syntax and my lazy backend against the ByteString produced by pack.

Other answers?

--
Chris
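The idea of letting the regex type select both syntax and backend can be sketched roughly as below. This is only an illustration under stated assumptions: the class name Matches and the newtypes PcreRE and CompatRE are hypothetical, only pcre and compatRE come from the message above, and the "backends" are stand-ins that do a literal substring search instead of real regex matching (so a pattern such as "el+" would be treated as literal text here).

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

import qualified Data.ByteString.Char8 as B
import Data.List (isInfixOf)

-- Hypothetical regex types: the type records which syntax/backend is meant.
newtype PcreRE   = PcreRE String
newtype CompatRE = CompatRE String

-- Smart constructors named after the ones used in the message.
pcre :: String -> PcreRE
pcre = PcreRE

compatRE :: String -> CompatRE
compatRE = CompatRE

infix 4 =~

-- Hypothetical class: one instance per (regex type, source type) pair.
class Matches regex source where
  (=~) :: source -> regex -> Bool

-- Stand-in for "PCRE backend against [Char]": literal substring search only.
instance Matches PcreRE String where
  src =~ (PcreRE pat) = pat `isInfixOf` src

-- Stand-in for "lazy backend against ByteString": literal substring search only.
instance Matches CompatRE B.ByteString where
  src =~ (CompatRE pat) = B.pack pat `B.isInfixOf` src
```

With these definitions, "Hello" =~ pcre "ell" picks the first instance and B.pack "Hello" =~ compatRE "ell" picks the second, purely from the types involved.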

Hello Chris, Friday, July 14, 2006, 8:05:50 PM, you wrote:
2) Different target [Char] or ByteString
just to let you know - there is also lazy ByteString datatype which is essentially [ByteString]. at the last end, RE support for it will be also demanded
My first thought is that the type of the regex encodes both which syntax is in use and which back-end will be used. Something like
"Hello" =~ (pcre "el+")
would use PCRE syntax and pcre library backend against the [Char]. And
(pack "Hello") =~ (compatRE "el+")
Would use the old Text.Regex syntax and my lazy backend against the ByteString produced by pack.
Other answers?
i will be very pleased if it will be possible to write just String as regexp and don't worry (and even don't know!) about existence of different regexp engines until i actually need specific one:

"Hello" =~ "el+"
pack "Hello" =~ "el+"

while for requesting specific regex engine your syntax is really great!

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
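One possible way to get this behaviour is to make a bare String an instance of the regex class too, with a default engine chosen per source type. The sketch below assumes a hypothetical class Matches with an extra hypothetical method engineUsed that only reports which engine would have been selected; the matching itself is a literal substring search standing in for the real engines, and the engine names are just labels.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

import qualified Data.ByteString.Char8 as B
import Data.List (isInfixOf)

-- Hypothetical wrapper for an explicitly requested engine.
newtype PcreRE = PcreRE String

pcre :: String -> PcreRE
pcre = PcreRE

infix 4 =~

-- Hypothetical class; engineUsed exists only to make the selection visible.
class Matches regex source where
  (=~)       :: source -> regex -> Bool
  engineUsed :: source -> regex -> String

-- Bare String pattern against String: default engine (stand-in search).
instance Matches String String where
  src =~ pat     = pat `isInfixOf` src
  engineUsed _ _ = "lazy"

-- Bare String pattern against ByteString: a different default engine.
instance Matches String B.ByteString where
  src =~ pat     = B.pack pat `B.isInfixOf` src
  engineUsed _ _ = "posix"

-- Explicit request for a specific engine still works through its own type.
instance Matches PcreRE String where
  src =~ (PcreRE pat) = pat `isInfixOf` src
  engineUsed _ _      = "pcre"
```

A caller who writes "Hello" =~ "ell" never names an engine, while "Hello" =~ pcre "ell" opts into a specific one.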

Chris Kuklewicz wrote:
New Question: What do people think is the best way to use data/newtype/class to allow for 1) Different regex syntax as different types
perhaps modules. (e.g. import Text.Regex.Posix.Extended)
2) Different target [Char] or ByteString
overloading
3) Different engine in the back end.
definitely modules for this one, because different engines might be provided by separate packages. I suspect that most uses of regexes don't care much about the engine used, and one syntax covers most uses (e.g. extended regex.h, which is the default syntax used by Text.Regex). So Text.Regex should be mapped to something like Text.Regex.Posix.Extended, with overloading to provide the =~ operator with [Char] or ByteString. Cheers, Simon
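A rough sketch of this split, assuming a single default engine (which in practice would be chosen by which module is imported, e.g. something like Text.Regex.Posix.Extended) and overloading only on the target type. The class name RegexSource and its method toChars are hypothetical, and the matching is again a literal substring search standing in for a real engine.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import qualified Data.ByteString.Char8 as B
import Data.List (isInfixOf)

-- Hypothetical class: anything that can be searched as a sequence of Chars.
class RegexSource s where
  toChars :: s -> String

instance RegexSource String where
  toChars = id

instance RegexSource B.ByteString where
  toChars = B.unpack

infix 4 =~

-- One operator, one engine (a stand-in literal search here); only the
-- target type is overloaded. A different engine or syntax would live in a
-- different module exporting its own (=~).
(=~) :: RegexSource s => s -> String -> Bool
src =~ pat = pat `isInfixOf` toChars src
```

Converting through String is chosen only to keep the sketch short; an efficient ByteString instance would avoid the unpacking.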

On Fri, Jul 14, 2006 at 05:05:50PM +0100, Chris Kuklewicz wrote:
As a user, the JRegex API can also only support a single Regex type and a single backend. But it would be really handy to be able to use different types of regular expressions. Mainly there are going to be different regex syntax possibilities:
This isn't true, the API is a class, you can create as many instances as you like for it. In fact, it comes with at least 2 back ends, and at least a couple different instances for the regex syntax. It was specifically designed as a framework for many regular expression backends to be used via a common and useful interface. John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham wrote:
On Fri, Jul 14, 2006 at 05:05:50PM +0100, Chris Kuklewicz wrote:
As a user, the JRegex API can also only support a single Regex type and a single backend. But it would be really handy to be able to use different types of regular expressions. Mainly there are going to be different regex syntax possibilities:
This isn't true, the API is a class, you can create as many instances as you like for it. In fact, it comes with at least 2 back ends, and at least a couple different instances for the regex syntax. It was specifically designed as a framework for many regular expression backends to be used via a common and useful interface.
John
JRegex does require the source to be a list [x]:
class RegexContext x a where
    (=~) :: RegexLike r x => [x] -> r -> a
    (=~~) :: (Monad m, RegexLike r x) => [x] -> r -> m a

class RegexLike r a | r -> a where
    matchTest :: r -> [a] -> Bool
    matchCount :: r -> [a] -> Int
    matchAll :: r -> [a] -> [(Array Int (Int,Int))]
    matchOnce :: r -> [a] -> Bool -> Maybe (Array Int (Int,Int))
The List requirement precludes a ByteString instance. The functional dependency "r->a" also prevents mixing different backends with different data source types. The Bool parameter to matchOnce is there so matchAll can be implemented in terms of matchOnce, exploiting the fact that the source data type is a list. (Though this is not very optimal compared to a specialized matchAll). I am done rewriting the Posix regex and PCRE code with both String and ByteString as instances. The latest type classes (from today) look like:
type MatchArray = Array Int (Int,Int) -- (starting index,length)
class (RegexOptions regex compOpt execOpt) => RegexMaker regex source where
    makeRegex :: source -> regex
    makeRegexOpts :: compOpt -> execOpt -> source -> regex

class RegexLike regex source where
    matchAll :: regex -> source -> [MatchArray]
    matchCount :: regex -> source -> Int
    matchOnce :: regex -> source -> Maybe MatchArray
    matchTest :: regex -> source -> Bool
    -- defaults, written in terms of matchOnce and matchAll
    matchTest regex source = isJust (matchOnce regex source)
    matchCount regex source = length (matchAll regex source)
I have omitted the RegexOptions class for space (the job of the "Bool" to matchOnce is subsumed by the more general execOpt handling). Clearly I have taken the names and most of the types from JRegex. I don't have the cool polymorphic RegexContext yet, but that is the next step. Once the code stabilizes, I will post a link to the development darcs address.

The flexibility of source data type and backend is provided by making WrapPosix.hsc and WrapPCRE.hsc modules that expect a source type of CString/CStringLen and are comprehensive enough that the four files (Byte)?String(Posix|PCRE) are just .hs files (no -cpp needed) instead of .hsc files. These four use optimized routines for match(All|Count|Once|Test) instead of using either of the defaults.

The next backend to make instances for will be my "Text.Regex.Lazy" one based on Parsec. Then I will have 3 backends and two data source types, making 6 combinations. For example: I can compile a String regex as a PCRE and match that against a ByteString and against a String. The type of the regex source and the type of the data are separate. And I can make a new Regex from an old one with different execution options.

--
Chris
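To make the shape of the RegexLike class concrete, here is a minimal self-contained sketch with one toy backend and two source types. The class and MatchArray follow the definitions quoted above; everything else is an assumption for illustration: the Literal backend (a plain non-overlapping substring search, not a regex engine), the helper wholeMatch, and the ByteString instance that simply unpacks to String instead of using an optimized routine.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

import Data.Array (Array, listArray, (!))
import qualified Data.ByteString.Char8 as B
import Data.List (isPrefixOf)
import Data.Maybe (isJust, listToMaybe)

type MatchArray = Array Int (Int, Int) -- (starting index, length)

class RegexLike regex source where
  matchAll   :: regex -> source -> [MatchArray]
  matchCount :: regex -> source -> Int
  matchOnce  :: regex -> source -> Maybe MatchArray
  matchTest  :: regex -> source -> Bool
  -- Defaults in terms of matchOnce and matchAll.
  matchTest regex source  = isJust (matchOnce regex source)
  matchCount regex source = length (matchAll regex source)

-- Hypothetical toy backend: the "regex" is just literal text to search for.
newtype Literal = Literal String

instance RegexLike Literal String where
  -- Non-overlapping occurrences, left to right; an empty pattern never matches.
  matchAll (Literal pat) = go 0
    where
      n = length pat
      go _ [] = []
      go i s@(_ : rest)
        | n > 0 && pat `isPrefixOf` s =
            listArray (0, 0) [(i, n)] : go (i + n) (drop n s)
        | otherwise = go (i + 1) rest
  matchOnce regex = listToMaybe . matchAll regex

-- Same backend, second source type; unpacking is for brevity, not speed.
instance RegexLike Literal B.ByteString where
  matchAll regex  = matchAll regex . B.unpack
  matchOnce regex = matchOnce regex . B.unpack

-- Hypothetical helper: index 0 holds the span of the whole match.
wholeMatch :: MatchArray -> (Int, Int)
wholeMatch arr = arr ! 0
```

Neither instance defines matchTest or matchCount, so both fall back to the class defaults; a real backend would override them with the specialized routines described above.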

Hello Chris, Monday, July 24, 2006, 5:59:33 PM, you wrote:
The next backend to make instances for will be my "Text.Regex.Lazy" one based on Parsec. Then I will have 3 backends and two data source types, making 6 combinations. For example: I can compile a String regex as a PCRE and match that against a ByteString and against a String. The type of the regex source and the type of the data are separate. And I can make a new Regex from an old one with different execution options.
this looks really great. i hope that your work will be included in next GHC version together with FPS lib, making the complete string processing solution. it will be great in some future to work via Stringable class (darcs get --partial http://darcs.haskell.org/SoC/fps-soc/ ) - afaik, it should become a part of FPS library -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (4): Bulat Ziganshin, Chris Kuklewicz, John Meacham, Simon Marlow