
Hi Sjoerd,
It seems that only shift needs the reg field of the RegW datatype. So you can also replace the reg field with a shift field. This makes the regexp parser extensible, as there is no longer a dependence on the (closed) datatype Reg:
data RegW w c = RegW { active :: !Bool, empty :: !w, final_ :: !w, shift :: w -> c -> RegW w c }
Interesting observation. However, such an encoding would prevent the definition of some other functions on RegExp. More specifically, there are Show and Eq instances for the QuickCheck tests.
For example it is then easy to define the parser that matches nothing, which is the identity element of alt:
noMatch :: RegExp c noMatch = RegExp noMatchW
noMatchW :: Semiring w => RegW w c noMatchW = RegW False zero zero $ \_ _ -> noMatchW
Note that you can also define it with the current interface: noMatch :: RegExp c noMatch = psym "(Big Lambda)" (const False) Maybe I'll add it to the next version. I only need a better string representation ;)
But otherwise I do wonder if the parser needs to be extensible.
I have some ideas for extending the matcher. For example /a{2,5}/ is currently translated into /aa(a(a(a)?)?)?/ but it may be possible to handle it without such blowup. I also want to add substring matching, i.e., the possibility to find out against which strings parenthesized parts of the regexp were matched. But as the closed Reg type is not exported I can freely change it along with any matcher extension.
For example some XML Schema implementations that are based on finite automata have special cases for the xs:all construct, which matches a list of elements, each occurring once in any order. But I tried a straightforward implementation and it works fine:
eachOnce :: [RegExp c] -> RegExp c eachOnce [] = eps eachOnce ps = eachOnce' ps [] where eachOnce' [] _ = noMatch eachOnce' (p:ps) qs = (p `seq_` eachOnce (ps ++ qs)) `alt` eachOnce' ps (p:qs)
Neat! That's also worth adding. I find eachOnce :: [RegExp c] -> RegExp c eachOnce = foldr alt noMatch . map (foldr seq_ eps) . permutations even clearer but your version is *much* better as it uses nesting to combine all alternatives that start with the same regexp. Thanks! Sebastian -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)