
Hi! I am really missing the (general) split function built in standard Haskell. I do not understand why there is something so specific as words and lines but not a simple split? The same goes for join. Yes, I can of course define them but ... in an interactive mode it would be quite handy to have them there. Or am I wrong and are those hidden somewhere? So what are common ways to get around this? What are elegant definitions? Inline definitions? Mitar

Mitar wrote:
Hi!
I am really missing the (general) split function built in standard Haskell. I do not understand why there is something so specific as words and lines but not a simple split? The same goes for join.
Yes, I can of course define them but ... in an interactive mode it would be quite handy to have them there.
Or am I wrong and are those hidden somewhere?
So what are common ways to get around this? What are elegant definitions? Inline definitions?
As I understand it, they don't exist because nobody can agree on the best way to define them. For joining you probably want some combination of intersperse and concat, e.g. unlines = concat . intersperse "\n" For splitting, do we split on a given character? A predicate? Do we keep the splitting character or throw it away? Do we generate empty sublists or elide them? Apparently nobody can agree on these points, and writing a function with all possible options would be very messy...

Hi all,
On Dec 28, 2007 12:38 PM, Andrew Coppin
For joining you probably want some combination of intersperse and concat, e.g.
unlines = concat . intersperse "\n"
And that's what we have :-) Data.List.intercalate :: [a] -> [[a]] -> [a] Data.List.intercalate x = concat . intersperse x
For splitting, do we split on a given character? A predicate? Do we keep the splitting character or throw it away? Do we generate empty sublists or elide them? Apparently nobody can agree on these points, and writing a function with all possible options would be very messy...
If you use intercalate to join, I would presume that you would want to use an inverse of it to split. I'd write it like this: split :: Eq a => [a] -> [a] -> [[a]] split at xs | Just xs' <- stripPrefix at xs = [] : split at xs' split at (x:xs) = (x:r) : rs where (r:rs) = split at xs split at [] = [[]] --with, if your version of the libraries is as old as mine and doesn't have Data.List.stripPrefix, stripPrefix (p:ps) (x:xs) | p == x = stripPrefix ps xs stripPrefix [] xs = Just xs stripPrefix _ _ = Nothing - Benja

On Dec 28, 2007 9:51 AM, Benja Fallenstein
If you use intercalate to join, I would presume that you would want to use an inverse of it to split. I'd write it like this:
But alas, words and lines differ on how properly to split, so there's no hint from the standard library which is preferred. Of course, there is no inverse to intercalate, so if you want to use a "logical" approach, perhaps you'd want to define split first, and then define your join as the inverse of split. Darcs uses this latter approach, which gives us a version of "lines" that isn't quite the same as the one in the prelude. David

On Dec 28, 2007 3:55 PM, David Roundy
On Dec 28, 2007 9:51 AM, Benja Fallenstein
wrote: If you use intercalate to join, I would presume that you would want to use an inverse of it to split. I'd write it like this:
Of course, there is no inverse to intercalate
Right; I misspoke. What I meant was that you would want a split such that intercalate a (split a xs) = a for finite, total (a,xs) (and, since it's achievable, even for infinite xs). Of course, (split a xs = [xs]) satisfies that, but if we add the requirement that split is also supposed to do its job :-) then I think split is fully specified except for whether (split a [] = []) or (split a [] = [[]]). The latter seems better to me; e.g., it satisfies split a (x ++ a ++ y) = split a x ++ split a y
so if you want to use a "logical" approach, perhaps you'd want to define split first, and then define your join as the inverse of split.
If your join comes out as being intercalate, I suppose it's six of one, half a dozen of the other :-) - Benja

On Dec 28, 2007 4:24 PM, Benja Fallenstein
Right; I misspoke. What I meant was that you would want a split such that
intercalate a (split a xs) = a
for finite, total (a,xs) (and, since it's achievable, even for infinite xs). Of course, (split a xs = [xs]) satisfies that, but if we add the requirement that split is also supposed to do its job :-) then I think split is fully specified except for whether (split a [] = []) or (split a [] = [[]]).
I take that back; it doesn't specify whether (split "xx" "xxx") should be ["","x"] or ["x",""]. I prefer the former, because working left-to-right seems natural, and because it makes split more lazy (it can yield ("":_) before evaluating the input enough to see the third 'x').
The latter seems better to me; e.g., it satisfies
split a (x ++ a ++ y) = split a x ++ split a y
I take that back, too: split "xx" "xxxx" = ["","",""] split "xx" "x" ++ split "xx" "xxx" = ["x"] ++ ["","x"] ("...but, still..." :-)) - Benja

Programmer with perl background would think split like: <list of string> = split <regex> <original string> Since regex is involved, it is specific to (Byte)String, not a generic list. Also it appears one would need help from Text.Regex(.PCRE) to do that.
intercalate a (split a xs) = a
Right; I misspoke. What I meant was that you would want a split such
This identity rule does not hold for perl's join/split if regex is used.
Steve
-----Original Message-----
On Dec 28, 2007 4:24 PM, Benja Fallenstein
intercalate a (split a xs) = a
------------------------------------------------------------------------------ Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station, New Jersey, USA 08889), and/or its affiliates (which may be known outside the United States as Merck Frosst, Merck Sharp & Dohme or MSD and in Japan, as Banyu - direct contact information for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential, proprietary copyrighted and/or legally privileged. It is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please notify us immediately by reply e-mail and then delete it from your system. ------------------------------------------------------------------------------

Lihn, Steve wrote:
Programmer with perl background would think split like: <list of string> = split <regex> <original string> Since regex is involved, it is specific to (Byte)String, not a generic list. Also it appears one would need help from Text.Regex(.PCRE) to do that.
intercalate a (split a xs) = a
This identity rule does not hold for perl's join/split if regex is used.
Steve
Well, libpcre does not have a split function. One can already write a split that uses the high level Regex API. The only reason you might want to dig into regex-pcre's Text.Regex.PCRE would be if it would help efficiency. Specifically, regex-base defines a RegexContext instance which is: ( RegexLike a b => RegexContext a b (b, b, b) ) : The text before the match, the text of the match, the text after the match So you can iteratively generate the pieces that split returns. -- Chris

Hi!
On Dec 28, 2007 5:51 PM, Lihn, Steve
Since regex is involved, it is specific to (Byte)String, not a generic list.
Oh, this gives me an interesting idea: making regular expressions more generic. Would not it be interesting and useful (but not really efficient) to have patterns something like: foo :: Eq a => a -> ... foo (_{4}'b') = ... which would match a list with four elements ending with an element 'b'. Or: foo (_+';'_+';'_) = ... which would match a list with embedded two ';' elements. (Last _ matched the remaining of the list.) OK, maybe guards are not the proper place to implement this as would add a possibility to make a really messy Haskell programs. But extending regular expressions to work on any list of elements with type implementing Eq would be realy powerfull. And then we could use split in many other than just text processing contexts. Of course, the problem in both cases is implementing something like regular expressions efficiently, especially on lists, but this is why there are smart people around. :-) Mitar

On Fri, Dec 28, 2007 at 11:40:24PM +0100, Mitar wrote:
Hi!
On Dec 28, 2007 5:51 PM, Lihn, Steve
wrote: Since regex is involved, it is specific to (Byte)String, not a generic list.
Oh, this gives me an interesting idea: making regular expressions more generic.
Would not it be interesting and useful (but not really efficient) to have patterns something like:
foo :: Eq a => a -> ... foo (_{4}'b') = ...
Your idea has precedent, albeit only in an april fool's joke. http://www.dcs.gla.ac.uk/~partain/haskerl/partain-1.html (section 5, "(Polymorphic) Regular expressions") Stefan

Would not it be interesting and useful (but not really efficient) to have patterns something like:
foo :: Eq a => a -> ... foo (_{4}'b') = ...
which would match a list with four elements ending with an element 'b'. Or:
foo (_+';'_+';'_) = ...
Maybe you could use view patterns? foo (regex "(.*);(.*);(.*)") -> [c1, c2, c3] = ...
OK, maybe guards are not the proper place to implement this as would add a possibility to make a really messy Haskell programs. But extending regular expressions to work on any list of elements with type implementing Eq would be realy powerfull. And then we could use split in many other than just text processing contexts.
Of course, the problem in both cases is implementing something like regular expressions efficiently, especially on lists, but this is why there are smart people around. :-)
Parser combinators basically provide generalized regexes, and they all take lists of arbitrary tokens rather than just Chars. I've written a simple combinator library before that dispenses with all the monadic goodness in favor of a group combinator and returning [Either [tok] [tok]], which sort of gives parsers a simpler regexy flavor (Left is "out of group chunk" and Right is "in group chunk"). foo (match (group any `sepBy` char ';') -> [c1, c2, c3]) = ...

Hi!
On Dec 29, 2007 12:13 AM, Evan Laforge
Maybe you could use view patterns?
foo (regex "(.*);(.*);(.*)") -> [c1, c2, c3] = ...
Oh. Beautiful. :-)
Parser combinators basically provide generalized regexes, and they all take lists of arbitrary tokens rather than just Chars. I've written a simple combinator library before that dispenses with all the monadic goodness in favor of a group combinator and returning [Either [tok] [tok]], which sort of gives parsers a simpler regexy flavor (Left is "out of group chunk" and Right is "in group chunk").
foo (match (group any `sepBy` char ';') -> [c1, c2, c3]) = ...
Ah. Is this accessible somewhere? Mitar

Parser combinators basically provide generalized regexes, and they all take lists of arbitrary tokens rather than just Chars. I've written a simple combinator library before that dispenses with all the monadic goodness in favor of a group combinator and returning [Either [tok] [tok]], which sort of gives parsers a simpler regexy flavor (Left is "out of group chunk" and Right is "in group chunk").
foo (match (group any `sepBy` char ';') -> [c1, c2, c3]) = ...
Ah. Is this accessible somewhere?
Unfortunately it's just a toy since it uses the inefficient naive parser combinator thing. I wrote it while reading one of those pre-monadic parser combinator papers. A better version could wrap parsec or ReadP (which unfortunately doesn't seem to be parameterized on the token type). Anyway, in case you're still interested: http://ofb.net/~elaforge/hs/group_parse.hs Excuse the poor quality, it was a long time ago. At the time I was thinking of a regex->parser combinator compiler, with the catch being that you can extend the regex language to include your own parsers, like: let env = [("num", number_in_range 0 255), ("dotted", many1 letter `sepBy` char '.')] p = from_regex env "<num>, +<num>: (<dotted>)" in parse p input You could even have a magic character that makes them be postfix operators, and include combinators:, like "[a-z]<:comma_sep>", but you're probably better off writing a real parser here.

Mitar wrote:
Hi!
On Dec 28, 2007 5:51 PM, Lihn, Steve
wrote: Since regex is involved, it is specific to (Byte)String, not a generic list.
Oh, this gives me an interesting idea: making regular expressions more generic.
The new regex-base API is fairly generic. If you look at the classes in regex-base's Text.Regex.RegexLike: class Extract source => RegexLike regex source where matchAll :: regex -> source -> [MatchArray] matchOnce :: regex -> source -> Maybe MatchArray matchCount :: regex -> source -> Int matchTest :: regex -> source -> Bool matchAllText :: regex -> source -> [MatchText source] matchOnceText :: regex -> source -> Maybe (source, MatchText source, source) you can see that the "regex" type parameter is fully abstract, and that the "source" being searched is also fully abstract. The reason for having those specific class methods is to allow for the instance to expose the most efficient way to do each operation. You could make an instance for string seaching (e.g. KMP or BM searching). Pretty much any "search" or "find" operation could be made into an instance of RegexLike. The main constraint is that the MatchArray/MatchText use Int indexing and the Extract instance wants to be able to do lookup with this: type MatchOffset = Int type MatchLength = Int type MatchArray = Array Int (MatchOffset, MatchLength) type MatchText source = Array Int (source, (MatchOffset, MatchLength)) class Extract source where before :: Int -> source -> source after :: Int -> source -> source empty :: source extract :: (Int, Int) -> source -> source One benefit is that all the RegexContext instances are implemented by using just the above class methods, so all the polymorphic match/matchM will immediately work. If there is ever a strong need for going beyond the range of Int indexing, then one could either make new variants of the classes, or add methods to the existing ones. But if you are searching over 2GB of something, then perhaps have this generic type class API is not the top priority.
Would not it be interesting and useful (but not really efficient) to have patterns something like:
foo :: Eq a => a -> ... foo (_{4}'b') = ...
which would match a list with four elements ending with an element 'b'. Or:
foo (_+';'_+';'_) = ...
which would match a list with embedded two ';' elements. (Last _ matched the remaining of the list.)
OK, maybe guards are not the proper place to implement this as would add a possibility to make a really messy Haskell programs. But extending regular expressions to work on any list of elements with type implementing Eq would be realy powerfull. And then we could use split in many other than just text processing contexts.
Of course, the problem in both cases is implementing something like regular expressions efficiently, especially on lists, but this is why there are smart people around. :-)
Mitar

On Dec 28, 2007 11:40 PM, Mitar
Would not it be interesting and useful (but not really efficient) to have patterns something like:
foo :: Eq a => a -> ... foo (_{4}'b') = ...
which would match a list with four elements ending with an element 'b'. Or:
foo (_+';'_+';'_) = ...
which would match a list with embedded two ';' elements. (Last _ matched the remaining of the list.)
I suggest you take at look at HaRP, Haskell Regular Patterns: http://www.cs.chalmers.se/~d00nibro/harp/ It hasn't been updated for a while but it should still be useable. Cheers, Josef

Josef Svenningsson writes:
On Dec 28, 2007 11:40 PM, Mitar
wrote: Would not it be interesting and useful (but not really efficient) to have patterns something like:
foo :: Eq a => a -> ... foo (_{4}'b') = ...
which would match a list with four elements ending with an element 'b'. Or:
foo (_+';'_+';'_) = ...
which would match a list with embedded two ';' elements. (Last _ matched the remaining of the list.)
I suggest you take at look at HaRP, Haskell Regular Patterns: http://www.cs.chalmers.se/~d00nibro/harp/
It hasn't been updated for a while but it should still be useable.
Also of interest might be XHaskell http://taichi.ddns.comp.nus.edu.sg/taichiwiki/XhaskellHomePage which adds XDuce style regular expression pattern matching to Haskell. Martin

On Fri, Dec 28, 2007 at 04:24:38PM +0100, Benja Fallenstein wrote:
On Dec 28, 2007 3:55 PM, David Roundy
wrote: On Dec 28, 2007 9:51 AM, Benja Fallenstein
wrote: If you use intercalate to join, I would presume that you would want to use an inverse of it to split. I'd write it like this:
Of course, there is no inverse to intercalate
Right; I misspoke. What I meant was that you would want a split such that
intercalate a (split a xs) = a
for finite, total (a,xs) (and, since it's achievable, even for infinite xs). Of course, (split a xs = [xs]) satisfies that, but if we add the requirement that split is also supposed to do its job :-) then I think split is fully specified except for whether (split a [] = []) or (split a [] = [[]]). The latter seems better to me; e.g., it satisfies
split a (x ++ a ++ y) = split a x ++ split a y
Yes, the latter is what darcs' linesPS does.
so if you want to use a "logical" approach, perhaps you'd want to define split first, and then define your join as the inverse of split.
If your join comes out as being intercalate, I suppose it's six of one, half a dozen of the other :-)
Well, your intercalate "\n" is not the same as "unlines" and the inverse of intercalate "\n" is not the same as lines, nor is its inverse (with " ") the same as words. It is true that intercalate " " is the same as unwords, however. So it does seem like the prelude doesn't really give us any hints as to what would be a useful generic join/split pair. -- David Roundy Department of Physics Oregon State University

So a simple thing occured to me today. Rather than worry about the correct behavior for a join/split pair, we should just add unintercalate to the library. A bit verbose as a name, but at least there's no ambiguity. --s On Dec 29, 2007, at 2:18 PM, David Roundy wrote:
On Fri, Dec 28, 2007 at 04:24:38PM +0100, Benja Fallenstein wrote:
On Dec 28, 2007 3:55 PM, David Roundy
wrote: On Dec 28, 2007 9:51 AM, Benja Fallenstein
wrote: If you use intercalate to join, I would presume that you would want to use an inverse of it to split. I'd write it like this:
Of course, there is no inverse to intercalate
Right; I misspoke. What I meant was that you would want a split such that
intercalate a (split a xs) = a
for finite, total (a,xs) (and, since it's achievable, even for infinite xs). Of course, (split a xs = [xs]) satisfies that, but if we add the requirement that split is also supposed to do its job :-) then I think split is fully specified except for whether (split a [] = []) or (split a [] = [[]]). The latter seems better to me; e.g., it satisfies
split a (x ++ a ++ y) = split a x ++ split a y
Yes, the latter is what darcs' linesPS does.
so if you want to use a "logical" approach, perhaps you'd want to define split first, and then define your join as the inverse of split.
If your join comes out as being intercalate, I suppose it's six of one, half a dozen of the other :-)
Well, your intercalate "\n" is not the same as "unlines" and the inverse of intercalate "\n" is not the same as lines, nor is its inverse (with " ") the same as words. It is true that intercalate " " is the same as unwords, however. So it does seem like the prelude doesn't really give us any hints as to what would be a useful generic join/split pair. -- David Roundy Department of Physics Oregon State University _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Albert Y. C. Lai wrote:
Mitar wrote:
I am really missing the (general) split function built in standard Haskell. I do not understand why there is something so specific as words and lines but not a simple split? The same goes for join.
Don't forget Text.Regex.splitRegex.
Which is just:
matchRegexAll p str = matchM p str
{- | Splits a string based on a regular expression. The regular expression should identify one delimiter.
This is unsafe if the regex matches an empty string. -}
splitRegex :: Regex -> String -> [String] splitRegex _ [] = [] splitRegex delim str = case matchRegexAll delim str of Nothing -> [str] Just (firstline, _, remainder, _) -> if remainder == "" then firstline : [] : [] else firstline : splitRegex delim remainder
Inlining the matchRegexAll/matchM means this is 8 lines of code. Any given split function is very short, but there are enough design choices that I think the best library is none at all; the user can write exactly what they want in <= 10 lines of code. Though now that I look at it again, I think I like
splitRegex :: Regex -> String -> [String] splitRegex _ [] = [] splitRegex delim strIn = loop strIn where loop str = case matchM delim str of Nothing -> [str] Just (firstline, _, remainder) -> if null remainder then [firstline,""] else firstline : loop remainder
slightly better. I'll eventually update the unstable regex-compat. -- Chris
participants (13)
-
Albert Y. C. Lai
-
Andrew Coppin
-
Benja Fallenstein
-
ChrisK
-
David Roundy
-
David Roundy
-
Evan Laforge
-
Josef Svenningsson
-
Lihn, Steve
-
Martin Sulzmann
-
Mitar
-
Stefan O'Rear
-
Sterling Clover