Re: [Haskell-cafe] Re: Why is there no splitBy in the list module?

On 2006-07-12 at 23:24BST "Brian Hulley" wrote:
Christian Maeder wrote:
Donald Bruce Stewart schrieb:
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Yes, the second form is needed for "words", but the first form is needed for "lines", where one final empty element needs to be removed from your version!
Prelude> lines "a\nb\n" ["a","b"] Prelude> lines "a\n\nb\n\n" ["a","","b",""]
Prelude.lines and Prelude.unlines treat '\n' as a terminator instead of a separator. I'd argue that this is poor design, since information is lost ie lines . unlines === id whereas unlines . lines =/= id whereas if '\n' had been properly conceived of as a separator, the identity would hold.
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?" but now you've saved me the effort. I've been bitten by unlines . lines /= id already; it's something we could gainfully change without wrecking too much code, methinks.
So I vote for the first option ie:
splitBy (=='a') "aabbaca" == ["","","bb","c",""]
Seconded. As far as naming is concerned, since this is a declarative language, surely we shouldn't be using active verbs like this? (OK I lost that argument way back in the mists of Haskell 0.0 with take. Before then I called "take" "first": "first n some_list" reads perfectly well). Jón -- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

As someone who's not used these library methods before, I would expect splitBy and splitLines to work differently to each other. When splitting into lines, I would assume that it is repeatedly applying the regular expression "([^t]*) (t|$)" where t is the line-terminator. You return the first group each time, and discard the rest. The 2nd group also handles the end-of-string boundary condition. As others have said, I would expect splitBy to return all of the zero-length matches as well - interlieving a "[^t]*" match-and-return with a "t" match-and-discard. The collapsed form of the output is the same as interleving a "[^t]" match-and-return with a "t*" match-and-discard. Matthew On Thursday 13 July 2006 10:16, Jon Fairbairn wrote:
On 2006-07-12 at 23:24BST "Brian Hulley" wrote:
Christian Maeder wrote:
Donald Bruce Stewart schrieb:
Question over whether it should be: splitBy (=='a') "aabbaca" == ["","","bb","c",""] or splitBy (=='a') "aabbaca" == ["bb","c"]
I argue the second form is what people usually want.
Yes, the second form is needed for "words", but the first form is needed for "lines", where one final empty element needs to be removed from your version!
Prelude> lines "a\nb\n" ["a","b"] Prelude> lines "a\n\nb\n\n" ["a","","b",""]
Prelude.lines and Prelude.unlines treat '\n' as a terminator instead of a separator. I'd argue that this is poor design, since information is lost ie lines . unlines === id whereas unlines . lines =/= id whereas if '\n' had been properly conceived of as a separator, the identity would hold.
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?" but now you've saved me the effort. I've been bitten by unlines . lines /= id already; it's something we could gainfully change without wrecking too much code, methinks.
So I vote for the first option ie:
splitBy (=='a') "aabbaca" == ["","","bb","c",""]
Seconded.
As far as naming is concerned, since this is a declarative language, surely we shouldn't be using active verbs like this? (OK I lost that argument way back in the mists of Haskell 0.0 with take. Before then I called "take" "first": "first n some_list" reads perfectly well).
Jón

On 2006-07-13 at 10:16BST I wrote:
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?"
Having thought about this for a bit, I've come up with the below. This is intended to give the general idea -- it's not polished code, and I'm not at all wedded to the names I've used, and it almost certainly should be split up.
module Parts (parts, fromParts, contiguousParts, segmentsSatisfying) where import List (groupBy)
parts p = map hack . groupBy sameSide . map (predicateToEither p) fromParts = concat . map fromEither
Now we should have fromParts . parts p ⊑ (id:: [a]->[a]) In particular, it should be possible to apply a function to all the Right parts, and then reconstruct the list with the Left parts left alone. for example fromParts . mapRight uc_first . parts Char.isAlpha $ "A random list of words\non lines" where uc_first [] = [] uc_first (a:as) = Char.toUpper a:as => "A Random List Of Words\nOn Lines"
contiguousParts p l = [a | Right a <- parts p l]
so words = contiguousParts Char.isAlphaNum
segmentsSatisfying predicate = concat . map dropSeps . parts predicate where dropSeps e = case e of Left x -> map (const []) $ tail x Right r -> [r]
So lines = segmentsSatisfying (/= '\n') ... but the tail in the definition of segmentsSatisfying makes me uneasy. needing the function `hack` suggests that the definition of parts is written badly
hack (Left x:rest) = Left (x: map (\(Left x) -> x) rest) hack (Right x:rest) = Right (x: map (\(Right x) -> x) rest)
what follows must surely exist in a library somewhere? I'd expect it to be called Either...
predicateToEither :: (a -> Bool) -> a -> Either a a predicateToEither p x = if p x then Right x else Left x
sameSide (Left _) (Left _) = True sameSide (Right _) (Right _) = True sameSide _ _ = False
fromEither (Left x) = x fromEither (Right x) = x
liftE f1 f2 = either (Left . f1) (Right . f2)
mapRight f = map (onRight f)
onRight f = liftE id f
mapLeft f = map (onLeft f) onLeft f = liftE f id
we could do some of half of those using this:
instance Functor (Either a) where fmap f (Right a) = Right (f a) fmap f (Left l) = Left l
Is a Monad instance any use?
instance Monad (Either a) where Right a >>= f = f a Left l >>= f = Left l return = Right
-- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

On 2006-07-20 at 18:31BST I wrote:
On 2006-07-13 at 10:16BST I wrote:
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?"
Having thought about this for a bit, I've come up with the below. This is intended to give the general idea -- it's not polished code,
And already some changes (but all to comments) I'd appreciate comments on the law below. --- lib/split.lhs 2006/07/21 08:54:28 1.1 +++ lib/split.lhs 2006/07/21 09:42:49 @@ -17,10 +17,17 @@ => "A Random List Of Words\nOn Lines" + I think the relevant law is + (forall x. all (not . p) (map f x)) => + segmentsSatisfying (not . p) . fromParts . mapRight f . parts p + == segmentsSatisfying (not . p) + + In other words, if f doesn't add any "not . p" elements, + the segments satisfying "not . p" are unchanged.
contiguousParts p l = [a | Right a <- parts p l]
- so words = contiguousParts Char.isAlphaNum + so words = contiguousParts (not . Char.isSpace)
segmentsSatisfying predicate = concat . map dropSeps . parts predicate
-- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

Jon Fairbairn schrieb:
module Parts (parts, fromParts, contiguousParts, segmentsSatisfying) where
ok, the functions "parts" has added value, but I find its type (using "Either") ugly. parts :: (a -> Bool) -> [a] -> [Either [a] [a]]
import List (groupBy)
parts p = map hack . groupBy sameSide . map (predicateToEither p)
I'ld rather see: parts' :: (a -> Bool) -> [a] -> [[a]] parts' p = groupBy (\ a b -> p a == p b) or: parts'' :: (a -> Bool) -> [a] -> [[(Bool, a)]] parts'' p = groupBy (\ a b -> fst a == fst b) . map (\ a -> (p a, a)) parts' has a simple type and a simple property: id = concat . parts p but the predicate p is needed again to find out the matching parts (or if the alternating list of matching and non-matching sublists starts with a matching or non-matching sublist.) The function parts'' could be refined to: parts''' :: (a -> Bool) -> [a] -> [(Bool, [a])] parts''' p = map (\ l@((b, _) : _) -> (b, map snd l)) . parts'' p
segmentsSatisfying predicate = concat . map dropSeps . parts predicate where dropSeps e = case e of Left x -> map (const []) $ tail x Right r -> [r]
So
lines = segmentsSatisfying (/= '\n')
... but the tail in the definition of segmentsSatisfying makes me uneasy.
The (undocumented) property of groupBy is that all element lists are non-empty!
needing the function `hack` suggests that the definition of parts is written badly
hack (Left x:rest) = Left (x: map (\(Left x) -> x) rest) hack (Right x:rest) = Right (x: map (\(Right x) -> x) rest)
This hack function would make me more uneasy (if it was exported), because it only works on non-empty and "sameSided" lists. The only function that I'm missing is something to manipulate both arguments of a binary function: binComp :: (a -> b) -> (b -> b -> c) -> a -> a -> c binComp f g a b = g (f a) (f b) This would allow to reformulate my above definitions as: parts' p = groupBy (binComp p (==)) parts'' p = groupBy (binComp fst (==)) . map (\ a -> (p a, a)) Furthermore, "binComp" would simplify the manipulation of the compare function. It would make the "comparing" function from http://www.haskell.org/hawiki/ThingsToAvoid obsolete: comparing p x y = compare (p x) (p y) sortBy (comparing (map toLower)) because, one could write: sortBy (binComp (map toLower) compare) Cheers Christian

On Jul 20, 2006, at 1:31 PM, Jon Fairbairn wrote:
On 2006-07-13 at 10:16BST I wrote:
Hooray! I've been waiting to ask "Why aren't we asking what laws hold for these operations?"
Having thought about this for a bit, I've come up with the below. This is intended to give the general idea -- it's not polished code, and I'm not at all wedded to the names I've used, and it almost certainly should be split up.
[snip an interesting new take on splitting strings]
-- Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk
Inspired by this, I've hacked together my own version, based around the ideas of list deforestation. I've taken some liberties with the function names. In particular, I've split Jon's 'parts' function into two pieces, called 'classify' and 'parts'. Code follows. I haven't tested it a lot, but things seem to work in ghci. In particular I've not tested the RULES. I have no idea if this is better or not in terms of performance, but it seems a little cleaner to me. Rob Dockins Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG {-# OPTIONS -fglasgow-exts #-} module Parts where import GHC.Exts import Data.Char data PartList a = LeftPart a (PartList a) | RightPart a (PartList a) | PartNil foldrParts :: (a -> b -> b) -> (a -> b -> b) -> b -> PartList a -> b foldrParts l r n (LeftPart x xs) = l x (foldrParts l r n xs) foldrParts l r n (RightPart x xs) = r x (foldrParts l r n xs) foldrParts l r n PartNil = n buildParts :: (forall b. (a -> b -> b) -> (a -> b -> b) -> b -> b) -> PartList a buildParts g = g LeftPart RightPart PartNil {-# RULES "foldrParts/buildParts" forall l r z (g::forall b. (a->b->b) -> (a-
b->b) -> b -> b) . foldrParts l r z (buildParts g) = g l r z "foldrParts/id" foldrParts LeftPart RightPart PartNil = id #-}
classify :: (a -> Bool) -> [a] -> PartList a classify p zs = buildParts (\l r n -> foldr (\x xs -> if p x then r x xs else l x xs) n zs) parts :: PartList a -> PartList [a] parts = foldrParts fLeft fRight PartNil where fLeft x xs = LeftPart (case xs of (LeftPart y _ ) -> x:y; _ -> x:[]) (case xs of (LeftPart _ ys) -> ys; _ -> xs) fRight x xs = RightPart (case xs of (RightPart y _ ) -> x:y; _ -> x:[]) (case xs of (RightPart _ ys) -> ys; _ -> xs) partsSep :: PartList a -> PartList [a] partsSep = foldrParts fLeft fRight PartNil where fLeft x xs = RightPart [] xs fRight x xs = RightPart (case xs of (RightPart y _ ) -> x:y; _ -> x:[]) (case xs of (RightPart _ ys) -> ys; _ -> xs) fromParts :: PartList a -> [a] fromParts xs = build (\c n -> foldrParts c c n xs) {- loose proof, ignore seq... forall p xs. fromParts (classify p xs) === build (\c n -> foldrParts c c n (classify p xs) (definition of fromParts) === build (\c n -> foldrParts c c n (buildParts (\l r n -
(definition of classify)
foldr (\x xs -> if p x then r x xs else l x xs) n xs))) === build (\c n -> foldr (\x xs -> if p x then c x xs else c x xs) n xs) (deforestation conjecture) === build (\c n -> foldr (\x xs -> c x xs) n xs) (by indifference on p x) === foldr (\x xs -> (:) x xs) [] xs) (definition of build) === foldr (:) [] xs (eta contraction) === xs (well known) -} {- forall p xs. concat (fromParts (parts (classify p xs))) === concat (fromParts (foldrParts fLeft fRigth PartNil (classify p xs))) (definition of parts) === concat (fromParts (foldrParts fLeft fRigth PartNil (buildParts (\l r n -> (definition of classify) foldr (\x xs -> if p x then r x xs else l x xs) n xs))) (deforestation conjecture) === concat (fromParts (foldr (\x xs -> if p x then fRight x xs else fLeft x xs) PartNil xs)) === concat (build (\c n -> foldrParts c c n (foldr (definition of fromParts) (\x xs -> if p x then fRight x xs else fLeft x xs) PartNil xs))) === foldr (++) [] (build (\c n -> foldrParts c c n (foldr (definition of concat) (\x xs -> if p x then fRight x xs else fLeft x xs) PartNil xs))) === foldrParts (++) (++) [] (foldr (list deforestation) (\x xs -> if p x then fRight x xs else fLeft x xs) PartNil xs))) ??? I think the rest of the proof can go through via the approximation lemma, or maybe some more inlining is sufficient... === xs -} splitBy :: (a -> Bool) -> [a] -> [[a]] splitBy p = fromParts . parts . classify p contiguousParts :: (a -> Bool) -> [a] -> [[a]] contiguousParts p xs = build (\c n -> foldrParts (\_ x -> x) c n (parts (classify p xs))) segmentsSatisfying :: (a -> Bool) -> [a] -> [[a]] segmentsSatisfying p = fromParts . partsSep . classify p splitOn :: Eq a => a -> [a] -> [[a]] splitOn x = splitBy (==x) words :: String -> [String] words = contiguousParts isAlphaNum lines :: String -> [String] lines = segmentsSatisfying (/= '\n')
participants (4)
-
Christian Maeder
-
Jon Fairbairn
-
Matthew Pocock
-
Robert Dockins