Searching of several substrings (with Data.Text ?)

Hi, For my font library I need A function that can handle ligatures. It can be explained best with an example: f ["Th", "ff", "fi", "fl", "ffi"] "The fluffiest bunny" should be evaluated to ["Th", "e", " ", "fl", "u", "ffi", "e", "s", "t", " ", "b", "u", "n", "n", "y" ] I looked at Data.Text http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text.html and http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/Data... but they don't have a function that can search several substrings in one run. I guess that searching a text again and again for every substring is not very efficient and it can be done in one run. Although I may figure this out myself I think such a function could be so common that someone has done it or can give me some tips. Thanks

On Tuesday 05 July 2011, 20:01:26, Tillmann Vogt wrote:
Hi,
For my font library I need A function that can handle ligatures. It can be explained best with an example:
f ["Th", "ff", "fi", "fl", "ffi"] "The fluffiest bunny"
should be evaluated to
["Th", "e", " ", "fl", "u", "ffi", "e", "s", "t", " ", "b", "u", "n", "n", "y" ]
I looked at Data.Text http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text. html and http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/ Data-ByteString-Search.html
but they don't have a function that can search several substrings in one run.
Well, Data.ByteString[.Lazy].KarpRabin does provide simultaneous search for multiple substrings - it does not, however, provide direct splitting. But in my tests, unless the number of substrings was large, multiple separate (Boyer-Moore) passes with manual merging of the offset lists were much faster [there's a possible space leak for lazy ByteStrings if any pattern does not appear in a long substring of the source, so that would be a point for Data.ByteString.Lazy.KarpRabin], and I didn't know of any real use-case, so I did not pursue it further and considered it just an interesting curiosity. I suppose using a regex package as Johannes Waldmann suggested would be the easiest way (probably also faster). If you submit a feature request, however, I would look into expanding the offered functionality (but I'll be on vacation soon, so performance would have to wait; I suppose something could be gained there).
I guess that searching a text again and again for every substring is not very efficient and it can be done in one run.
Well, it is surprisingly efficient, compared to (my implementation of) the Karp-Rabin algorithm at least.
Although I may figure this out myself I think such a function could be so common that someone has done it or can give me some tips.
Thanks
Cheers, Daniel

On Tue, Jul 5, 2011 at 11:01 AM, Tillmann Vogt wrote: I looked at Data.Text http://hackage.haskell.org/**
packages/archive/text/0.5/doc/**html/Data-Text.htmlhttp://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text.html
and http://hackage.haskell.org/**packages/archive/stringsearch/**
0.3.3/doc/html/Data-**ByteString-Search.htmlhttp://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/Data... but they don't have a function that can search several substrings in one
run. Here's what you want:
http://hackage.haskell.org/packages/archive/text-icu/0.6.3.4/doc/html/Data-T...

I've been looking into building parsers at runtime (from a config
file), and in my case it's beneficial to fit them into the context of
a larger parser with Attoparsec.Text. This code is untested for
practical use so I doubt you'll see comparable performance to the
aforementioned regex packages, but it could be worth exploring if you
need to mix and match parsers or if the definitions can change
arbitrarily at runtime.
import qualified Data.Text as T
import Data.Attoparsec.Text
import Control.Applicative ((<|>))
parseLigature x = string (T.pack x)
charToText = do c <- anyChar
return (T.singleton c)
buildChain [x] = parseLigature x
buildChain (x:xs) = try (parseLigature x) <|> buildChain xs
-- ordering matters here, so "ffi" comes before "ff" or "fi"
ligatures = buildChain ["ffi", "th", "ff", "fi", "fl"]
myParser = many (try ligatures <|> charToText)
-- at ghci prompt: parseOnly myParser (T.pack "the fluffiest bunny")
-- Right ["th","e"," ","fl","u","ffi","e","s","t"," ","b","u","n","n","y"]
On Tue, Jul 5, 2011 at 12:09 PM, Bryan O'Sullivan
On Tue, Jul 5, 2011 at 11:01 AM, Tillmann Vogt
wrote: I looked at Data.Text http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text.html and http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/Data...
but they don't have a function that can search several substrings in one run.
Here's what you want: http://hackage.haskell.org/packages/archive/text-icu/0.6.3.4/doc/html/Data-T... _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Am 05.07.2011 21:29, schrieb Eric Rasmussen:
I've been looking into building parsers at runtime (from a config file), and in my case it's beneficial to fit them into the context of a larger parser with Attoparsec.Text. This code is untested for practical use so I doubt you'll see comparable performance to the aforementioned regex packages, but it could be worth exploring if you need to mix and match parsers or if the definitions can change arbitrarily at runtime.
import qualified Data.Text as T import Data.Attoparsec.Text import Control.Applicative ((<|>))
parseLigature x = string (T.pack x)
charToText = do c<- anyChar return (T.singleton c)
buildChain [x] = parseLigature x buildChain (x:xs) = try (parseLigature x)<|> buildChain xs
-- ordering matters here, so "ffi" comes before "ff" or "fi" ligatures = buildChain ["ffi", "th", "ff", "fi", "fl"]
myParser = many (try ligatures<|> charToText)
-- at ghci prompt: parseOnly myParser (T.pack "the fluffiest bunny") -- Right ["th","e"," ","fl","u","ffi","e","s","t"," ","b","u","n","n","y"]
Of course parsec! I should have thought of this. icu seems to be the best solution (I already considered it for parsing character references), but it is not so easy to install on windows. So I wait until cabal does this or it is integrated into the haskell platform. Thank you all for your help (especially the attoparsec example)
On Tue, Jul 5, 2011 at 12:09 PM, Bryan O'Sullivan
wrote: On Tue, Jul 5, 2011 at 11:01 AM, Tillmann Vogt
wrote: I looked at Data.Text http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text.html and http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/Data...
but they don't have a function that can search several substrings in one run. Here's what you want: http://hackage.haskell.org/packages/archive/text-icu/0.6.3.4/doc/html/Data-T...
Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (5)
-
Bryan O'Sullivan
-
Daniel Fischer
-
Eric Rasmussen
-
Johannes Waldmann
-
Tillmann Vogt