
On Tuesday 05 July 2011, 20:01:26, Tillmann Vogt wrote:
Hi,
For my font library I need A function that can handle ligatures. It can be explained best with an example:
f ["Th", "ff", "fi", "fl", "ffi"] "The fluffiest bunny"
should be evaluated to
["Th", "e", " ", "fl", "u", "ffi", "e", "s", "t", " ", "b", "u", "n", "n", "y" ]
I looked at Data.Text http://hackage.haskell.org/packages/archive/text/0.5/doc/html/Data-Text. html and http://hackage.haskell.org/packages/archive/stringsearch/0.3.3/doc/html/ Data-ByteString-Search.html
but they don't have a function that can search several substrings in one run.
Well, Data.ByteString[.Lazy].KarpRabin does provide simultaneous search for multiple substrings - it does not, however, provide direct splitting. But in my tests, unless the number of substrings was large, multiple separate (Boyer-Moore) passes with manual merging of the offset lists were much faster [there's a possible space leak for lazy ByteStrings if any pattern does not appear in a long substring of the source, so that would be a point for Data.ByteString.Lazy.KarpRabin], and I didn't know of any real use-case, so I did not pursue it further and considered it just an interesting curiosity. I suppose using a regex package as Johannes Waldmann suggested would be the easiest way (probably also faster). If you submit a feature request, however, I would look into expanding the offered functionality (but I'll be on vacation soon, so performance would have to wait; I suppose something could be gained there).
I guess that searching a text again and again for every substring is not very efficient and it can be done in one run.
Well, it is surprisingly efficient, compared to (my implementation of) the Karp-Rabin algorithm at least.
Although I may figure this out myself I think such a function could be so common that someone has done it or can give me some tips.
Thanks
Cheers, Daniel