Hi,

I've got a lot of files which I need to process in order to make them indexable by Sphinx.
The files contain the data of a website with a custom Perl-based CMS. Unfortunately they sometimes contain XML/HTML tags like <i>.

And since most of the texts are in Dutch and some are in French, they also contain a lot of special characters like ë, é, ...

I'm trying to replace the custom Perl-based CMS with a Haskell one, and I would like to add search capability. Since someone wrote Sphinx
bindings a few weeks ago, I thought I'd try that.

But transforming the files into something Sphinx can index seems a challenge. Most special character problems seem to go away when I use encodeString (from Codec.Binary.UTF8.String)
on the indexable data.
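
For what it's worth, this is roughly how I'm applying it (the sample string is just an illustration):

  import Codec.Binary.UTF8.String (encodeString)

  -- encodeString :: String -> String
  -- Expands every character into its UTF-8 byte sequence, so the
  -- result can be written out byte-for-byte as valid UTF-8.
  main :: IO ()
  main = putStr (encodeString "één café, géén probleem")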

But the Sphinx indexer complains that the XML isn't valid. When I look at the errors, this seems to be due to some documents containing HTML that is not well formed.
I would like to use a programmatic solution to this problem.
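
One idea I'm considering is to strip the markup entirely with TagSoup before handing the text to the indexer, since its parser never fails on malformed input. A minimal sketch, assuming it's acceptable to lose the tags for indexing purposes:

  import Text.HTML.TagSoup (innerText, parseTags)

  -- parseTags accepts arbitrarily broken HTML without failing,
  -- and innerText keeps only the plain text between the tags.
  stripMarkup :: String -> String
  stripMarkup = innerText . parseTags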

And is there some Haskell function which converts special tokens like & -> &amp; and é -> &eacute;?
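
In the meantime I hacked together something along these lines; the character table is obviously incomplete, and I use numeric references because named entities like &eacute; would need a DTD in plain XML:

  -- Escape the XML metacharacters plus a couple of the accented
  -- letters common in our Dutch/French texts. Only a starting point.
  escapeXml :: String -> String
  escapeXml = concatMap escape
    where
      escape '&' = "&amp;"
      escape '<' = "&lt;"
      escape '>' = "&gt;"
      escape '"' = "&quot;"
      escape 'é' = "&#233;"   -- é
      escape 'ë' = "&#235;"   -- ë
      escape c   = [c]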

thanks in advance,

Pieter



--
Pieter Laeremans <pieter@laeremans.org>

"The future is here. It's just not evenly distributed yet." W. Gibson