[Haskell-cafe] Please review my Xapian foreign function interface

18 Feb 2011

Hello!

I've finally come up with some motivation for a project to get my feet
wet using Haskell, and for this little pet project I need an interface
to Xapian. After reading various documents on FFI in general, I've got a
brief working implementation, and I'm now looking for how to better
structure the public API. First, a quick bit of background if you're not
familiar with Xapian.

Xapian is a search engine, and provides a C++ API. You store documents
in a database (handled by Xapian), and index documents by adding terms
to them. Xapian provides stemming algorithms to help generate these
terms from other data. Xapian also has an interface for queries (through
a Xapian::Enquire object), and a query parser that allows natural
language queries to be parsed and run. For more information, you can
check out the API at [1] - it's fairly small.

As Xapian is C++, it seems my best option is to create my own simple C
wrapper, which also lets me tailor my FFI to be easy to use from
Haskell. You can see my C API on GitHub [2] - for now it's very stripped
down; I've been wrapping stuff on a need-to-use basis.

* * *

Currently what I have is functional (in the sense that it works), but
it's extremely tied to I/O and very little of the code is pure. For
example, to create and index a document, you need to do something along
the lines of:

    do document <- newDocument
       setDocumentData document "Document data"
       addPosting document "search_term" 1
       addDocument database document

(Assuming you already have an open database handle). How horribly
imperative this all looks! :-) A document *feels* like it should be
quite pure, however retrieving properties of a document performs
I/O. For example, I'd like to have something like:

    data Document = Document { docData :: String, postings :: [String] }

    do document <- getDocument database 123 -- Get doc #123

and have `document` refer to a pure Document object. I'm still stuck in
the IO monad a bit, but at least I can write pure functions to operate
on `Document` values now. The problem I see with this is that I believe
I'd have to retrieve all parts of the document in my `getDocument`
function (including the data and all postings), and I can't benefit from
being lazy here.
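
One way laziness might be kept here is `unsafeInterleaveIO`, which defers
an IO action until its result is demanded. What follows is only a minimal
sketch under stated assumptions: `fetchData`, `fetchPostings`,
`getDocumentLazy` and `demo` are hypothetical names, the two fetch
functions are plain Haskell stand-ins for the real FFI calls, and the
counter exists only to show when a fetch happens.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- The field is called docData because `data` is a reserved word.
data Document = Document { docData :: String, postings :: [String] }

-- Hypothetical stand-in for the FFI call that reads the document data.
-- It bumps a counter so we can observe when the fetch really runs.
fetchData :: IORef Int -> Int -> IO String
fetchData calls docId = do
  modifyIORef' calls (+ 1)
  return ("Document data for #" ++ show docId)

-- Hypothetical stand-in for the FFI call that reads the postings.
fetchPostings :: IORef Int -> Int -> IO [String]
fetchPostings calls _docId = do
  modifyIORef' calls (+ 1)
  return ["search_term"]

-- Build a Document whose fields are fetched only when first demanded.
getDocumentLazy :: IORef Int -> Int -> IO Document
getDocumentLazy calls docId = do
  d  <- unsafeInterleaveIO (fetchData calls docId)
  ps <- unsafeInterleaveIO (fetchPostings calls docId)
  return (Document d ps)

-- Count the fetches before and after forcing only the docData field.
demo :: IO (Int, Int)
demo = do
  calls  <- newIORef 0
  doc    <- getDocumentLazy calls 123
  before <- readIORef calls
  _      <- evaluate (length (docData doc))
  after  <- readIORef calls
  return (before, after)
```

In this sketch `demo` returns `(0, 1)`: nothing is fetched when the
document is built, and the postings are never fetched because nothing
demands them. The cost is that each fetch runs at an unpredictable
moment, so with a real database handle it could run after the database
has changed or been closed.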
...
From what I gather, all the methods on Xapian documents are lazy (such
as getting the document data, and getting terms associated with
documents), which would mean that my foreign imports would have to be
`IO String`, for example. This tends to fairly quickly cause the IO monad
to propagate everywhere.
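
For illustration, the usual shape of such an import is a raw binding
that returns `IO CString`, plus a small wrapper that marshals the result
into a Haskell `String`. The sketch below uses libc's `getenv` purely as
a stand-in, because it runs without Xapian installed. `lookupEnvVar`,
`countWords` and `envWordCount` are made-up names, and nothing here is
taken from cxapian.h.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.String (CString, peekCString, withCString)
import Foreign.Ptr (nullPtr)

-- Raw binding: the C function hands back a char*, so the result is in IO.
foreign import ccall unsafe "stdlib.h getenv"
  c_getenv :: CString -> IO CString

-- Marshalling wrapper: copy the C string into a Haskell String,
-- mapping a NULL result to Nothing.
lookupEnvVar :: String -> IO (Maybe String)
lookupEnvVar name =
  withCString name $ \cName -> do
    cValue <- c_getenv cName
    if cValue == nullPtr
      then return Nothing
      else Just <$> peekCString cValue

-- A pure function: it knows nothing about where the string came from.
countWords :: String -> Int
countWords = length . words

-- Only this thin outer layer lives in IO; the real work is done by fmap
-- over the pure function.
envWordCount :: String -> IO (Maybe Int)
envWordCount name = fmap (fmap countWords) (lookupEnvVar name)
```

The point of the split is that IO only has to appear at the boundary
where the string is fetched. Everything that operates on the resulting
`String` can stay pure and be lifted with `fmap`.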

* * *

I think that's enough information to explain my current progress, and my
concerns. It could well be that I'm overly worrying about everything
being in the IO monad, but as I said - Haskell is new to me.

All of my work is at [3], and I'd love any advice you have. Haddock
documents have been exported to ocharles.org.uk [4].

Thanks for your time,

Oliver Charles / ocharles

--

[1]: http://xapian.org/docs/apidoc/html/annotated.html
[2]: https://github.com/ocharles/Xapian-Haskell/blob/master/c/cxapian.h
[3]: https://github.com/ocharles/Xapian-Haskell
[4]: http://ocharles.org.uk/tmp/search-xapian/Search-Xapian.html