After a number of algorithmic and implementation optimizations, I was able to get decomposition performance several times better than the C implementation. I have not yet had a chance to fully optimize the compose operations, but even so they are as fast as utf8proc.
I would like to thank Antonio Nikishaev for the Unicode character database parsing code, which I borrowed from the prose library.
-harendra