
Hi Patrick,
Is it just me, or is HXT slow? I noticed that both reading a document from a file and running computations are exceedingly slow. Something as simple as 'get the contents of everything with a given class' takes 0.3 seconds for a 400KB HTML file in Python using lxml, but 2 seconds using HXT with tagsoup, compiled with -O2.
The tagsoup parser is currently the slowest parser in HXT. The native one is about twice as fast, but there are still some performance problems due to unwanted laziness. We are working on this.

Usually most of the runtime is spent in parsing, because handling character input is expensive. Traversing a tree and selecting some components is rather efficient compared to parsing.

In the upcoming release there will be a binding to the expat parser via hexpat. The head version is already available on GitHub ( https://github.com/UweSchmidt/hxt ).

When you compare the runtimes of various parsers, please take into account what kind of functionality the parsers provide. If you want a standard-conforming parser, and not just a parser that scans a few angle brackets, you have to do a bit more than read a few chars and check whether they are in a specific char range. These checks and transformations do not come for free.

Cheers,
Uwe
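As an illustration of the kind of per-character check mentioned above, here is a minimal Python sketch of the character range test defined by the `Char` production of the XML 1.0 specification. The function names `is_xml_char` and `first_illegal_char` are made up for this example, and this is not how HXT implements it; it only shows that even the simplest validity check has to touch every character of the input, before any decoding, entity handling, or tree building happens.

```python
# Sketch of the XML 1.0 "Char" production:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The function names are hypothetical and chosen for this example only.

def is_xml_char(cp: int) -> bool:
    """Return True if the code point cp is a legal XML 1.0 character."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(text: str) -> int:
    """Return the index of the first character that is not legal in XML, or -1."""
    for i, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return i
    return -1
```

Calling `first_illegal_char` on a whole document runs the range test once per character, so its cost grows linearly with the file size. A parser that skips such checks will naturally look faster in a benchmark.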