
* Anton Tayanovskyy
So you want to encode priorities efficiently as far as I understand from [1]? Could bit-packing combined with prefix elimination do the trick? Choice boils down to binary choice. Attach a number N to every execution thread that sits in a given NFA state. When the thread moves through a binary choice state, it splits into two threads with N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at the same NFA state, the machine picks one with greater N and discards the other, thus giving priority to early over late and left over right. This makes these numbers grow quickly, but at any point in the simulation one can identify the common binary prefix of all the thread numbers and remove it, because it will not be relevant for comparison. If this is too costly, amortize and do it every K steps instead of on every step.
Anton
[1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help
Just common prefix elimination is not sufficient to put a hard bound on the memory usage. For example, something like /(a*)|((aa)*)/ would still require an amount of space linear in the string length (if the string consists of many a's). More clever techniques that I tried ("monotonically map lists to smaller lists") are hard to make correct. I like the approach of Russ Cox[2]. One of the great ideas there (which I think he didn't emphasize enough) is to have a queue which allows O(1) testing whether an element is already there [3]. This solves the problem with priorities -- the states are guaranteed to be enqueued in the order of their priorities, and there are no repetitions. To be able to process the states in the proper order I'll have to give up Glushkov automaton and probably use something similar to your construction [4]. [2] http://swtch.com/~rsc/regexp/regexp2.html [3] http://code.google.com/p/re2/source/browse/util/sparse_array.h [4] http://t0yv0.blogspot.com/2011/07/combinatory-regular-expressions-in.html -- Roman I. Cheplyaka :: http://ro-che.info/