[egenix-users] RfC: mx.TextTools Non-recursive Implementation
Mike C. Fletcher
mcfletch at rogers.com
Mon Jun 17 02:02:05 CEST 2002
I've been working on a non-recursive implementation of the mx.TextTools
core loop this weekend, and am to the point where I could use some help
in moving it forward. (I'm posting to the users list because I figure
people other than just Marc-André might be interested in hacking on the
package).
As of now, I seem to have a functional, but still very rough, version of
the engine. It passes all of the unit tests I have (including 67 for
the low-level mx.TextTools functionality (excluding Loop commands) and
another 170 for SimpleParse functionality).
Why:
The current loop uses a recursive call of the matching function for
every table/subtable. As a result, we tend to blow up the stack if we
have large numbers of recursive loops. The non-recursive rewrite
attempts to eliminate that problem. This allows a number of common EBNF
grammar structures to be used which are otherwise almost impossible to
support cleanly.
The new loop structure consolidates all the return value and error
reporting code, which substantially reduces the total amount of code in
the module.
The new loop structure should make it very easy to add "user error"
classes, and to report the hierarchy of parents during user-error reporting.
It should make it possible (with some thought-work) to add robust
backtracking support as well (i.e. recording backtracks as frame-stacks
with some extra information for re-starting)). Similarly complex (and
basically the same mechanism) would be adding re-start functionality on
premature EOF cases.
If I'm understanding correctly, it should eventually be possible to
make the code process memory-mapped files (this is actually not specific
to the new code, but it would be nice to have with either version).
The new loop is fairly heavily commented by me, as I was figuring
out what needed to get done and why. It is also more rigid in it's
structure, only allowing individual commands to access a defined set of
variables to signal the result/error-handling clauses. (This results in
a slight slowdown, of course, compared to just setting the global
variables).
Tradeoffs:
The (unoptimised) first version of the code is ~ 75% of the speed
of the 2.1.0b1 version from which it was derived (tested using the
HTML.py script from mx.TextTools). I'd guess that can be improved to 80
or 90%, but I don't think there's any way this approach can be as fast
as the recursive call + direct goto-jumping approach. Why:
It's doing all the work of copying variables into new (heap)
memory manually (versus stack-pushing/call, which I'm assuming is very
fast in C).
Side-effect-encapsulation overhead (See last point above).
It's using a loop, so winds up traversing the loop code on each
iteration, whereas the goto just does a jump.
The code is new (therefore largely untested), and is written by a
first-time C coder (me). I've tried to follow generally clean
programming practices, but it's still going to need review by
experienced C coders (both to check for errors and to optimise it).
There are at least a few bugs in the module (I get a "no current
thread" error, and there's probably one or two ref-count bugs in there).
There is a warning generated that "not all paths return a value" (I
can't figure out which paths those are just from looking).
Needed work:
1) Some sort of documentation and unit tests for the Loop api.
2) Review of the code by experienced C coder(s), particularly for
lost-references, improper memory/pointer usage, and that kind of stuff
common to people who've never dealt with C code before.
3) Optimisation.
4) Expanded testing regiment.
5) Add support for memory-mapped files.
How it Works:
Following is rough pseudo-code for how the new engine works. It's a
simple structure compared to the base code, which defined goto jumps for
error handling, finish and next item, and had the result handling for
each command in the definition of the command itself (with a call to
another function to do the actual appending/calling).
while 1:
while (index_in_table() and returnCode == NULL_CODE):
decode the current table[index]
if childReturnCode is NULL_CODE:
#the current tag is not already processed
reset tag variables
switch( tag command ):
do what tag wants to do()
set tag-related variables
set childReturnCode (tag variable)
if table:
push_frame_stack()
set childReturnCode == PENDING
switch(childReturnCode):
# figure out what to do with child's results
# possibly set table-wide returnValue
childSuccess
append values
update table-wide values
set new index
childFailure
rewind position
set new index
(possibly set table-wide return code)
childError
signal error for whole table
childPending
ignore/continue processing without updating list values
(start a new table)
reset childReturnCode
#done table, figure out what to do now...
if no explicit return value:
figure out implicit
if failure:
truncate result list to previous length
reset position
if error:
report error as exception
exit
else:
if frame_stack():
record table vars as child vars()
pop_frame_stack()
set childReturnCode
else:
return result
Where it is:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/simpleparse/
What's there:
The code is currently just a set of 5 files which replace the file
mxte_impl.h in the 2.1.0 mx.TextTools source tree. Dropping them in
should allow you to compile the new version without changing anything
else (use build --force, of course).
mxte_impl.h --> the loop described above and the (trivial)
implementation of the stack. Quite a few comments. I'm guessing the
stack implementation could be much faster, I just don't know enough
about C to guess how.
*commands.h --> each of the command-families from the original code
broken out into a file to make it easier to work with the core loop.
Each command group has a description of the "contract" it works under in
the system. The commands tend to be much lighter (in terms of amount of
code included) because of the new result-handling framework.
In the SimpleParse package you'll find a tests directory. There
are only a few of those tests still up-to-date (sorry, this directory
was a quick merge of 2 different versions of the project):
mx_* --> low-level direct tests of the primary features of the
engine (these should eventually move to the Egenix package, likely
(they aren't specific to the rewrite, so they should be able to test the
original code as well)). Run mx_test.py to run all of these.
test_objectgenerator.py --> tests the individual components
generated by SimpleParse for use in parsing
test_simpleparsegrammar.py --> tests the SimpleParse
parser-generator (i.e. goes from text definitions to parsers and tests
those parsers).
If there are people interested in working on the code, I'd love a few
eyes to pick it over and/or suggest better approaches. People who have
feedback regarding whether they might find the updated engine useful or
not might want to share their thoughts as well.
With thanks for your time in listening, and for Marc-André's great work
in producing TextTools,
Mike
_______________________________________
Mike C. Fletcher
http://members.rogers.com/mcfletch/
More information about the egenix-users
mailing list