[egenix-users] SOLUTIONS: Parsing nested strings
Pekka Niiranen
krissepu at vip.fi
Wed Jul 17 00:22:49 CEST 2002
Thank you all for your help and inspiration! It is payback time ;)
I have spent the past two months trying to write a parser that returns
strings delimited by two different characters. The strings can be nested.
I considered a recursive regular-expression approach too slow
and decided to use mxTextTools 2.1 beta2 and the latest alpha of
SimpleParse 2.0.
Below are the three solutions I found.
Note that SimpleParse generates a different tag table from the ones
I wrote by hand.
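
For comparison, here is a minimal plain-Python sketch of the task itself:
collecting every delimited group, nested ones included. It does not use
mxTextTools or SimpleParse and makes no claim to produce the same tag list
as the solutions below; the function name nested_groups and its parameters
are assumptions made for illustration. It may be handy as a slow reference
when checking parser output.

```python
def nested_groups(text, opener="(", closer=")"):
    """Return (depth, content) pairs for every delimited group in text.

    Groups are reported in the order they close, so inner groups come
    before the group that contains them. Depth 1 is the outermost level.
    """
    groups = []
    stack = []  # index just past the opener of each group still open
    for i, ch in enumerate(text):
        if ch == opener:
            stack.append(i + 1)
        elif ch == closer:
            if not stack:
                raise ValueError("unmatched %r at index %d" % (closer, i))
            start = stack.pop()
            groups.append((len(stack) + 1, text[start:i]))
    if stack:
        raise ValueError("unmatched %r at index %d" % (opener, stack[-1] - 1))
    return groups


text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
groups = nested_groups(text)
# groups[0] is (1, 'AA'); the deepest group is (3, 'II')
```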
Further ideas to be implemented:
1) Pass the delimiting characters in as parameters (easy)
2) Unicode support
3) Check that the text contains equal numbers of opening and closing
delimiters before calling the parser
(would this speed up the solution?)
4) Parse one line at a time without looping through the lines of the text
with "while" or "for"
(maybe "None, AllNotIn, '()\n'")
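
Ideas 1 and 3 above could be sketched in plain Python roughly as follows.
This is only an illustration: the function names and the opener/closer
parameters are assumptions, and whether such a pre-check actually saves
time would have to be measured. Note that equal counts alone do not
guarantee correct nesting, so a stricter depth check is shown as well.

```python
def delimiters_balanced(text, opener="(", closer=")"):
    """Cheap pre-check: the text has as many openers as closers."""
    return text.count(opener) == text.count(closer)


def delimiters_nested(text, opener="(", closer=")"):
    """Stricter check: depth never drops below zero and ends at zero."""
    depth = 0
    for ch in text:
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth < 0:
                return False  # a closer appeared before its opener
    return depth == 0


sample = "aa(AA)a((BB))aa"
counts_ok = delimiters_balanced(sample)    # True
nesting_ok = delimiters_nested(sample)     # True
reversed_ok = delimiters_nested(")(")      # False, although the counts match
```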
One development idea for mxTextTools:
1) Instead of using a list of tables to recurse, would it be possible to
use a "global jump" to a point outside the current table?
--- solution 1 starts (with limiting letters)---
from mx.TextTools import *
text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
tables = [] # used for recursion only
tab = ('start',
       # If the next character is "(" then recurse
       (None, Is+LookAhead, '(', +1, 'nesting'),
       # If the current character is ")" then stop or return from recursion
       (None, Is, ')', +1, MatchOk),
       # Match all characters except "(" and ")"
       (None, AllNotIn, '()', 0, 'start'),
       'nesting',
       ('group', SubTable+AppendMatch,
        # Since we have looked ahead, consume the "(" character
        ((None, Is, '(', 0, +1),
         # Recurse
         (None, SubTableInList, (tables, 0)))),
       # After recursion jump back to 'start'
       (None, Jump, To, 'start'))
tables.append(tab) # Add tab to tables
if __name__ == '__main__':
    result, taglist, nextindex = tag(text, tab)
    print taglist
--- solution 1 ends ---
--- solution 2 starts (without limiting letters) ---
from mx.TextTools import *
text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
tab = ('start',
       # When the character ")" is seen, stop the recursion
       (None, Is+LookAhead, ')', +1, MatchOk),
       (None, Is, '(', 'letters', +1),
       # Recurse
       ('group', SubTable+AppendMatch, ThisTable),
       # The last character in the recursion was ")", so skip over it
       # and jump back to 'start'
       (None, Skip, 1, 0, 'start'),
       'letters',
       # Collect all characters except "(" and ")"
       (None, AllNotIn, '()', 0, 'start'))
result, taglist, nextindex = tag(text, tab)
print taglist
--- solution 2 ends ---
--- solution 3 starts (Simpleparse solution) ---
from simpleparse.parser import Parser
from mx.TextTools import *
declaration = r'''
>line< := (a/match)+
match := '(', line, ')'
<a> := -[()]
'''
text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
parser = Parser(declaration)
success, children, nextcharacter = parser.parse(text, production="line")
print_tags(text, children)
--- solution 3 ends ---
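
Both mxTextTools and SimpleParse report matches as nested lists of
(tag, left, right, children) tuples, where left and right are slice indices
into the text. The sketch below shows one way to walk such a tree in plain
Python. The helper name walk is an assumption, and the sample tree is
written by hand for illustration; it is not output captured from any of the
three solutions.

```python
def walk(text, nodes, depth=0):
    """Yield (depth, tag, matched_text) for each node, parents first."""
    for tag, left, right, children in nodes:
        yield depth, tag, text[left:right]
        # children may be None or an empty list when a node has no subtags
        for item in walk(text, children or [], depth + 1):
            yield item


sample_text = "a(B(C))"
# Hand-written tree in the (tag, left, right, children) shape.
# Hypothetical data, NOT real parser output.
sample_tree = [("match", 1, 7, [("match", 3, 6, None)])]

rows = list(walk(sample_text, sample_tree))
# rows == [(0, 'match', '(B(C))'), (1, 'match', '(C)')]
```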
-pekka-