[egenix-users] Is this possible ?
Pekka Niiranen
krissepu at vip.fi
Sat Aug 3 23:50:31 CEST 2002
OK,
I did as you suggested:
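-- code starts --
from mx.TextTools import *

text = "Xaa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aaY"
tables = []  # used for recursion only

tab = ('start',
       (None, AllNotIn, '()', +1),                     # Skip all characters except "(" and ")"
       (None, Is+LookAhead, '(', MatchOk, 'nesting'),  # If next character is not "(" then stop, else recurse
       'nesting',
       ('group', SubTable+AppendMatch,
        ((None, Is, '(', +1),                          # Since we have looked ahead, collect "(" -sign
         (None, SubTableInList, (tables, 0)),          # Recurse
         (None, Is, ')', MatchFail, MatchOk))),        # Group must end with ")" or it fails
       (None, Jump, To, 'start'))                      # After recursion jump back to 'start'

tables.append(tab)  # Add tab to tables

if __name__ == '__main__':
    result, taglist, nextindex = tag(text, tab)
    print taglist
-- code ends --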
One quirk remains (see the code above):
the code stops searching whenever there is an extra ")" sign in the middle of the text.
How can I make the engine return nothing (i.e. an empty match)
if there is an extra ")" sign AND it is not currently recursing?
Should there be a parameter such as "fail if not currently recursing"?
Try adding a ")" sign after the letter X, and then after the letter Y, in the text above. In both cases
the result should be an empty match.
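
To make the wished-for behaviour concrete, here is a minimal sketch in plain Python. It does not use mxTextTools and is not a tag table; the helper name top_level_groups is made up for this illustration, and returning only the outermost groups is an assumption about what "the result" should contain. It returns an empty list as soon as a ")" appears while no group is open, and also when a "(" is never closed.

```python
def top_level_groups(text, open_char="(", close_char=")"):
    """Return (start, end) slices of the outermost groups, or [] if unbalanced."""
    groups = []
    depth = 0
    start = None
    for pos, ch in enumerate(text):
        if ch == open_char:
            if depth == 0:
                start = pos              # an outermost group opens here
            depth += 1
        elif ch == close_char:
            if depth == 0:
                return []                # extra ")" while not nested: empty result
            depth -= 1
            if depth == 0:
                groups.append((start, pos + 1))
    if depth != 0:
        return []                        # unclosed "(": also an empty result
    return groups


text = "Xaa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aaY"
print([text[a:b] for a, b in top_level_groups(text)])
print(top_level_groups(text.replace("X", "X)")))   # extra ")" after X
print(top_level_groups(text.replace("Y", "Y)")))   # extra ")" after Y
```

The last two calls correspond to the two experiments described above: an extra ")" after X and an extra ")" after Y.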
This is a matter of taste, I agree, but then one could always say:
"Nothing was found because the numbers of "(" and ")" signs
do not add up." => one Python error message that
I could print when MatchFail happens => less analysing to do => more speed.
In the code above, only extra "(" signs make the engine fail.
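
The "one error message" idea could look roughly like the following pre-check, run before the text is handed to tag(). This is only a sketch in plain Python under stated assumptions: check_balanced is a hypothetical helper, not part of mxTextTools, the message wording is invented, and the "except ... as" syntax needs a newer Python than 2.2. Whether such a pre-check actually saves time has not been measured here.

```python
def check_balanced(text, open_char="(", close_char=")"):
    """Raise ValueError describing the first imbalance; return None if balanced."""
    depth = 0
    for pos, ch in enumerate(text):
        if ch == open_char:
            depth += 1
        elif ch == close_char:
            depth -= 1
            if depth < 0:
                # A ")" was seen while no "(" was open.
                raise ValueError("extra %r at position %d" % (close_char, pos))
    if depth > 0:
        # The text ended while at least one "(" was still open.
        raise ValueError("%d unclosed %r sign(s)" % (depth, open_char))


for sample in ("aa(AA)aa", "aa)AA(aa", "aa(AA"):
    try:
        check_balanced(sample)
        print("%r: balanced" % sample)
    except ValueError as err:
        print("%r: %s" % (sample, err))
```

Comparing only the total counts of "(" and ")" would accept a text like "aa)AA(aa", which is why the sketch tracks the nesting depth instead.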
-pekka-
"M.-A. Lemburg" wrote:
> Pekka Niiranen wrote:
> > Fine,
> >
> > but the line:
> >
> > (None,EOF,Here,MatchOk)
> >
> > will make text = "aa(AA" match too. If I analysed it correctly,
> > it is because EOF always matches. Would it be possible
> > to add an mxTextTools parameter that makes EOF cause a failure if necessary ?
> >
> > Something like: "if EOF is encountered here, fail the whole subgroup ?"
>
> EOF only matches iff the head position is beyond the right edge
> of the text slice being processed. If you need balanced parens,
> you should rewrite the tag tables so that the nesting table matches
> both the opening and the closing paren.
>
> > -pekka-
> >
> >
> > "M.-A. Lemburg" wrote:
> >
> >
> >>Pekka Niiranen wrote:
> >>
> >>>Hi,
> >>>
> >>> I tried the latest beta 3 by:
> >>>
> >>> a) compiling it myself from sources and
> >>> b) installing from the precompiled package for python v2.2
> >>>
> >>> Of the scripts below only the script that uses Simpleparse returns
> >>>anything.
> >>> The others run without errors, but return [].
> >>>
> >>> They all run OK with the beta 2 though.
> >>
> >>If they did, then you've hit a bug in beta2. Here are the corrected
> >>versions. Note that the problem was with the EOF handling. If AllNotIn
> >>doesn't match at least one char it'll fail and using 0 as jne offset
> >>causes the same effect as MatchFail.
> >>
> >>#--- solution 1 starts (with limiting letters)---
> >>
> >>from mx.TextTools import *
> >>
> > >>def test1():
> > >>
> > >>    text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> > >>
> > >>    tables = [] # used for recursion only
> > >>
> > >>    tab = ('start',
> > >>           (None,Is+LookAhead,'(',+1,'nesting'), # If next character is "(" then recurse
> > >>           (None,Is,')',+1,MatchOk), # If current character is ")" then stop or return from recursion
> > >>           (None,AllNotIn,'()',+1,'start'), # Search all characters except "(" and ")"
> > >>           (None,EOF,Here,MatchOk),
> > >>           'nesting',
> > >>           ('group',SubTable+AppendMatch,
> > >>            ((None,Is,'(',MatchFail,+1), # Since we have looked ahead, collect "(" -sign
> > >>             (None,SubTableInList, (tables,0)), # Recurse
> > >>            )
> > >>           ),
> > >>           (None,Jump,To,'start')) # After recursion jump back to 'start'
> > >>
> > >>    tables.append(tab) # Add tab to tables
> > >>
> > >>    result, taglist, nextindex = tag(text,tab)
> > >>    print result, nextindex
> > >>    print taglist
> >>
> >>#--- solution 2 starts (without limiting letters) ---
> >>
> > >>def test2():
> > >>
> > >>    text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> > >>
> > >>    tab = ('start',
> > >>           (None, Is+LookAhead, ')', +1, MatchOk), # When character ")" is seen stop recursion
> > >>           (None, Is, '(', 'letters', +1),
> > >>           ('group', SubTable+AppendMatch, ThisTable), # Recurse
> > >>           (None, Skip, 1, MatchFail, 'start'), # Last character in recursion was ")" so jump over it back to 'start'
> > >>           'letters',
> > >>           (None, AllNotIn, '()', +1, 'start'), # Collect all characters except "(" and ")"
> > >>           (None, EOF, Here, MatchOk),
> > >>           )
> > >>
> > >>    result,taglist,nextindex = tag(text, tab)
> > >>    print result, nextindex
> > >>    print taglist
> >>
> >>print 'Test 1:'
> >>test1()
> >>print
> >>
> >>print 'Test 2:'
> >>test2()
> >>print
> >>
> >>
> >>> I am using Windows 2000 professional, Python 2.2.1 and Winpython
> >>>v148.
> >>>
> >>>-pekka-
> >>>
> >>>
> >>>Pekka Niiranen wrote:
> >>>
> >>>
> >>>
> >>>>Thank you all for your help and inspiration! It is payback time ;)
> >>>>
> > >>>>For the past two months I have been trying to create a parser that returns
> > >>>>strings delimited by two different letters. The strings can be nested.
> > >>>>I considered recursive calls of a regular expression to be too slow
> > >>>>and decided to use mxTextTools 2.1 beta2 and the latest alpha of
> > >>>>Simpleparse 2.0.
> > >>>>
> > >>>>Below are the three solutions I found.
> > >>>>Note that Simpleparse creates a different tag table from the ones
> > >>>>written "manually".
> >>>>
> >>>>Further ideas to be implemented:
> >>>>
> > >>>>1) Passing the limiting letters in as parameters (easy)
> > >>>>2) Unicode support
> > >>>>3) Testing for an equal number of limiting letters before calling the parser
> > >>>>(will this speed up the solution ?)
> > >>>>4) Parsing one line at a time without looping through the lines of the text
> > >>>>with "while" or "for"
> > >>>> (maybe "None, AllNotIn, '()\n'" )
> > >>>>
> > >>>>One development idea for mxTextTools:
> > >>>>
> > >>>>1) Instead of using a list of tables to recurse, would it be possible to
> > >>>>use a "global jump" to outside of the current table ?
> >>>>
> >>>>--- solution 1 starts (with limiting letters)---
> >>>>
> >>>
> >>>>from mx.TextTools import *
> >>>
> >>>>text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>>>tables = [] # used for recursion only
> >>>>
> > >>>>tab = ('start',
> > >>>>       (None,Is+LookAhead,'(',+1,'nesting'), # If next character is "(" then recurse
> > >>>>       (None,Is,')',+1,MatchOk), # If current character is ")" then stop or return from recursion
> > >>>>       (None,AllNotIn,'()',0,'start'), # Search all characters except "(" and ")"
> > >>>>       'nesting',
> > >>>>       ('group',SubTable+AppendMatch,
> > >>>>        ((None,Is,'(',0,+1), # Since we have looked ahead, collect "(" -sign
> > >>>>         (None,SubTableInList,(tables,0)))), # Recurse
> > >>>>       (None,Jump,To,'start')) # After recursion jump back to 'start'
> >>>>
> >>>>tables.append(tab) # Add tab to tables
> >>>>
> >>>>if __name__ == '__main__':
> >>>>
> >>>> result, taglist, nextindex = tag(text,tab)
> >>>> print taglist
> >>>>
> >>>>--- solution 1 ends ---
> >>>>
> >>>>--- solution 2 starts (without limiting letters) ---
> >>>>
> >>>
> >>>>from mx.TextTools import *
> >>>
> >>>>text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>>>
> > >>>>tab = ('start',
> > >>>>       (None, Is+LookAhead, ')', +1, MatchOk), # When character ")" is seen stop recursion
> > >>>>       (None, Is, '(', 'letters', +1),
> > >>>>       ('group', SubTable+AppendMatch, ThisTable), # Recurse
> > >>>>       (None, Skip, 1, 0, 'start'), # Last character in recursion was ")" so jump over it back to 'start'
> > >>>>       'letters',
> > >>>>       (None, AllNotIn, '()', 0, 'start')) # Collect all characters except "(" and ")"
> >>>>
> >>>>result,taglist,next = tag(text, tab)
> >>>>print taglist
> >>>>
> >>>>--- solution 2 ends ---
> >>>>
> >>>>--- solution 3 starts (Simpleparse solution) ---
> >>>>
> >>>
> >>>>from simpleparse.parser import Parser
> >>>>from mx.TextTools import *
> >>>
> >>>>declaration = r'''
> >>>>
> >>>>
> >>>>>line< := (a/match)+
> >>>>
> >>>>match := '(', line, ')'
> >>>><a> := -[()]
> >>>>'''
> >>>>text = "aa(AA)a((BB))aa((CC)DD)aa(EE(FF))aa(GG(HH(II)JJ)KK)aa"
> >>>>
> >>>>parser = Parser(declaration)
> >>>>success, children, nextcharacter = parser.parse(text, production =
> >>>>"line")
> >>>>print_tags(text,children)
> >>>>
> >>>>--- solution 3 ends ---
> >>>>
> >>>>-pekka-
> >>>
> >>>
> >>>
> >>>_______________________________________________________________________
> >>>eGenix.com User Mailing List http://www.egenix.com/
> >>>http://lists.egenix.com/mailman/listinfo/egenix-users
> >>
> >>--
> >>Marc-Andre Lemburg
> >>CEO eGenix.com Software GmbH
> >>_______________________________________________________________________
> >>eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
> >>Python Consulting: http://www.egenix.com/
> >>Python Software: http://www.egenix.com/files/python/
> >>
> >
> >
> >
>
> --
> Marc-Andre Lemburg
> CEO eGenix.com Software GmbH