Lexical analysis – The tokenizer in our recursive descent parser
example was (for demonstration purposes) overly simple. You can always
write more complex tokenizers by hand. However, for more complex (and
real) tokenizers, you may want to use a tool to build your tokenizer.
In this section we’ll describe Plex and use it to produce a
tokenizer for our recursive descent parser.
You can obtain Plex
athttp://www.cosc.canterbury.ac.nz/~greg/python/Plex/.
In order to use it, you may want to add Plex-1.1.4/Plex to your
PYTHONPATH.
Here is a simple example from the Plex tutorial:
#!/usr/bin/env python # python_201_plex1.py # # Sample Plex lexer #
import sys import Plex def test(infileName): letter =
Plex.Range(“AZaz”) digit = Plex.Range(“09”) name = letter +
Plex.Rep(letter | digit) number = Plex.Rep1(digit) space = Plex.Any(”
tn”) comment = Plex.Str(‘”’) + Plex.Rep( Plex.AnyBut(‘”’)) +
Plex.Str(‘”’) resword = Plex.Str(“if”, “then”, “else”, “end”) lexicon
= Plex.Lexicon([ (resword, ‘keyword’), (name, ‘ident’), (number,
‘int’), ( Plex.Any(“+-*/=
System Message: WARNING/2 (/home/gerard/environments/sphinx-0.5-with-patch/thehazeltree/source/pylessons/python201/5.rst, line 528); backlink
Inline emphasis start-string without end-string.
"), Plex.TEXT),
(space, Plex.IGNORE),
(comment, 'comment'),
])
infile = open(infileName, "r")
scanner = Plex.Scanner(lexicon, infile, infileName)
while 1:
token = scanner.read()
position = scanner.position()
print '(%d, %d) tok: %s tokType: %s' % \
(position[1], position[2], token[1], token[0])
if token[0] is None:
break
USAGE_TEXT = """
Usage: python python_201_plex1.py
"""
def usage():
print USAGE_TEXT
sys.exit(-1)
def main():
args = sys.argv[1:]
if len(args) != 1:
usage()
infileName = args[0]
test(infileName)
if __name__ == '__main__':
main()
#import pdb
#pdb.run('main()')
Download as text (original file name: Examples/python_201_plex1.py).
Comments and explanation:
- Create a lexicon from scanning patterns.
- See the Plex tutorial and reference (and below) for more information
on how to construct the patterns that match various tokens.
- Create a scanner with a lexicon, an input file, and an input file
name.
- The call “scanner.read()” gets the next token. It returns a tuple
containing (1) the token value and (2) the token type.
- The call “scanner.position()” gets the position of the current
token. It returns a tuple containing (1) the input file name, (2) the
line number, and (3) the column number.
And, here are some comments on constructing the patterns used in a
lexicon:
- Range constructs a pattern that matches any character in the
range.
- Rep constructs a pattern that matches a sequence of zero or more
items.
- Rep1 constructs a pattern that matches a sequence ofone or more
items.
- “pat1 + pat2” constructs a pattern that matches a sequence
containing pat1 followed by pat2.
- “pat1 | pat2” constructs a pattern that matches eitherpat1 or pat2.
- Any constructs a pattern that matches any one character in its
argument.
Now let’s revisit our recursive descent parser, this time with a
tokenizer built with Plex. The tokenizer is trivial, but will
serve as an example of how to hook it into a parser.
#!/usr/bin/env python
"""
python_201_rparser_plex.py
A recursive descent parser example.
This example uses Plex to implement a tokenizer.
The grammar:
Prog ::= Command | Command Prog
Command ::= Func_call
Func_call ::= Term '(' Func_call_list ')'
Func_call_list ::= Func_call | Func_call ',' Func_call_list
Term =
"""
import sys, string, types
import getopt
import Plex
## from IPython.Shell import IPShellEmbed
## ipshell = IPShellEmbed((),
## banner = '>>>>>>>> Into IPython >>>>>>>>',
## exit_msg = '<<<<<<<< Out of IPython <<<<<<<<')
#
# Constants
#
# AST node types
NoneNodeType = 0
ProgNodeType = 1
CommandNodeType = 2
FuncCallNodeType = 3
FuncCallListNodeType = 4
TermNodeType = 5
# Token types
NoneTokType = 0
LParTokType = 1
RParTokType = 2
WordTokType = 3
CommaTokType = 4
EOFTokType = 5
# Dictionary to map node type values to node type names
NodeTypeDict = {
NoneNodeType: 'NoneNodeType',
ProgNodeType: 'ProgNodeType',
CommandNodeType: 'CommandNodeType',
FuncCallNodeType: 'FuncCallNodeType',
FuncCallListNodeType: 'FuncCallListNodeType',
TermNodeType: 'TermNodeType',
}
#
# Representation of a node in the AST (abstract syntax tree).
#
class ASTNode:
def __init__(self, nodeType, *args):
self.nodeType = nodeType
self.children = []
for item in args:
self.children.append(item)
def show(self, level):
self.showLevel(level)
print 'Node -- Type %s' % NodeTypeDict[self.nodeType]
level += 1
for child in self.children:
if isinstance(child, ASTNode):
child.show(level)
elif type(child) == types.ListType:
for item in child:
item.show(level)
else:
self.showLevel(level)
print 'Child:', child
def showLevel(self, level):
for idx in range(level):
print ' ',
#
# The recursive descent parser class.
# Contains the "recognizer" methods, which implement the grammar
# rules (above), one recognizer method for each production rule.
#
class ProgParser:
def __init__(self):
pass
def parseFile(self, infileName):
self.tokens = None
self.tokenType = NoneTokType
self.token = ''
self.lineNo = -1
self.infile = file(infileName, 'r')
self.tokens = genTokens(self.infile, infileName)
try:
self.tokenType, self.token, self.lineNo = self.tokens.next()
except StopIteration:
raise RuntimeError, 'Empty file'
result = self.prog_reco()
self.infile.close()
self.infile = None
return result
def parseStream(self, instream):
self.tokens = None
self.tokenType = NoneTokType
self.token = ''
self.lineNo = -1
self.tokens = genTokens(self.instream, '')
try:
self.tokenType, self.token, self.lineNo = self.tokens.next()
except StopIteration:
raise RuntimeError, 'Empty stream'
result = self.prog_reco()
self.infile.close()
self.infile = None
return result
def prog_reco(self):
commandList = []
while 1:
result = self.command_reco()
if not result:
break
commandList.append(result)
return ASTNode(ProgNodeType, commandList)
def command_reco(self):
if self.tokenType == EOFTokType:
return None
result = self.func_call_reco()
return ASTNode(CommandNodeType, result)
def func_call_reco(self):
if self.tokenType == WordTokType:
term = ASTNode(TermNodeType, self.token)
self.tokenType, self.token, self.lineNo = self.tokens.next()
if self.tokenType == LParTokType:
self.tokenType, self.token, self.lineNo = self.tokens.next()
result = self.func_call_list_reco()
if result:
if self.tokenType == RParTokType:
self.tokenType, self.token, self.lineNo = \
self.tokens.next()
return ASTNode(FuncCallNodeType, term, result)
else:
raise ParseError(self.lineNo, 'missing right paren')
else:
raise ParseError(self.lineNo, 'bad func call list')
else:
raise ParseError(self.lineNo, 'missing left paren')
else:
return None
def func_call_list_reco(self):
terms = []
while 1:
result = self.func_call_reco()
if not result:
break
terms.append(result)
if self.tokenType != CommaTokType:
break
self.tokenType, self.token, self.lineNo = self.tokens.next()
return ASTNode(FuncCallListNodeType, terms)
#
# The parse error exception class.
#
class ParseError(Exception):
def __init__(self, lineNo, msg):
RuntimeError.__init__(self, msg)
self.lineNo = lineNo
self.msg = msg
def getLineNo(self):
return self.lineNo
def getMsg(self):
return self.msg
#
# Generate the tokens.
# Usage - example
# gen = genTokens(infile)
# tokType, tok, lineNo = gen.next()
# ...
def genTokens(infile, infileName):
letter = Plex.Range("AZaz")
digit = Plex.Range("09")
name = letter + Plex.Rep(letter | digit)
lpar = Plex.Str('(')
rpar = Plex.Str(')')
comma = Plex.Str(',')
comment = Plex.Str("#") + Plex.Rep(Plex.AnyBut("\n"))
space = Plex.Any(" \t\n")
lexicon = Plex.Lexicon([
(name, 'word'),
(lpar, 'lpar'),
(rpar, 'rpar'),
(comma, 'comma'),
(comment, Plex.IGNORE),
(space, Plex.IGNORE),
])
scanner = Plex.Scanner(lexicon, infile, infileName)
while 1:
tokenType, token = scanner.read()
name, lineNo, columnNo = scanner.position()
if tokenType == None:
tokType = EOFTokType
token = None
elif tokenType == 'word':
tokType = WordTokType
elif tokenType == 'lpar':
tokType = LParTokType
elif tokenType == 'rpar':
tokType = RParTokType
elif tokenType == 'comma':
tokType = CommaTokType
else:
tokType = NoneTokType
tok = token
yield (tokType, tok, lineNo)
def test(infileName):
parser = ProgParser()
#ipshell('(test) #1\nCtrl-D to exit')
result = None
try:
result = parser.parseFile(infileName)
except ParseError, exp:
sys.stderr.write('ParseError: (%d) %s\n' % \
(exp.getLineNo(), exp.getMsg()))
if result:
result.show(0)
USAGE_TEXT = """
Usage:
python python_201_rparser_plex.py [options]
Options:
-h, --help Display this help message.
Example:
python python_201_rparser_plex.py myfile.txt
"""
def usage():
print USAGE_TEXT
sys.exit(-1)
def main():
args = sys.argv[1:]
try:
opts, args = getopt.getopt(args, 'h', ['help'])
except:
usage()
for opt, val in opts:
if opt in ('-h', '--help'):
usage()
if len(args) != 1:
usage()
infileName = args[0]
test(infileName)
if __name__ == '__main__':
main()
#import pdb
#pdb.run('main()')
Download as text (original file name:
Examples/python_201_rparser_plex.py).
And, here is a sample of the data we can apply this parser to:
# Test for recursive descent parser and Plex.
# Command #1
aaa()
# Command #2
bbb (ccc()) # An end of line comment.
# Command #3
ddd(eee(), fff(ggg(), hhh(), iii()))
# End of test
Download as text (original file name:
Examples/python_201_rparser_plex_data.txt).
Comments:
We can now put comments in our input, and they will be ignored.
Comments begin with a “#” and continue to the end of line. See the
definition of comment in functiongenTokens.
This tokenizer does not require us to separate tokens with
whitespace as did the simple tokenizer in the earlier version of our
recursive descent parser.
The changes we made over the earlier version were to:
- Import Plex.
- Replace the definition of the tokenizer functiongenTokens.
- Change the call to genTokens so that the call passes in the file
name, which is needed to create the scanner.
Our new version of genTokens does the following:
- Create patterns for scanning.
- Create a lexicon (an instance of Plex.Lexicon), which uses the
patterns.
- Create a scanner (an instance of Plex.Scanner), which uses the
lexicon.
- Execute a loop that reads tokens (from the scanner) and “yields”
each one.