
Chapter 8. HTML Processing

8.1. Diving in

I often see questions on comp.lang.python (http://groups.google.com/groups?group=comp.lang.python) like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.

Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the doc strings and comments to get an overview of what’s going on. Most of it will seem like black magic, because it’s not obvious how any of these class methods ever get called. Don’t worry, all will be revealed in due time.

Example 8.1. BaseHTMLProcessor.py

If you have not already done so, you can download this and other examples (http://diveintopython.org/download/diveintopython-examples-5.4.zip) used in this book.

from sgmllib import SGMLParser
import htmlentitydefs

class BaseHTMLProcessor(SGMLParser):
    def reset(self):
        # extend (called by SGMLParser.__init__)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # Ideally we would like to reconstruct original tag and attributes, but
        # we may end up quoting attribute values that weren't quoted in the source
        # document, or we may change the type of quotes around the attribute value
        # (single to double quotes).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # Reconstruct the original end tag.
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        self.pieces.append(text)

    def handle_comment(self, text):
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in unknown_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)

Example 8.2. dialect.py

import re
from BaseHTMLProcessor import BaseHTMLProcessor

class Dialectizer(BaseHTMLProcessor):
    subs = ()

    def reset(self):
        # extend (called from __init__ in ancestor)
        # Reset all data attributes
        self.verbatim = 0
        BaseHTMLProcessor.reset(self)

    def start_pre(self, attrs):
        # called for every <pre> tag in HTML source
        # Increment verbatim mode count, then handle tag like normal
        self.verbatim += 1
        self.unknown_starttag("pre", attrs)

    def end_pre(self):
        # called for every </pre> tag in HTML source
        # Decrement verbatim mode count
        self.unknown_endtag("pre")
        self.verbatim -= 1

    def handle_data(self, text):
        # override
        # called for every block of text in HTML source
        # If in verbatim mode, save text unaltered;
        # otherwise process the text with a series of substitutions
        self.pieces.append(self.verbatim and text or self.process(text))

    def process(self, text):
        # called from handle_data
        # Process text block by performing series of regular expression
        # substitutions (actual substitutions are defined in descendant)
        for fromPattern, toPattern in self.subs:
            text = re.sub(fromPattern, toPattern, text)
        return text

class ChefDialectizer(Dialectizer):
    """convert HTML to Swedish Chef-speak

    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
    """
    subs = ((r'a([nu])', r'u\1'),
            (r'A([nu])', r'U\1'),
            (r'a\B', r'e'),
            (r'A\B', r'E'),
            (r'en\b', r'ee'),
            (r'\Bew', r'oo'),
            (r'\Be\b', r'e-a'),
            (r'\be', r'i'),
            (r'\bE', r'I'),
            (r'\Bf', r'ff'),
            (r'\Bir', r'ur'),
            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
            (r'\bow', r'oo'),
            (r'\bo', r'oo'),
            (r'\bO', r'Oo'),
            (r'the', r'zee'),
            (r'The', r'Zee'),
            (r'th\b', r't'),
            (r'\Btion', r'shun'),
            (r'\Bu', r'oo'),
            (r'\BU', r'Oo'),
            (r'v', r'f'),
            (r'V', r'F'),
            (r'w', r'w'),
            (r'W', r'W'),
            (r'([a-z])[.]', r'\1.  Bork Bork Bork!'))

class FuddDialectizer(Dialectizer):
    """convert HTML to Elmer Fudd-speak"""
    subs = ((r'[rl]', r'w'),
            (r'qu', r'qw'),
            (r'th\b', r'f'),
            (r'th', r'd'),
            (r'n[.]', r'n, uh-hah-hah-hah.'))

class OldeDialectizer(Dialectizer):
    """convert HTML to mock Middle English"""
    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
            (r'ick\b', r'yk'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
            (r'([aeiou])re\b', r'\1r'),
            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
            (r'tion\b', r'cioun'),
            (r'ion\b', r'ioun'),
            (r'aid', r'ayde'),
            (r'ai', r'ey'),
            (r'ay\b', r'y'),
            (r'ay', r'ey'),
            (r'ant', r'aunt'),
            (r'ea', r'ee'),
            (r'oa', r'oo'),
            (r'ue', r'e'),
            (r'oe', r'o'),
            (r'ou', r'ow'),
            (r'ow', r'ou'),
            (r'\bhe', r'hi'),
            (r've\b', r'veth'),
            (r'se\b', r'e'),
            (r"'s\b", r'es'),
            (r'ic\b', r'ick'),
            (r'ics\b', r'icc'),
            (r'ical\b', r'ick'),
            (r'tle\b', r'til'),
            (r'll\b', r'l'),
            (r'ould\b', r'olde'),
            (r'own\b', r'oune'),
            (r'un\b', r'onne'),
            (r'rry\b', r'rye'),
            (r'est\b', r'este'),
            (r'pt\b', r'pte'),
            (r'th\b', r'the'),
            (r'ch\b', r'che'),
            (r'ss\b', r'sse'),
            (r'([wybdp])\b', r'\1e'),
            (r'([rnt])\b', r'\1\1e'),
            (r'from', r'fro'),
            (r'when', r'whan'))

def translate(url, dialectName="chef"):
    """fetch URL and translate using dialect

    dialect in ("chef", "fudd", "olde")"""
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()
    parser.feed(htmlSource)
    parser.close()
    return parser.output()

def test(url):
    """test all dialects against URL"""
    for dialect in ("chef", "fudd", "olde"):
        outfile = "%s.html" % dialect
        fsock = open(outfile, "wb")
        fsock.write(translate(url, dialect))
        fsock.close()
        import webbrowser
        webbrowser.open_new(outfile)

if __name__ == "__main__":
    test("http://diveintopython.org/odbchelper_list.html")

Example 8.3. Output of dialect.py

Running this script will translate Section 3.2, “Introducing Lists”, into mock Swedish Chef-speak (../native_data_types/chef.html) (from The Muppets), mock Elmer Fudd-speak (../native_data_types/fudd.html) (from Bugs Bunny cartoons), and mock Middle English (../native_data_types/olde.html) (loosely based on Chaucer’s The Canterbury Tales). If you look at the HTML source of the output pages, you’ll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you’ll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.

<div class="abstract">
<p>Lists awe <span class="application">Pydon</span>'s wowkhowse datatype.
If youw onwy expewience wif wists is awways in
<span class="application">Visuaw Basic</span> ow (God fowbid) de datastowe
in <span class="application">Powewbuiwdew</span>, bwace youwsewf fow
<span class="application">Pydon</span> wists.</p>
</div>
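
To see the text-processing step in isolation, here is a minimal sketch that applies the FuddDialectizer substitutions from dialect.py to a plain string, with no HTML parsing involved. The helper name apply_subs is hypothetical; it only mirrors the loop in Dialectizer.process. The sketch uses print() so that it also runs on Python 3.

```python
import re

# Substitution pairs copied from FuddDialectizer.subs in dialect.py.
FUDD_SUBS = ((r'[rl]', r'w'),
             (r'qu', r'qw'),
             (r'th\b', r'f'),
             (r'th', r'd'),
             (r'n[.]', r'n, uh-hah-hah-hah.'))

def apply_subs(text, subs):
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text

result = apply_subs("Lists are Python's workhorse datatype.", FUDD_SUBS)
print(result)   # Lists awe Pydon's wowkhowse datatype.
```

The order of the pairs matters: the word-final th rule has to run before the general th rule, or it would never find anything left to match.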

8.2. Introducing sgmllib.py

HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.

The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don’t work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.

sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.

SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:

Start tag
An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag’s attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes.
End tag
An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name.
Character reference
An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.
Entity reference
An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
Comment
An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
Processing instruction
An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
Declaration
An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
Text data
A block of text: anything that doesn’t fit into the other 7 categories. When found, SGMLParser calls handle_data with the text.
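
The start-tag lookup rule described above can be sketched with getattr. This is a toy model, not the code in sgmllib.py: the class MiniDispatcher and its dispatch_starttag method are hypothetical names, and the sketch only shows how a method name built from the tag name selects a handler, with unknown_starttag as the fallback.

```python
class MiniDispatcher(object):
    """Toy model of the start-tag lookup rule; not sgmllib's real code."""

    def __init__(self):
        self.calls = []   # record of which handler was chosen

    def dispatch_starttag(self, tag, attrs):
        # Look for start_<tag> first, then do_<tag>.
        method = (getattr(self, "start_" + tag, None)
                  or getattr(self, "do_" + tag, None))
        if method is not None:
            method(attrs)                       # specific handler: attributes only
        else:
            self.unknown_starttag(tag, attrs)   # fallback: tag name and attributes

    def start_pre(self, attrs):
        self.calls.append(("start_pre", attrs))

    def unknown_starttag(self, tag, attrs):
        self.calls.append(("unknown_starttag", tag, attrs))

d = MiniDispatcher()
d.dispatch_starttag("pre", [("class", "screen")])   # handled by start_pre
d.dispatch_starttag("div", [])                      # falls through to unknown_starttag
print(d.calls)
```

This is why a subclass can handle one tag specially just by defining a method with the right name; nothing has to be registered anywhere.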

Important: Language evolution: DOCTYPE. Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1.

sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.

Tip: Specifying command line arguments in Windows. In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.

Example 8.4. Sample test of sgmllib.py

Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven’t downloaded the HTML version of the book, you can do so at http://diveintopython.org/.)

c:\python23\lib> type "c:\downloads\diveintopython\html\toc\index.html"

<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

      <title>Dive Into Python</title>
      <link rel="stylesheet" href="diveintopython.css" type="text/css">

... rest of file omitted for brevity ...

Running this through the test suite of sgmllib.py yields this output:

c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython\html\toc\index.html"
data: '\n\n'
start tag: <html lang="en" >
data: '\n   '
start tag: <head>
data: '\n      '
start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
data: '\n   \n      '
start tag: <title>
data: 'Dive Into Python'
end tag: </title>
data: '\n      '
start tag: <link rel="stylesheet" href="diveintopython.css" type="text/css" >
data: '\n      '

... rest of output omitted for brevity ...

Here’s the roadmap for the rest of the chapter:

  • Subclass SGMLParser to create classes that extract interesting data out of HTML documents.
  • Subclass SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces.
  • Subclass BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags.
  • Subclass Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data.
  • Write a test suite that grabs a real web page from http://diveintopython.org/ and processes it.

Along the way, you’ll also learn about locals, globals, and dictionary-based string formatting.

8.3. Extracting data from HTML documents

To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.

The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.

Example 8.5. Introducing urllib

>>> import urllib                                       (1)
>>> sock = urllib.urlopen("http://diveintopython.org/") (2)
>>> htmlSource = sock.read()                            (3)
>>> sock.close()                                        (4)
>>> print htmlSource                                    (5)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
      <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
   <title>Dive Into Python</title>
<link rel='stylesheet' href='diveintopython.css' type='text/css'>
<link rev='made' href='mailto:mark@diveintopython.org'>
<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
<meta name='description' content='a free Python tutorial for experienced programmers'>
</head>
<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr><td class='header' width='1%' valign='top'>diveintopython.org</td>
<td width='99%' align='right'><hr size='1' noshade></td></tr>
<tr><td class='tagline' colspan='2'>Python&nbsp;for&nbsp;experienced&nbsp;programmers</td></tr>

[...snip...]

  1. The urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages).
  2. The simplest use of urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object.
  3. The simplest thing to do with the file-like object returned by urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list.
  4. When you’re done with the object, make sure to close it, just like a normal file object.
  5. You now have the complete HTML of the home page of http://diveintopython.org/ in a string, and you’re ready to parse it.
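
The point that urlopen returns a file-like object can be tried without a network connection. In this sketch, io.BytesIO stands in for the object returned by urlopen (an assumption made purely for illustration; a real URL object carries extra information such as headers), and the hypothetical helper read_html relies on nothing but read and close.

```python
import io

def read_html(fileobj):
    """Read everything from a file-like object, then close it (hypothetical helper)."""
    try:
        return fileobj.read()
    finally:
        fileobj.close()   # closed even if read() raises

# io.BytesIO is a stand-in for the object urlopen would return.
fake_sock = io.BytesIO(b"<html><head><title>Dive Into Python</title></head></html>")
html_source = read_html(fake_sock)
print(html_source)
print(fake_sock.closed)
```

Because the helper only asks for read and close, it would work the same way with an open file on disk.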

Example 8.6. Introducing urllister.py

If you have not already done so, you can download this and other examples (http://diveintopython.org/download/diveintopython-examples-5.4.zip) used in this book.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):                              (1)
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):                     (2)
        href = [v for k, v in attrs if k=='href'] (3) (4)
        if href:
            self.urls.extend(href)
  1. reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.
  2. start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.
  3. You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension.
  4. String comparisons like k=='href' are always case-sensitive, but that’s safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
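
sgmllib was removed from the standard library in Python 3. As a rough Python 3 counterpart to URLLister, the sketch below uses html.parser.HTMLParser instead. That class has no per-tag start_a hook; it calls a single handle_starttag method for every start tag, so the tag name has to be checked by hand. The class name LinkLister is hypothetical.

```python
from html.parser import HTMLParser

class LinkLister(HTMLParser):
    """Collect href values from <a> tags (Python 3 analogue of URLLister)."""

    def reset(self):
        # reset is called by HTMLParser.__init__, so initialize here.
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lowercased.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href")

parser = LinkLister()
# Uppercase names, an anchor without href, and a bare <a> tag.
parser.feed('<A HREF="toc/index.html">Contents</A> <a name="top"></a> <a>')
parser.close()
print(parser.urls)   # ['toc/index.html']
```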

Example 8.7. Using urllister.py

>>> import urllib, urllister
>>> usock = urllib.urlopen("http://diveintopython.org/")
>>> parser = urllister.URLLister()
>>> parser.feed(usock.read())         (1)
>>> usock.close()                     (2)
>>> parser.close()                    (3)
>>> for url in parser.urls: print url (4)
toc/index.html
#download
#languages
toc/index.html
appendix/history.html
download/diveintopython-html-5.0.zip
download/diveintopython-pdf-5.0.zip
download/diveintopython-word-5.0.zip
download/diveintopython-text-5.0.zip
download/diveintopython-html-flat-5.0.zip
download/diveintopython-xml-5.0.zip
download/diveintopython-common-5.0.zip

... rest of output omitted for brevity ...

  1. Call the feed method, defined in SGMLParser, to get HTML into the parser.[1] It takes a string, which is what usock.read() returns.
  2. Like files, you should close your URL objects as soon as you’re done with them.
  3. You should close your parser object, too, but for a different reason. You’ve read all the data and fed it to the parser, but the feed method isn’t guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed.
  4. Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)
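
The reason for calling close is easier to see when the HTML arrives in pieces. The sketch below is written for Python 3 and uses html.parser.HTMLParser as a stand-in for SGMLParser (sgmllib does not exist in Python 3); the small LinkLister class is hypothetical. The first chunk ends in the middle of a tag, so the parser cannot act on it until more data arrives.

```python
from html.parser import HTMLParser

class LinkLister(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def reset(self):
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href")

chunks = ('<p>See the <a hre',                                 # ends mid-tag
          'f="toc/index.html">table of contents</a>.</p>')    # completes it

parser = LinkLister()
for chunk in chunks:
    parser.feed(chunk)   # feed may hold back data it cannot parse yet
parser.close()           # force anything still buffered to be processed
print(parser.urls)
```

Reading parser.urls before close is not wrong, but only after close is the list guaranteed to reflect everything that was fed in.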

8.4. Introducing BaseHTMLProcessor.py

SGMLParser doesn’t produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it finds, but the methods don’t do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you’ll take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.

BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.

Example 8.8. Introducing BaseHTMLProcessor

class BaseHTMLProcessor(SGMLParser):
    def reset(self):                        (1)
        self.pieces = []
        SGMLParser.reset(self)

    def unknown_starttag(self, tag, attrs): (2)
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

    def unknown_endtag(self, tag):          (3)
        self.pieces.append("</%(tag)s>" % locals())

    def handle_charref(self, ref):          (4)
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):        (5)
        self.pieces.append("&%(ref)s" % locals())
        if htmlentitydefs.entitydefs.has_key(ref):
            self.pieces.append(";")

    def handle_data(self, text):            (6)
        self.pieces.append(text)

    def handle_comment(self, text):         (7)
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):              (8)
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        self.pieces.append("<!%(text)s>" % locals())
  1. reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of the HTML document you’re constructing. Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python is much more efficient at dealing with lists.[2]

  2. Since BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you’ll untangle that (and also the odd-looking locals function) later in this chapter.

  3. Reconstructing end tags is much simpler; just take the tag name and wrap it in the </...> brackets.

  4. When SGMLParser finds a character reference, it calls handle_charref with the bare reference. If the HTML document contains the reference &#160;, ref will be 160. Reconstructing the original complete character reference just involves wrapping ref in &#...; characters.

  5. Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference requires wrapping ref in &...; characters. (Actually, as an erudite reader pointed out to me, it’s slightly more complicated than this. Only certain standard HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.)

  6. Blocks of text are simply appended to self.pieces unaltered.

  7. HTML comments are wrapped in <!--...--> characters.

  8. Processing instructions are wrapped in <?...> characters.

    Important: Processing HTML with embedded script. The HTML specification requires that all non-HTML content (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don’t). BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes. SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments.
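
For readers on Python 3, where sgmllib is no longer available, the same producer idea can be sketched on top of html.parser.HTMLParser. This is an analogue under stated assumptions, not a port of BaseHTMLProcessor: the class name RoundTripProcessor is hypothetical, the handler names are those of HTMLParser (handle_starttag and handle_endtag rather than unknown_starttag and unknown_endtag), and entity references are always closed with a semicolon, which is a simplification of the htmlentitydefs check above.

```python
from html.parser import HTMLParser

class RoundTripProcessor(HTMLParser):
    """Rebuild the HTML it is fed (Python 3 analogue of BaseHTMLProcessor)."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#160; as separate events
        # instead of folding them into the surrounding text.
        HTMLParser.__init__(self, convert_charrefs=False)

    def reset(self):
        self.pieces = []
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        # A valueless attribute such as "noshade" arrives with value None.
        strattrs = "".join(
            ' %s' % key if value is None else ' %s="%s"' % (key, value)
            for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)   # simplification: always add ";"

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)

    def output(self):
        """Return processed HTML as a single string"""
        return "".join(self.pieces)

source = ('<!DOCTYPE html><html><body class="main"><!-- note -->'
          '<p>A &amp; B&#160;</p></body></html>')
parser = RoundTripProcessor()
parser.feed(source)
parser.close()
print(parser.output() == source)
```

The round trip is exact here only because the sample already uses lowercase tags and double-quoted attribute values; input that differs in case or quoting comes back normalized, as the note above warns.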

Example 8.9. BaseHTMLProcessor output

def output(self):               (1)
    """Return processed HTML as a single string"""
    return "".join(self.pieces) (2)
  1. This is the one method in BaseHTMLProcessor that is never called by the ancestor SGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it.
  2. If you prefer, you could use the join method of the string module instead: string.join(self.pieces, "")
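
The accumulate-then-join pattern used by reset, the handler methods, and output can be reduced to a few lines. This is only a sketch of the pattern with made-up tag names, written so that it runs on Python 3 as well.

```python
pieces = []
for tag in ("html", "body", "p"):
    pieces.append("<%s>" % tag)   # each handler call adds one small string

html = "".join(pieces)            # the complete string is built once, at the end
print(html)                       # <html><body><p>
```

Note that the string.join function mentioned above exists only in Python 2; the "".join(...) method form works in both Python 2 and Python 3.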

Further reading

  • W3C (http://www.w3.org/) discusses character and entity references (http://www.w3.org/TR/REC-html40/charset.html#entities).
  • Python Library Reference (http://www.python.org/doc/current/lib/) confirms your suspicions that the htmlentitydefs module (http://www.python.org/doc/current/lib/module-htmlentitydefs.html) is exactly what it sounds like.

8.5. locals and globals

Let’s digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.

Remember locals? You first saw it here:

def unknown_starttag(self, tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    self.pieces.append("<%(tag)s%(strattrs)s>" % locals())

No, wait, you can’t learn about locals yet. First, you need to learn about namespaces. This is dry stuff, but it’s important, so pay attention.

Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you’ll see in a minute.

At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which keeps track of the function’s variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module’s variables, including functions, classes, any other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions.

When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:

  1. local namespace - specific to the current function or class method. If the function defines a local variable x, or has an argument x, Python will use this and stop searching.
  2. global namespace - specific to the current module. If the module has defined a variable, function, or class called x, Python will use that and stop searching.
  3. built-in namespace - global to all modules. As a last resort, Python will assume that x is the name of a built-in function or variable.

If Python doesn’t find x in any of these namespaces, it gives up and raises a NameError with the message There is no variable named ‘x’, which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn’t appreciate how much work Python was doing before giving you that error.
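
The search order can be watched in action with a short sketch. All the function names here are hypothetical, and the code is written to run on Python 3 (it uses print() and the except ... as form).

```python
x = "global x"                # lives in the module's (global) namespace

def show_local():
    x = "local x"             # the local namespace is searched first
    return x

def show_global():
    return x                  # no local x, so the module-level one is found

def show_builtin():
    return len("abc")         # len is neither local nor global; it is built in

def show_missing():
    try:
        return no_such_name   # found in none of the three namespaces
    except NameError as err:
        return "NameError: %s" % err

print(show_local())
print(show_global())
print(show_builtin())
print(show_missing())
```

The exact wording of the NameError message depends on the Python version, which is why the sketch prints whatever message the interpreter supplies.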

Important: Language evolution: nested scopes. Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambda function, Python will search for that variable in the current (nested or lambda) function’s namespace, then in the module’s namespace. Python 2.2 will search for the variable in the current (nested or lambda) function’s namespace, then in the parent function’s namespace, then in the module’s namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2: from __future__ import nested_scopes
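
A small sketch of what nested scopes make possible, assuming Python 2.2 or later (it also runs on Python 3). The names make_greeter and greet are hypothetical.

```python
def make_greeter(greeting):
    def greet(name):
        # greeting is not local to greet and not a module-level name;
        # it is found in the enclosing function's namespace.
        return "%s, %s!" % (greeting, name)
    return greet

hello = make_greeter("Hello")
print(hello("world"))   # Hello, world!
```

Under the older lookup rule, greet would have skipped straight from its own namespace to the module's and failed to find greeting.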

Are you confused yet? Don’t despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function.

Example 8.10. Introducing locals

>>> def foo(arg): (1)
...     x = 1
...     print locals()
...
>>> foo(7)        (2)
{'arg': 7, 'x': 1}
>>> foo('bar')    (3)
{'arg': 'bar', 'x': 1}
  1. The function foo has two variables in its local namespace: arg, whose value is passed in to the function, and x, which is defined within the function.
  2. locals returns a dictionary of name/value pairs. The keys of this dictionary are the names of the variables as strings; the values of the dictionary are the actual values of the variables. So calling foo with 7 prints the dictionary containing the function’s two local variables: arg (7) and x (1).
  3. Remember, Python has dynamic typing, so you could just as easily pass a string in for arg; the function (and the call to locals) would still work just as well. locals works with all variables of all datatypes.

What locals does for the local (function) namespace, globals does for the global (module) namespace. globals is more exciting, though, because a module’s namespace is more exciting.[3] Not only does the module’s namespace include module-level variables and constants, it includes all the functions and classes defined in the module. Plus, it includes anything that was imported into the module.

Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access any of its functions or attributes: module.function. But with from module import, you’re actually importing specific functions and attributes from another module into your own namespace, which is why you access them directly without referencing the original module they came from. With the globals function, you can actually see this happen.

Example 8.11. Introducing globals

Look at the following block of code at the bottom of BaseHTMLProcessor.py:

if __name__ == "__main__":
    for k, v in globals().items():             (1)
        print k, "=", v
  1. Just so you don’t get intimidated, remember that you’ve seen all this before. The globals function returns a dictionary, and you’re iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function.

Now running the script from the command line gives this output (note that your output may be slightly different, depending on your platform and where you installed Python):

c:\docbook\dip\py> python BaseHTMLProcessor.py
SGMLParser = sgmllib.SGMLParser                (1)
htmlentitydefs = <module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'> (2)
BaseHTMLProcessor = __main__.BaseHTMLProcessor (3)
__name__ = __main__                            (4)
... rest of output omitted for brevity...
  1. SGMLParser was imported from sgmllib, using from module import. That means that it was imported directly into the module’s namespace, and here it is.

  2. Contrast this with htmlentitydefs, which was imported using import. That means that the htmlentitydefs module itself is in the namespace, but the entitydefs variable defined within htmlentitydefs is not.

  3. This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class.

  4. Remember the if __name__ trick? When running a module (as opposed to importing it from another module), the built-in __name__ attribute is a special value, __main__. Since you ran this module as a script from the command line, __name__ is __main__, which is why the little test code to print the globals got executed.

    Note: Accessing variables dynamically. Using the locals and globals functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors the functionality of the getattr function, which allows you to access arbitrary functions dynamically by providing the function name as a string.
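As a small sketch of that parallel, the following looks up a module-level variable, a local variable, and a module function, each by name. The names threshold, answer, lookup_global, and lookup_local are invented for this illustration; it assumes the code runs as an ordinary script or module.

```python
import math

threshold = 42

def lookup_global(name):
    # globals() is a dictionary keyed by variable name.
    return globals()[name]

def lookup_local(name):
    answer = "local value"
    # locals() holds this function's variables: name and answer.
    return locals()[name]

by_name = lookup_global("threshold")     # the same object as threshold
local_by_name = lookup_local("answer")   # the value of the local variable
sqrt_by_name = getattr(math, "sqrt")     # the same object as math.sqrt
```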

There is one other important difference between the locals and globals functions, which you should learn now before it bites you. It will bite you anyway, but at least then you’ll remember learning it.

Example 8.12. locals is read-only, globals is not

def foo(arg):
    x = 1
    print locals()    (1)
    locals()["x"] = 2 (2)
    print "x=",x      (3)

z = 7
print "z=",z
foo(3)
globals()["z"] = 8    (4)
print "z=",z          (5)
  1. Since foo is called with 3, this will print {'arg': 3, 'x': 1}. This should not be a surprise.
  2. locals is a function that returns a dictionary, and here you are setting a value in that dictionary. You might think that this would change the value of the local variable x to 2, but it doesn’t. locals does not actually return the local namespace, it returns a copy. So changing it does nothing to the value of the variables in the local namespace.
  3. This prints x= 1, not x= 2.
  4. After being burned by locals, you might think that this wouldn’t change the value of z, but it does. Due to internal differences in how Python is implemented (which I’d rather not go into, since I don’t fully understand them myself), globals returns the actual global namespace, not a copy: the exact opposite behavior of locals. So any changes to the dictionary returned by globals directly affect your global variables.
  5. This prints z= 8, not z= 7.

8.6. Dictionary-based string formatting

Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple values are being inserted. You can’t simply scan through the string in one pass and understand what the result will be; you’re constantly switching between reading the string and reading the tuple of values.

There is an alternative form of string formatting that uses dictionaries instead of tuples of values.

Example 8.13. Introducing dictionary-based string formatting

>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
>>> "%(pwd)s" % params                                    (1)
'secret'
>>> "%(pwd)s is not a good password for %(uid)s" % params (2)
'secret is not a good password for sa'
>>> "%(database)s of mind, %(database)s of body" % params (3)
'master of mind, master of body'
  1. Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker.
  2. Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the formatting will fail with a KeyError.
  3. You can even specify the same key twice; each occurrence will be replaced with the same value.
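To see the failure case mentioned in the second note, here is a short sketch that leaves the pwd key out of the dictionary on purpose. The connection string layout is only an illustration, not something BaseHTMLProcessor.py uses.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa"}

# Every named marker has a matching key, so this works.
connection = "server=%(server)s;database=%(database)s;uid=%(uid)s" % params

try:
    "%(pwd)s" % params          # there is no "pwd" key in this dictionary
except KeyError as missing:
    # The exception carries the name of the key that was not found.
    missing_key = missing.args[0]
```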

So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys and values simply to do string formatting in the next line; it’s really most useful when you happen to have a dictionary of meaningful keys and values already. Like locals.

Example 8.14. Dictionary-based string formatting in BaseHTMLProcessor.py

def handle_comment(self, text):
    self.pieces.append("<!--%(text)s-->" % locals()) (1)
  1. Using the built-in locals function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is 'Begin page footer', the string formatting "<!--%(text)s-->" % locals() will resolve to the string '<!--Begin page footer-->'.

Example 8.15. More dictionary-based string formatting

def unknown_starttag(self, tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) (1)
    self.pieces.append("<%(tag)s%(strattrs)s>" % locals())                      (2)
  1. When this method is called, attrs is a list of key/value tuples, just like the items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there’s a lot going on here, so let’s break it down:

     a. Suppose attrs is [('href', 'index.html'), ('title', 'Go to home page')].
     b. In the first round of the list comprehension, key will get 'href', and value will get 'index.html'.
     c. The string formatting ' %s="%s"' % (key, value) will resolve to ' href="index.html"'. This string becomes the first element of the list comprehension’s return value.
     d. In the second round, key will get 'title', and value will get 'Go to home page'.
     e. The string formatting will resolve to ' title="Go to home page"'.
     f. The list comprehension returns a list of these two resolved strings, and strattrs will join both elements of this list together to form ' href="index.html" title="Go to home page"'.

  2. Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces.

    Important: Performance issues with locals. Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a slight performance hit in making the call to locals, since locals builds a copy of the local namespace.
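If that cost ever matters to you, one option is to skip locals and hand the formatting operator exactly the values it needs. The sketch below shows the locals version next to two alternatives; the three function names are hypothetical and are not part of BaseHTMLProcessor.py. Whether the difference is measurable in your program is something to time rather than assume.

```python
def start_tag_with_locals(tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # locals() builds a dictionary of every local variable in this function.
    return "<%(tag)s%(strattrs)s>" % locals()

def start_tag_with_tuple(tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # Plain tuple formatting: positional markers, no dictionary at all.
    return "<%s%s>" % (tag, strattrs)

def start_tag_with_dict(tag, attrs):
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    # Named markers, but only the two names the string actually uses.
    return "<%(tag)s%(strattrs)s>" % {"tag": tag, "strattrs": strattrs}

attrs = [("href", "index.html"), ("title", "Go to home page")]
results = [build("a", attrs) for build in
           (start_tag_with_locals, start_tag_with_tuple, start_tag_with_dict)]
```

All three produce the same string, so the choice is between readability and a small amount of extra work per call.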

8.7. Quoting attribute values

A common question on comp.lang.python (http://groups.google.com/groups?group=comp.lang.python) is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor.

BaseHTMLProcessor consumes HTML (since it’s descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes at all. It is this last side effect that you can take advantage of.

Example 8.16. Quoting attribute values

>>> htmlSource = """        (1)
...     <html>
...     <head>
...     <title>Test page</title>
...     </head>
...     <body>
...     <ul>
...     <li><a href=index.html>Home</a></li>
...     <li><a href=toc.html>Table of contents</a></li>
...     <li><a href=history.html>Revision history</a></li>
...     </body>
...     </html>
...     """
>>> from BaseHTMLProcessor import BaseHTMLProcessor
>>> parser = BaseHTMLProcessor()
>>> parser.feed(htmlSource) (2)
>>> print parser.output()   (3)
<html>
<head>
<title>Test page</title>
</head>
<body>
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="toc.html">Table of contents</a></li>
<li><a href="history.html">Revision history</a></li>
</body>
</html>
  1. Note that the attribute values of the href attributes in the <a> tags are not properly quoted. (Also note that you’re using triple quotes for something other than a doc string. And directly in the IDE, no less. They’re very useful.)
  2. Feed the parser.
  3. Using the output function defined in BaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think about how much has actually happened here: SGMLParser parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; BaseHTMLProcessor used those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you called parser.output, which joined all the pieces of HTML into one string.
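The sgmllib module was removed in Python 3, so the session above only runs on Python 2. If you want to try the same idea on a current Python, here is a rough stand-in built on html.parser.HTMLParser from the standard library. It is a sketch, not the book's code: the class name QuotingHTMLProcessor is made up, it only handles tags, text, and character references, and it does not re-escape attribute values that contain entities or double quotes.

```python
from html.parser import HTMLParser

class QuotingHTMLProcessor(HTMLParser):
    """Hypothetical Python 3 stand-in for BaseHTMLProcessor."""

    def __init__(self):
        # convert_charrefs=False routes &amp; and &#160; to their own
        # handlers so they can be written back unchanged.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # A value of None means the attribute had no value, like <input disabled>.
        strattrs = "".join(
            " %s" % key if value is None else ' %s="%s"' % (key, value)
            for key, value in attrs)
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def output(self):
        return "".join(self.pieces)

parser = QuotingHTMLProcessor()
parser.feed("<UL><LI><A HREF=index.html>Home</A></LI></UL>")
parser.close()
quoted = parser.output()
```

As with BaseHTMLProcessor, tag and attribute names come back in lowercase and the attribute value comes back in double quotes.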

8.8. Introducing dialect.py

Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.

To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.

Example 8.17. Handling specific tags

def start_pre(self, attrs):             (1)
    self.verbatim += 1                  (2)
    self.unknown_starttag("pre", attrs) (3)

def end_pre(self):                      (4)
    self.unknown_endtag("pre")          (5)
    self.verbatim -= 1                  (6)
  1. start_pre is called every time SGMLParser finds a <pre> tag in the HTML source. (In a minute, you’ll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, the same form that unknown_starttag takes.
  2. In the reset method, you initialize a data attribute that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you’ll decrement the counter. (You could just use this as a flag and set it to 1 and reset it to 0, but it’s just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.) In a minute, you’ll see how this counter is put to good use.
  3. That’s it, that’s the only special processing you do for <pre> tags. Now you pass the list of attributes along to unknown_starttag so it can do the default processing.
  4. end_pre is called every time SGMLParser finds a </pre> tag. Since end tags cannot contain attributes, the method takes no parameters.
  5. First, you want to do the default processing, just like any other end tag.
  6. Second, you decrement your counter to signal that this <pre> block has been closed.

At this point, it’s worth digging a little further into SGMLParser. I’ve claimed repeatedly (and you’ve taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it’s not magic, it’s just good Python coding.

Example 8.18. SGMLParser

def finish_starttag(self, tag, attrs):               (1)
    try:
        method = getattr(self, 'start_' + tag)       (2)
    except AttributeError:                           (3)
        try:
            method = getattr(self, 'do_' + tag)      (4)
        except AttributeError:
            self.unknown_starttag(tag, attrs)        (5)
            return -1
        else:
            self.handle_starttag(tag, method, attrs) (6)
            return 0
    else:
        self.stack.append(tag)
        self.handle_starttag(tag, method, attrs)
        return 1                                     (7)

def handle_starttag(self, tag, method, attrs):
    method(attrs)                                    (8)
  1. At this point, SGMLParser has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).
  2. The “magic” of SGMLParser is nothing more than your old friend, getattr. What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself. Here the object is self, the current instance. So if tag is ‘pre’, this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
  3. getattr raises an AttributeError if the method it’s looking for doesn’t exist in the object (or any of its descendants), but that’s okay, because you wrapped the call to getattr inside a try...except block and explicitly caught the AttributeError.
  4. Since you didn’t find a start_xxx method, you’ll also look for a do_xxx method before giving up. This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag. But you can use either naming scheme; as you can see, SGMLParser tries both for every tag. (You shouldn’t define both a start_xxx and do_xxx handler method for the same tag, though; only the start_xxx method will get called.)
  5. Another AttributeError, which means that the call to getattr failed with do_xxx. Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag.
  6. Remember, try...except blocks can have an else clause, which is called if no exception is raised during the try...except block. Logically, that means that you did find a do_xxx method for this tag, so you’re going to call it.
  7. By the way, don’t worry about these different return values; in theory they mean something, but they’re never actually used. Don’t worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn’t do anything with this information either. In theory, you could use this module to validate that your tags were fully balanced, but it’s probably not worth it, and it’s beyond the scope of this chapter. You have better things to worry about right now.
  8. start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don’t need that level of control, so you just let this method do its thing, which is to call the method (start_xxx or do_xxx) with the list of attributes. Remember, method is a function, returned from getattr, and functions are objects. (I know you’re getting tired of hearing it, and I promise I’ll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and this method turns around and calls the function. At this point, you don’t need to know what the function is, what it’s named, or where it’s defined; the only thing you need to know about the function is that it is called with one argument, attrs.
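You can try the same dispatch idea without sgmllib at all. The sketch below is an illustration only, not the real SGMLParser: the classes TinyDispatcher and PreAware are invented, and it uses the three-argument form of getattr (which returns a default instead of raising AttributeError) in place of the nested try...except blocks.

```python
class TinyDispatcher:
    """Illustration only; not the real SGMLParser."""

    def __init__(self):
        self.log = []

    def finish_starttag(self, tag, attrs):
        # Same lookup order as the excerpt: start_xxx, then do_xxx, then the default.
        method = (getattr(self, "start_" + tag, None)
                  or getattr(self, "do_" + tag, None))
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            # method is a bound method object; call it like any function.
            method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown:" + tag)

class PreAware(TinyDispatcher):
    # The base class never mentions these names; getattr finds them
    # because self is an instance of this descendant.
    def start_pre(self, attrs):
        self.log.append("start_pre")

    def do_br(self, attrs):
        self.log.append("do_br")

handler = PreAware()
for tag in ("pre", "br", "p"):
    handler.finish_starttag(tag, [])
```

After the loop, handler.log records one specific handler for pre, one for br, and the fallback for p.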

Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There’s only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override the handle_data method.

Example 8.19. Overriding the handle_data method

def handle_data(self, text):                                         (1)
    self.pieces.append(self.verbatim and text or self.process(text)) (2)
  1. handle_data is called with only one argument, the text to process.
  2. In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you’re in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick.
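Here is a self-contained sketch of the counter and the one-liner working together. The class VerbatimDemo and its process method (which just uppercases the text) are stand-ins for Dialectizer and its real substitutions. The second method shows the same decision written with a conditional expression, available since Python 2.5, which avoids the one weak spot of the and-or trick: when text is an empty string, and-or falls through to process(text) even inside a <pre> block.

```python
class VerbatimDemo:
    """Stand-in for Dialectizer; process() is a placeholder substitution."""

    def __init__(self):
        self.verbatim = 0
        self.pieces = []

    def process(self, text):
        return text.upper()

    def handle_data(self, text):
        # The and-or trick, as in the example above.
        self.pieces.append(self.verbatim and text or self.process(text))

    def handle_data_explicit(self, text):
        # The same decision as a conditional expression.
        self.pieces.append(text if self.verbatim else self.process(text))

demo = VerbatimDemo()
demo.handle_data("outside")          # verbatim is 0, so the text is processed
demo.verbatim += 1                   # what start_pre does
demo.handle_data("inside pre")       # copied through unaltered
demo.verbatim -= 1                   # what end_pre does
demo.handle_data_explicit("after")   # processed again
```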

You’re close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don’t really want to slog through regular expressions again, do you? God knows I don’t. I think you’ve learned enough for one chapter.

8.9. Putting it all together

It’s time to put everything you’ve learned so far to good use. I hope you were paying attention.

Example 8.20. The translate function, part 1

def translate(url, dialectName="chef"): (1)
    import urllib                       (2)
    sock = urllib.urlopen(url)          (3)
    htmlSource = sock.read()
    sock.close()
  1. The translate function has an optional argument dialectName, which is a string that specifies the dialect you’ll be using. You’ll see how this is used in a minute.
  2. Hey, wait a minute, there’s an import statement in this function! That’s perfectly legal in Python. You’re used to seeing import statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function. If you have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you’ll appreciate this.)
  3. Now you get the source of the given URL.

Example 8.21. The translate function, part 2: curiouser and curiouser

parserName = "%sDialectizer" % dialectName.capitalize() (1)
parserClass = globals()[parserName]                     (2)
parser = parserClass()                                  (3)
  1. capitalize is a string method you haven’t seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase. Combined with some string formatting, you’ve taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string ‘chef’, parserName will be the string ‘ChefDialectizer’.
  2. You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string ‘ChefDialectizer’, parserClass will be the class ChefDialectizer.
  3. Finally, you have a class object (parserClass), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.
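The three lines are easy to try in isolation. In this sketch the classes are empty stand-ins (the real ones in dialect.py descend from BaseHTMLProcessor), and make_parser is a hypothetical helper that wraps just this part of translate. It assumes the classes live in the same module as the function, since that is the namespace globals() returns.

```python
class Dialectizer:
    """Empty stand-in for the real base class."""

class ChefDialectizer(Dialectizer):
    pass

class FooDialectizer(Dialectizer):
    pass

def make_parser(dialectName="chef"):
    # "chef" -> "Chef" -> "ChefDialectizer"
    parserName = "%sDialectizer" % dialectName.capitalize()
    # Look the class up by name in this module's namespace.
    parserClass = globals()[parserName]
    # Call the class object to get an instance.
    return parserClass()

chef = make_parser()
foo = make_parser("FOO")      # capitalize() also lowercases the rest

try:
    make_parser("klingon")    # no KlingonDialectizer is defined here
except KeyError:
    unknown_dialect_failed = True
```

An unknown dialect name surfaces as a KeyError from the dictionary lookup, which is worth catching if the name comes from outside the program.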

Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there’s no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you’ve defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing ‘foo’ as the dialectName.

Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You’ve already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.

Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page.

Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven’t seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name ‘foo’ would find the module foodialect.py, import the class FooDialectizer, and away you go.

Example 8.22. The translate function, part 3

parser.feed(htmlSource) (1)
parser.close()          (2)
return parser.output()  (3)
  1. After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation. You had the entire HTML source in a single string, so you only had to call feed once. However, you can call feed as often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same.
  2. Because feed maintains an internal buffer, you should always call the parser’s close method when you’re done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing the last few bytes.
  3. Remember, output is the function you defined on BaseHTMLProcessor that joins all the pieces of output you’ve buffered and returns them in a single string.
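The claim that feeding in pieces gives the same result as feeding all at once is easy to check. Since sgmllib is not available on Python 3, this sketch uses html.parser.HTMLParser, which has the same feed and close methods; the TextCollector class, the two helper functions, and the chunk size of 8 characters are all made up for the illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def output(self):
        return "".join(self.pieces)

def parse_at_once(source):
    parser = TextCollector()
    parser.feed(source)
    parser.close()
    return parser.output()

def parse_in_chunks(source, chunk_size=8):
    parser = TextCollector()
    # Feed a few characters at a time; tags that are split across
    # chunks stay in the parser's buffer until they are complete.
    for start in range(0, len(source), chunk_size):
        parser.feed(source[start:start + chunk_size])
    parser.close()   # flush whatever is still buffered
    return parser.output()

html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"
whole = parse_at_once(html)
chunked = parse_in_chunks(html)
```

In a real program the chunks would come from repeated sock.read(size) calls rather than from slicing a string you already have in memory.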

And just like that, you’ve “translated” a web page, given nothing but a URL and the name of a dialect.

Further reading

  • You thought I was kidding about the server-side scripting idea. So did I, until I found this web-based dialectizer (http://rinkworks.com/dialect/). Unfortunately, source code does not appear to be available.

8.10. Summary

Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways:

  • parsing the HTML looking for something specific
  • aggregating the results, like the URL lister
  • altering the structure along the way, like the attribute quoter
  • transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer

Along with these examples, you should be comfortable doing all of the following things:

  • Using locals() and globals() to access namespaces
  • Formatting strings using dictionary-based substitutions

[1] The technical term for a parser like SGMLParser is a consumer: it consumes HTML and breaks it down. Presumably, the name feed was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there’s just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that’s just your mind playing tricks on you, and the only way you can tell that the whole thing isn’t just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that’s just me. In any event, it’s an interesting mental image.

[2] The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list just adds the element and updates the index. Since strings cannot be changed after they are created, code like s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n²).

[3] I don’t get out much.

[4] All right, it’s not that common a question. It’s not up there with “What editor should I use to write Python code?” (answer: Emacs) or “Is Python better or worse than Perl?” (answer: “Perl is worse than Python because people wanted it worse.” -Larry Wall, 10/14/1998) But questions about HTML processing pop up in one form or another about once a month, and among those questions, this is a popular one.